* [PATCH v1] perf evlist: Force adding default events only to core PMUs
@ 2024-05-25 15:29 Ian Rogers
2024-05-25 16:43 ` Linus Torvalds
0 siblings, 1 reply; 34+ messages in thread
From: Ian Rogers @ 2024-05-25 15:29 UTC (permalink / raw)
To: Linus Torvalds, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Kan Liang, James Clark, Dominique Martinet, linux-perf-users,
linux-kernel
PMUs other than core PMUs may have a 'cycles' event. Default opening a
non-core cycles event with perf record can lead to perf_event_open
failure. Avoid this by only opening the default 'cycles' event on core
PMUs.
Closes: https://lore.kernel.org/lkml/CAHk-=wiWvtFyedDNpoV7a8Fq_FpbB+F5KmWK2xPY3QoYseOf_A@mail.gmail.com/
Fixes: 617824a7f0f7 ("perf parse-events: Prefer sysfs/JSON hardware events over legacy")
Signed-off-by: Ian Rogers <irogers@google.com>
---
tools/perf/builtin-record.c | 6 ++----
tools/perf/builtin-top.c | 3 +--
tools/perf/util/evlist.c | 43 ++++++++++++++++++++++++++++++++++---
tools/perf/util/evlist.h | 1 +
4 files changed, 44 insertions(+), 9 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 66a3de8ac661..b968c3c2def6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3198,7 +3198,7 @@ static int switch_output_setup(struct record *rec)
unsigned long val;
/*
- * If we're using --switch-output-events, then we imply its
+ * If we're using --switch-output-events, then we imply its
* --switch-output=signal, as we'll send a SIGUSR2 from the side band
* thread to its parent.
*/
@@ -4154,9 +4154,7 @@ int cmd_record(int argc, const char **argv)
record.opts.tail_synthesize = true;
if (rec->evlist->core.nr_entries == 0) {
- bool can_profile_kernel = perf_event_paranoid_check(1);
-
- err = parse_event(rec->evlist, can_profile_kernel ? "cycles:P" : "cycles:Pu");
+ err = evlist__add_default_events(rec->evlist);
if (err)
goto out;
}
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 1d6aef51c122..90b97fc24edb 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1665,8 +1665,7 @@ int cmd_top(int argc, const char **argv)
goto out_delete_evlist;
if (!top.evlist->core.nr_entries) {
- bool can_profile_kernel = perf_event_paranoid_check(1);
- int err = parse_event(top.evlist, can_profile_kernel ? "cycles:P" : "cycles:Pu");
+ int err = evlist__add_default_events(top.evlist);
if (err)
goto out_delete_evlist;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 3a719edafc7a..ddca50cb049f 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -32,6 +32,7 @@
#include "util/sample.h"
#include "util/bpf-filter.h"
#include "util/stat.h"
+#include "util/strbuf.h"
#include "util/util.h"
#include <signal.h>
#include <unistd.h>
@@ -93,14 +94,12 @@ struct evlist *evlist__new(void)
struct evlist *evlist__new_default(void)
{
struct evlist *evlist = evlist__new();
- bool can_profile_kernel;
int err;
if (!evlist)
return NULL;
- can_profile_kernel = perf_event_paranoid_check(1);
- err = parse_event(evlist, can_profile_kernel ? "cycles:P" : "cycles:Pu");
+ err = evlist__add_default_events(evlist);
if (err) {
evlist__delete(evlist);
return NULL;
@@ -187,6 +186,44 @@ void evlist__delete(struct evlist *evlist)
free(evlist);
}
+int evlist__add_default_events(struct evlist *evlist)
+{
+ struct perf_pmu *pmu = NULL;
+ bool can_profile_kernel = perf_event_paranoid_check(1);
+ struct strbuf events;
+ bool first = true;
+ int err;
+
+ err = strbuf_init(&events, /*hint=*/32);
+ if (err)
+ return err;
+
+ while ((pmu = perf_pmus__scan_core(pmu)) != NULL) {
+ if (!first) {
+ err = strbuf_addch(&events, ',');
+ if (err)
+ goto err_out;
+ }
+ err = strbuf_addstr(&events, pmu->name);
+ if (err < 0)
+ goto err_out;
+
+ if (can_profile_kernel)
+ err = strbuf_addstr(&events, "/cycles/P");
+ else
+ err = strbuf_addstr(&events, "/cycles/Pu");
+
+ if (err < 0)
+ goto err_out;
+
+ first = false;
+ }
+ err = parse_event(evlist, events.buf);
+err_out:
+ strbuf_release(&events);
+ return err;
+}
+
void evlist__add(struct evlist *evlist, struct evsel *entry)
{
perf_evlist__add(&evlist->core, &entry->core);
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index cb91dc9117a2..269db02f7b45 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -97,6 +97,7 @@ void evlist__init(struct evlist *evlist, struct perf_cpu_map *cpus,
void evlist__exit(struct evlist *evlist);
void evlist__delete(struct evlist *evlist);
+int evlist__add_default_events(struct evlist *evlist);
void evlist__add(struct evlist *evlist, struct evsel *entry);
void evlist__remove(struct evlist *evlist, struct evsel *evsel);
--
2.45.1.288.g0e0cd299f1-goog
^ permalink raw reply related [flat|nested] 34+ messages in thread* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-25 15:29 [PATCH v1] perf evlist: Force adding default events only to core PMUs Ian Rogers @ 2024-05-25 16:43 ` Linus Torvalds 2024-05-25 21:14 ` Linus Torvalds 0 siblings, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2024-05-25 16:43 UTC (permalink / raw) To: Ian Rogers Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Sat, 25 May 2024 at 08:30, Ian Rogers <irogers@google.com> wrote: > > PMUs other than core PMUs may have a 'cycles' event. Default opening a > non-core cycles event with perf record can lead to perf_event_open > failure. Avoid this by only opening the default 'cycles' event on core > PMUs. > > Closes: https://lore.kernel.org/lkml/CAHk-=wiWvtFyedDNpoV7a8Fq_FpbB+F5KmWK2xPY3QoYseOf_A@mail.gmail.com/ > Fixes: 617824a7f0f7 ("perf parse-events: Prefer sysfs/JSON hardware events over legacy") > Signed-off-by: Ian Rogers <irogers@google.com> This makes 'perf record' work for me again. Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Thanks, Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-25 16:43 ` Linus Torvalds @ 2024-05-25 21:14 ` Linus Torvalds 2024-05-27 10:58 ` Leo Yan 0 siblings, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2024-05-25 21:14 UTC (permalink / raw) To: Ian Rogers Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > This makes 'perf record' work for me again. Oh, wait, no it doesn't. It makes just the plain "perf record" without any arguments work, which was what I was testing because I was lazy. So now $ perf record sleep 1 works fine. But $ perf record -e cycles:pp sleep 1 is still completely broken (with or without ":p" and ":pp"). So no. That still needs to be fixed, or the whole "prefer sysfs/JSON by default" needs to be reverted. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-25 21:14 ` Linus Torvalds @ 2024-05-27 10:58 ` Leo Yan 2024-05-28 5:36 ` Ian Rogers 0 siblings, 1 reply; 34+ messages in thread From: Leo Yan @ 2024-05-27 10:58 UTC (permalink / raw) To: Linus Torvalds Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: > On Sat, 25 May 2024 at 09:43, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > This makes 'perf record' work for me again. > > Oh, wait, no it doesn't. > > It makes just the plain "perf record" without any arguments work, > which was what I was testing because I was lazy. > > So now > > $ perf record sleep 1 > > works fine. But > > $ perf record -e cycles:pp sleep 1 > > is still completely broken (with or without ":p" and ":pp"). Seems to me that this patch fails to check if a PMU is a core-attached PMU that can support common hardware events. Therefore, we should consider adding the following check. diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 30f958069076..bc1822c2f3e3 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, while ((pmu = perf_pmus__scan(pmu)) != NULL) { bool auto_merge_stats; + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) + continue; + if (parse_events__filter_pmu(parse_state, pmu)) continue; To be clear, I only compiled this change but I have no chance to test it. @Ian, could you confirm this? Thanks, Leo > So no. That still needs to be fixed, or the whole "prefer sysfs/JSON > by default" needs to be reverted. ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-27 10:58 ` Leo Yan @ 2024-05-28 5:36 ` Ian Rogers 2024-05-28 17:00 ` Linus Torvalds 2024-05-28 19:44 ` Arnaldo Carvalho de Melo 0 siblings, 2 replies; 34+ messages in thread From: Ian Rogers @ 2024-05-28 5:36 UTC (permalink / raw) To: Leo Yan Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: > > On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: > > On Sat, 25 May 2024 at 09:43, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > This makes 'perf record' work for me again. > > > > Oh, wait, no it doesn't. > > > > It makes just the plain "perf record" without any arguments work, > > which was what I was testing because I was lazy. > > > > So now > > > > $ perf record sleep 1 > > > > works fine. But > > > > $ perf record -e cycles:pp sleep 1 > > > > is still completely broken (with or without ":p" and ":pp"). > > Seems to me that this patch fails to check if a PMU is a core-attached > PMU that can support common hardware events. Therefore, we should > consider adding the following check. > > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c > index 30f958069076..bc1822c2f3e3 100644 > --- a/tools/perf/util/parse-events.c > +++ b/tools/perf/util/parse-events.c > @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, > while ((pmu = perf_pmus__scan(pmu)) != NULL) { > bool auto_merge_stats; > > + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) > + continue; > + > if (parse_events__filter_pmu(parse_state, pmu)) > continue; > > To be clear, I only compiled this change but I have no chance to test > it. @Ian, could you confirm this? Hi Leo, so the code is working as intended. I believe it also agrees with what Arnaldo thinks. If you do: $ perf stat -e cycles ... and you have /sys/devices/pmu1/events/cycles /sys/devices/pmu2/events/cycles The output of perf stat should contain counts for pmu1 and pmu2. Were the event 'data_read' or 'inst_retired.any' we wouldn't be having the question as this has always been perf's behavior - changing that behavior would be a regression and why we can't just limit wildcard searches to core PMUs. The issue is about legacy events. There are more potential legacy names than those in 'enum perf_hw_id' as there are all the hardware cache events. So ignoring "hw_config != PERF_COUNT_HW_MAX" when adding an event to >1 PMU is only a partial solution, but I (and I think others agree) don't think legacy names should be special in this way. If you ask for an event and multiple PMUs advertise it then the behavior should be to count the event. Linus' trouble stems from cycles being used as a default value without restricting it to core PMUs. This patch solves that. There is then the issue that if an event that doesn't support sampling is wild card matched by `perf record` then it should be ignored. I'm in less agreement here, such events failing could be regarded as a feature. The workaround is to add the PMU name when specifying the event. If we do allow such events to be skipped, should the skipping be accompanied by a warning? Presumably if no events can be opened then things should terminate. Restricting everything to the core PMU I think is short-sighted as knowing what's happening on accelerators is increasingly important. For the sake of fixing TMA metrics and things like default attributes for ARM I added a notion that evsels in perf stat can be skippable. The plumbing for this with perf record is more tricky as the logic has half migrated into libperf - parsing and skippable are in perf, the event opening and mmap logic is all in libperf, and we need to add ignoring skipped evsels when adding the events to the mmap-ed ring buffer. I have a patch that's WIP for this, but I also think we could also agree that when >1 PMU advertises an event, perf's behavior when matching should be to open all such events. You avoid this by specifying a PMU name. Thanks, Ian > Thanks, > Leo > > > > So no. That still needs to be fixed, or the whole "prefer sysfs/JSON > > by default" needs to be reverted. > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 5:36 ` Ian Rogers @ 2024-05-28 17:00 ` Linus Torvalds 2024-05-28 17:39 ` Ian Rogers 2024-05-28 19:44 ` Arnaldo Carvalho de Melo 1 sibling, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2024-05-28 17:00 UTC (permalink / raw) To: Ian Rogers Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Mon, 27 May 2024 at 22:37, Ian Rogers <irogers@google.com> wrote: > > If you do: > > $ perf stat -e cycles ... You always talk about "perf stat", because you want to ignore a big part of the issueL > The issue is about legacy events. No. The issue isn't "perf stat". That works. Even with the broken version. > I have a patch that's WIP for this, but I also think we could > also agree that when >1 PMU advertises an event, perf's behavior when > matching should be to open all such events. You avoid this by > specifying a PMU name. Christ. You're still ignoring the elephant in the room. Stop using "perf stat" as an example. It's bogus. You're ignoring the issue. Lookie here: $ perf stat make ... Performance counter stats for 'make': 5,262.78 msec task-clock # 1.195 CPUs utilized 46,891 context-switches # 8.910 K/sec 6 cpu-migrations # 1.140 /sec 198,402 page-faults # 37.699 K/sec 10,238,763,227 cycles # 1.946 GHz 16,280,644,955 instructions # 1.59 insn per cycle 3,252,743,314 branches # 618.066 M/sec 83,702,386 branch-misses # 2.57% of all branches 4.405792739 seconds time elapsed 4.287784000 seconds user 0.921132000 seconds sys but then $ perf record make Error: cycles:P: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat' because that broken thing (a) picked a different cycles than before and (b) your argument that it should pick both IS WRONG BECAUSE ONE OF THEM DOESN'T EVEN WORK. Why is this so hard to just accept? Why do you keep mis-stating the problem? How hard is it to realize that I DO NOT WANT "perf stat"? The perf error message is bogus crap. If I ask for a "perf record", it shouldn't pick the wrong PMU that can't do it, and then say "do you want to do something else"? I don't care a whit for "legacy events". I care about the fact that you changed the code to pick the WRONG event, and then you are blaming anything but that. If perf would go "Oh, this one doesn't support 'record', let's see if another one does", it would all be good. If perf would go "Oh, let's prioritize core events", it would all be good. But no. Your patch actively picked a bad event, and then you try to blame some "legacy" thing. Yes, the legacy thing picked the right event, but it's not even about legacy. You could have picked the right event any number of other ways. It's about "it damn well worked when you didn't go out of your way to pick the wrong event". In other words, this isn't about "legacy" and "new". This is about "right" and "wrong". The old code picked right - for whatever reasons. The new code picked wrong - also for whatever reasons. Don't try to make it be anything else. Just admit that the new code picked the wrong PMU, instead of trying to make excuses for it. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 17:00 ` Linus Torvalds @ 2024-05-28 17:39 ` Ian Rogers 2024-05-28 18:12 ` Linus Torvalds 0 siblings, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-28 17:39 UTC (permalink / raw) To: Linus Torvalds Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, May 28, 2024 at 10:01 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 27 May 2024 at 22:37, Ian Rogers <irogers@google.com> wrote: > > > > If you do: > > > > $ perf stat -e cycles ... > > You always talk about "perf stat", because you want to ignore a big > part of the issueL > > > The issue is about legacy events. > > No. > > The issue isn't "perf stat". > > That works. Even with the broken version. > > > I have a patch that's WIP for this, but I also think we could > > also agree that when >1 PMU advertises an event, perf's behavior when > > matching should be to open all such events. You avoid this by > > specifying a PMU name. > > Christ. You're still ignoring the elephant in the room. > > Stop using "perf stat" as an example. It's bogus. You're ignoring the issue. > > Lookie here: > > $ perf stat make > ... > Performance counter stats for 'make': > > 5,262.78 msec task-clock # > 1.195 CPUs utilized > 46,891 context-switches # > 8.910 K/sec > 6 cpu-migrations # > 1.140 /sec > 198,402 page-faults # > 37.699 K/sec > 10,238,763,227 cycles # > 1.946 GHz > 16,280,644,955 instructions # 1.59 > insn per cycle > 3,252,743,314 branches # > 618.066 M/sec > 83,702,386 branch-misses # > 2.57% of all branches > > 4.405792739 seconds time elapsed > > 4.287784000 seconds user > 0.921132000 seconds sys > > but then > > $ perf record make > Error: > cycles:P: PMU Hardware doesn't support > sampling/overflow-interrupts. Try 'perf stat' > > because that broken thing (a) picked a different cycles than before > and (b) your argument that it should pick both IS WRONG BECAUSE ONE OF > THEM DOESN'T EVEN WORK. By default it picked "cycles:P" with this patch it now picks "<core pmu>/cycles/P". As your Tested-by attested this does work. > Why is this so hard to just accept? Why do you keep mis-stating the problem? I'm not. What your saying is that the arm_dsu_0 PMU advertising cycles doesn't work if you use it for perf record. I agree. This change fixed the issue. What to do if someone writes "perf record -e cycles" well there are two advertised cycles events isn't it reasonable perf record try to open them both? > How hard is it to realize that I DO NOT WANT "perf stat"? The perf > error message is bogus crap. If I ask for a "perf record", it > shouldn't pick the wrong PMU that can't do it, and then say "do you > want to do something else"? It's not about perf record having some kind of intelligence and picking a PMU. perf record either opens an event on the PMU you ask for it it wild cards across them all. For legacy events they used to be handled differently and I'm trying to fix this. In part so I can fix the PMU behavior on the Apple M1 and later CPUs that fail to implement legacy events properly in their PMU driver. You've said you used to use Apple CPUs for ARM testing, I'm trying to fix a problem that will help you. > I don't care a whit for "legacy events". I care about the fact that > you changed the code to pick the WRONG event, and then you are blaming > anything but that. Sure, I added "if user == linus then pick wrong PMU". The code was reviewed by IBM and Intel. ARM were on the CC list. The change baked on linux-next for a good long while. All of this points to my problem that I'm often fixing problems for ARM with a complete lack of testing/reviewing/acking... by them. > If perf would go "Oh, this one doesn't support 'record', let's see if > another one does", it would all be good. > > If perf would go "Oh, let's prioritize core events", it would all be good. But as I've pointed out it wouldn't because then you'd break the behavior of doing things like gathering memory bandwidth from uncore PMUs. An example: ``` $ sudo perf stat -e data_read -a sleep 1 Performance counter stats for 'system wide': 4,447.10 MiB data_read 1.001748581 seconds time elapsed ``` By making the wildcard only work for core PMUs the best case is you'd make this: ``` $ sudo perf stat -e 'imc_free_running/data_read/' -a sleep 1 Performance counter stats for 'system wide': 4,454.56 MiB imc_free_running/data_read/ 1.001865448 seconds time elapsed ``` But that wouldn't work again as ARM decided to mess up the naming convention, my unmerged fix for that is here: https://lore.kernel.org/lkml/20240515060114.3268149-1-irogers@google.com/ it says Marvell but Marvell followed ARM's lead. > But no. Your patch actively picked a bad event, and then you try to > blame some "legacy" thing. > > Yes, the legacy thing picked the right event, but it's not even about > legacy. You could have picked the right event any number of other > ways. > > It's about "it damn well worked when you didn't go out of your way to > pick the wrong event". > > In other words, this isn't about "legacy" and "new". I'm not clear what you think is "new". All the events are being picked in your case from sysfs, the way this has all worked is years if not decades old. What is new is that because of an event name the behavior should be uniform, motivated initially by fixing your other ARM test platform of Apple. > This is about "right" and "wrong". The old code picked right - for > whatever reasons. The new code picked wrong - also for whatever > reasons. > > Don't try to make it be anything else. Just admit that the new code > picked the wrong PMU, instead of trying to make excuses for it. I agree it picked the wrong PMU for default events. This was a problem on no systems that anybody was bothering to test with. Having been made aware of the issue I fixed it in this patch, you're welcome. What is still not clear from this is what should the behavior be of: $ perf record -e cycles ... Should it wildcard all events and open them on all PMUs potentially failing? Well this has always been perf's behavior were the event: $ perf record -e inst_retired.any ... where inst_retired.any could be an event advertised on an accelerator or device where sampling doesn't work. If inst_retired.any doesn't work for you as an example, pick another event that does. A GPU has instructions and cycles so the likelihood of naming conflicts is high. We can make perf record ignore opening events on PMUs that don't support sampling, it's an invasive and non-trivial change not suited to landing in 6.10. It is also a behavior change, see this thread for how popular those are. So 6.10 is now in a mess. We likely fail tests, reverting this change has a bunch of consequences and presumably I'm expected to dig through, figure those out and then provide fixes. Thanks! For 6.11 I currently suggest we revert the revert and apply this patch. This would also I think be the best thing to do for 6.10. I appreciate my opinions are worth much less than others. I don't see why the priority should be to fix things on an ARM system that nobody is actively testing on rather than say Apple devices fixed by the reverted change, RISC-V system, etc. Anyway... Thanks, Ian > Linus > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 17:39 ` Ian Rogers @ 2024-05-28 18:12 ` Linus Torvalds 2024-05-28 18:58 ` Ian Rogers 0 siblings, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2024-05-28 18:12 UTC (permalink / raw) To: Ian Rogers Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, 28 May 2024 at 10:40, Ian Rogers <irogers@google.com> wrote: > > I agree it picked the wrong PMU for default events. This was a problem > on no systems that anybody was bothering to test with. Having been > made aware of the issue I fixed it in this patch, you're welcome. You didn't just pick it for default events. You also picked it for when the user explicitly asks for "profile for cycles" > What is still not clear from this is what should the behavior be of: > > $ perf record -e cycles ... Why do you claim that? I've already told you that CLEARLY it's wrong to pick a cycles event that doesn't support 'record'. I've also suggested that you might look at core only PMUs. But more importantly, you should look at documented and historical behavior. So what is your argument? Because from where I'm sitting, you keep making irrelevant arguments about *other* events, not about "cycles". It used to work. It doesn't any more. > Should it wildcard all events and open them on all PMUs potentially > failing? Well this has always been perf's behavior were the event: > > $ perf record -e inst_retired.any ... You keep making up irrelevant arguments. Lookie here: I do "perf list" to just see the events, and what do I get? Let me quote that for you: List of pre-defined events (to be used in -e or -M): ... cpu-cycles OR cycles [Hardware event] and then later on in the list I get general: cpu_cycles [Cycle. Unit: armv8_pmuv3_0] and dammit, your patch broke the DOCUMENTED way to get the most obvious profiling data: cycles. So stop making shit up. All your arguments have been bogus garbage that have been talking about entirely different things than the one thing I reported was broken. And you *keep* doing that. Days into this, you keep making shit up that isn't about this very simple thing. Every single time I tell you what the problem is, you try to twist to be about something entirely different. Either a different 'perf' command entirely, or about a different event that is ENTIRELY IRRELEVANT. What the hell is your problem? Why can't you just admit that you f*cked up, and fix the thing I told you was broken, and that is very clearly broken and there is no "what about" issues AT ALL. So stop the idiocy already. Face the actual issue. Don't make up other things. Dammit, if I wanted "arm_dsu_58/cycles/", I would SAY so. I didn't. I said "cycles", which is the thing that has always worked across architectures, that is DOCUMENTED to be the same as "cpu-cycles", and that used to work just fine. It's literally RIGHT THERE in "perf list". Using "-e cycles" is literally also what the man-pages say. This is not me doing something odd. And yes, I use an explicit "-e cycles:pp" because the default is not that. and "cycles:pp" does better than the default. Again, this is all documented, with "man perf-record" literally talking about "-e cycles" and then pointing to "man perf-list" for the modifier details, which detail that 'pp' as meaning "use maximum detected precise level". Which is exactly what I want (even if on this machine, it turns out that "p" and "pp" are the same because the armv8 pmuv3 doesn't have that "correct for instruction drift" that Intel does on x86). Why is this simple thing so hard for you to understand? The fact is, if you make "cycles" mean ANYTHING ELSE than the long-documented actual obvious thing, you have broken perf. It's that simple. So stop the excuses already. Stop making up other stuff that isn't relevant. Stop bringing up events or PMU's that are simply not the issue. Face your bug head on, instead of making me have to tell you the same thing over and over and over again. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 18:12 ` Linus Torvalds @ 2024-05-28 18:58 ` Ian Rogers 2024-05-28 19:42 ` Linus Torvalds 0 siblings, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-28 18:58 UTC (permalink / raw) To: Linus Torvalds Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, May 28, 2024 at 11:13 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, 28 May 2024 at 10:40, Ian Rogers <irogers@google.com> wrote: > > > > I agree it picked the wrong PMU for default events. This was a problem > > on no systems that anybody was bothering to test with. Having been > > made aware of the issue I fixed it in this patch, you're welcome. > > You didn't just pick it for default events. You also picked it for > when the user explicitly asks for "profile for cycles" > > > What is still not clear from this is what should the behavior be of: > > > > $ perf record -e cycles ... > > Why do you claim that? > > I've already told you that CLEARLY it's wrong to pick a cycles event > that doesn't support 'record'. > > I've also suggested that you might look at core only PMUs. > > But more importantly, you should look at documented and historical behavior. > > So what is your argument? Because from where I'm sitting, you keep > making irrelevant arguments about *other* events, not about "cycles". > > It used to work. It doesn't any more. > > > Should it wildcard all events and open them on all PMUs potentially > > failing? Well this has always been perf's behavior were the event: > > > > $ perf record -e inst_retired.any ... > > You keep making up irrelevant arguments. > > Lookie here: I do "perf list" to just see the events, and what do I > get? Let me quote that for you: > > List of pre-defined events (to be used in -e or -M): > ... > cpu-cycles OR cycles [Hardware event] > > and then later on in the list I get > > general: > cpu_cycles > [Cycle. Unit: armv8_pmuv3_0] > > and dammit, your patch broke the DOCUMENTED way to get the most > obvious profiling data: cycles. > > So stop making shit up. All your arguments have been bogus garbage > that have been talking about entirely different things than the one > thing I reported was broken. > > And you *keep* doing that. Days into this, you keep making shit up > that isn't about this very simple thing. > > Every single time I tell you what the problem is, you try to twist to > be about something entirely different. Either a different 'perf' > command entirely, or about a different event that is ENTIRELY > IRRELEVANT. > > What the hell is your problem? Why can't you just admit that you > f*cked up, and fix the thing I told you was broken, and that is very > clearly broken and there is no "what about" issues AT ALL. > > So stop the idiocy already. Face the actual issue. Don't make up other things. > > Dammit, if I wanted "arm_dsu_58/cycles/", I would SAY so. I didn't. I > said "cycles", which is the thing that has always worked across > architectures, that is DOCUMENTED to be the same as "cpu-cycles", and > that used to work just fine. > > It's literally RIGHT THERE in "perf list". Using "-e cycles" is > literally also what the man-pages say. This is not me doing something > odd. But nobody else ever reported the issue, even ARM who maintain the PMU driver whose event name conflicts. This hasn't been a problem for anybody else. > And yes, I use an explicit "-e cycles:pp" because the default is not > that. and "cycles:pp" does better than the default. This is wrong. cycles:P means use the maximum precision, so if the PMU supports it it will use cycles:ppp and on x86 this is often the case. The number of 'p's used after the event with :P is read from sysfs: ``` $ cat /sys/devices/cpu/caps/max_precise 3 ``` The default using cycles:P is intentional and better than cycles:pp. If `/sys/devices/cpu/caps/max_precise` is wrong for your PMU driver then that should get fixed. > Again, this is all documented, with "man perf-record" literally > talking about "-e cycles" and then pointing to "man perf-list" for the > modifier details, which detail that 'pp' as meaning "use maximum > detected precise level". Which is exactly what I want (even if on this > machine, it turns out that "p" and "pp" are the same because the armv8 > pmuv3 doesn't have that "correct for instruction drift" that Intel > does on x86). > > Why is this simple thing so hard for you to understand? > > The fact is, if you make "cycles" mean ANYTHING ELSE than the > long-documented actual obvious thing, you have broken perf. It's that > simple. > > So stop the excuses already. Stop making up other stuff that isn't > relevant. Stop bringing up events or PMU's that are simply not the > issue. > > Face your bug head on, instead of making me have to tell you the same > thing over and over and over again. I keep saying the same thing as I don't agree with you, you have broken perf in 6.10 and are presumably looking to me to pick up the pieces. To work around the naming conflict on systems with arm_dsu_*/events/cycles/ I don't think it is unreasonable to specify armv8_pmuv3_0/cycles/, the blame for this lies with ARM's event name within the arm_dsu_* PMU driver, admittedly they didn't know this would be an issue given perf's non-uniform handling of legacy events. When ARM requested that legacy events be a lower priority than sysfs/json then the ball was set in motion for this problem. This was done to work around an ARM PMU problem on Apple ARM CPUs. Let's say an Apple CPU has a PMU called armv8_pmuv3_0. If you try to program a legacy event on it then it will be broken - I lack a system to test this but I'm reliably informed (user bugs and by ARM) and heck it was a bunch of work to get this working if it was for nobody. To fix this perf must read event data for armv8_pmuv3_0/cycles/ from sysfs. If you don't specify a PMU then perf will try to program a legacy event. This is the behavior in Linux 6.9. Oh, but legacy events are broken on my Apple ARM CPU. The change made it so when you don't specify a PMU you will use the sysfs/json one first instead of the legacy one. Viola, your favorite `perf record -e cycles:pp ..` should be fixed on Apple ARM CPUs. So I think the revert is a real regression for a larger user base. There is a testing issue here, not least I don't possess an Apple ARM machine. All of these issues revolve around ARM and yet they do minimal to try to fix them, beside complain that I should fix the legacy/sysfs/json issue which is how we got here. There's of course the alternate universe view that I need to admit I'm wrong and I should stop going out of my way to break people. Hello alternate universe, I admit it, I'm wrong and a terrible person please accept my most sincere apologies. Please alternate universe, can you tell me how to wild card event names and make them work across PMUs without the behavior being specific to certain perf comands or event names (which I think is worse than having to specify which PMU you think the event should apply)? Thanks, Ian > Linus > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 18:58 ` Ian Rogers @ 2024-05-28 19:42 ` Linus Torvalds 2024-05-28 20:03 ` Ian Rogers 0 siblings, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2024-05-28 19:42 UTC (permalink / raw) To: Ian Rogers Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, 28 May 2024 at 11:59, Ian Rogers <irogers@google.com> wrote: > > But nobody else ever reported the issue, even ARM who maintain the PMU > driver whose event name conflicts. This hasn't been a problem for > anybody else. I'm not blaming you for having had a bug. I'm blaming you for NOT DEALING WITH THE BUG APPROPRIATELY. I reported the bug and bisected it four days ago. Taking some time to fix the bug is fine. But that's not what you've been doing. Since then, pretty much ALL you have done is argue about irrelevant thingas that weren't about the regression in question. The fact that you still don't agree, having broken documented behavior, and still argue against just having it fixed, I can't do anything about. > So I think the revert is a real regression for a larger user base. I didn't have much choice, did I? You refuse to even acknowledge the bug I hit. I'd have been happy if you had just fixed the bug. You didn't. You just argued. > There is a testing issue here, not least I don't possess an Apple ARM > machine. This is not an Apple ARM machine. I have one of those too, but this isn't it. It's an Ampere Computing system, based on an ARM Neoverse N1 (and the ARM PMU's both for the core and for the interconnect). But that is pretty much irrelevant by now. The issue is that you don't fix bugs you leave behind, forcing the revert. I'm happy to test any patches. But I'm done arguing. The "cycles" thing needs to work. This is not a "pretty please". This is a "if you can't understand that and acknowledge that without arguing, just work on something else, ok?" Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 19:42 ` Linus Torvalds @ 2024-05-28 20:03 ` Ian Rogers 2024-05-28 20:33 ` Linus Torvalds 0 siblings, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-28 20:03 UTC (permalink / raw) To: Linus Torvalds Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, May 28, 2024 at 12:45 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, 28 May 2024 at 11:59, Ian Rogers <irogers@google.com> wrote: > > > > But nobody else ever reported the issue, even ARM who maintain the PMU > > driver whose event name conflicts. This hasn't been a problem for > > anybody else. > > I'm not blaming you for having had a bug. > > I'm blaming you for NOT DEALING WITH THE BUG APPROPRIATELY. > > Taking some time to fix the bug is fine. > > But that's not what you've been doing. On LKML: Issue reported: 2024-05-25 1:31 ` Linus Torvalds [this message].. Fix posted: 2024-05-25 15:32 ` Ian Rogers [this message] The only thing I've tried to clear up is the ambiguity of when an event doesn't have a PMU what does it mean? Perf's metrics don't specify PMUs and have uncore events. We can't restrict non-PMU specifying events to just core events without rewriting them even if it best matches your mental model, perf has never worked this way. > Since then, pretty much ALL you have done is argue about irrelevant > thingas that weren't about the regression in question. > > The fact that you still don't agree, having broken documented > behavior, and still argue against just having it fixed, I can't do > anything about. > > > So I think the revert is a real regression for a larger user base. > > I didn't have much choice, did I? You refuse to even acknowledge the > bug I hit. I'd have been happy if you had just fixed the bug. You > didn't. You just argued. > > > There is a testing issue here, not least I don't possess an Apple ARM > > machine. > > This is not an Apple ARM machine. I have one of those too, but this > isn't it. It's an Ampere Computing system, based on an ARM Neoverse N1 > (and the ARM PMU's both for the core and for the interconnect). > > But that is pretty much irrelevant by now. > > The issue is that you don't fix bugs you leave behind, forcing the revert. But you've traded a fix for one set of users with a fix for another. I suspect the number of ARM neoverse N1 users of the PMU are small, not least as these devices tend to be in the cloud where PMU support is deliberately limited. As test expectations were for the patch applied, I think things are further regressed. I'm glad you're happy. Thanks, Ian > I'm happy to test any patches. But I'm done arguing. The "cycles" > thing needs to work. This is not a "pretty please". > > This is a "if you can't understand that and acknowledge that without > arguing, just work on something else, ok?" > > Linus > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 20:03 ` Ian Rogers @ 2024-05-28 20:33 ` Linus Torvalds 2024-05-28 21:37 ` Ian Rogers 0 siblings, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2024-05-28 20:33 UTC (permalink / raw) To: Ian Rogers Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, 28 May 2024 at 13:03, Ian Rogers <irogers@google.com> wrote: > > But you've traded a fix for one set of users with a fix for another. No. You don't seem to understand what "regression" means. The "no regressions" rule is that we do not introduce NEW bugs. It *literally* came about because we had an endless dance of "fix two bugs, introduce one new one", and that then resulted in a system that you cannot TRUST. We had years of this with suspend/resume in particular, where the argument was always exactly "this fixed many more systems" when something else broke. But because random other things kept breaking, it meant that people couldn't upgrade the kernel and feel safe about it. Your old laptop might no longer work, because somebody had deemed that all those *new* laptops were more important. So I introduced that "no regressions" rule something like two decades ago, because people need to be able to update their kernel without fear of something they relied on suddenly stopping to work. The fact that the "suddenly stopped working" may be a minority DOES NOT MATTER ONE WHIT. Stability matters. It's MUCH MUCH better to have legacy bad behavior that you never dealt with correctly, than to introduce *new* bugs that hit something that used to work. So something that "perf" has never done correctly is simply not an issue. You deal with that by saying "that has never worked properly before either". You might even document it along with (hopefully) possible workarounds. The whole "one step forward, two steps back" is absolutely fine if you are line dancing. But we're not line dancing. We take it slow and steady, and if you can't fix something without breaking something else, then that thing simply does not get fixed. And there are always exceptions. Sometimes something may need to be broken because it's an acute security issue. And if it takes a year for somebody to find a regression, and in the meantime others have started relying on the new behavior, then at some point it's a "yes, that's a regression, but it wasn't reported in a timely enough manner". And sometimes the use-case is basically a museum machine and we retire support for it because such a machine should use museum software too. So it's not like it's some long-term absolute guarantee. But it absolutely *is* the #1 rule in the kernel. A fix that breaks something else is not a fix at all. And in this case, when the regression was noted within days, and within the merge window, there's simply no discussion about it. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 20:33 ` Linus Torvalds @ 2024-05-28 21:37 ` Ian Rogers 2024-05-28 21:42 ` Linus Torvalds 0 siblings, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-28 21:37 UTC (permalink / raw) To: Linus Torvalds Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, May 28, 2024 at 1:33 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, 28 May 2024 at 13:03, Ian Rogers <irogers@google.com> wrote: > > > > But you've traded a fix for one set of users with a fix for another. > > No. > > You don't seem to understand what "regression" means. > > The "no regressions" rule is that we do not introduce NEW bugs. > > It *literally* came about because we had an endless dance of "fix two > bugs, introduce one new one", and that then resulted in a system that > you cannot TRUST. > > We had years of this with suspend/resume in particular, where the > argument was always exactly "this fixed many more systems" when > something else broke. > > But because random other things kept breaking, it meant that people > couldn't upgrade the kernel and feel safe about it. > > Your old laptop might no longer work, because somebody had deemed that > all those *new* laptops were more important. > > So I introduced that "no regressions" rule something like two decades > ago, because people need to be able to update their kernel without > fear of something they relied on suddenly stopping to work. > > The fact that the "suddenly stopped working" may be a minority DOES > NOT MATTER ONE WHIT. > > Stability matters. I think stability in the context of this problem is an abstract term especially given the low amount of ARM support for PMUs. Does it matter more that between Linux 6.9 and 6.10 `perf record -e cycles:pp ..` which is objectively worse than just `perf record ..` (which is fixed by the patch) work on a Neoverse N1, or someone who is running Linux on an ARM Apple machine is able to run `perf record -e cycles:pp ..` without triggering a PMU driver bug? Which user is making the strange request and whose stability matters more? A phrase I think I coined was the Beyonce principle, she said in a song, "if you like it, then you should've put a ring on it." The principle in software is that if you want something then you better put a test on it - this made it into Titus Winter's book unattributed so maybe it wasn't me :-). I've been helping drive testing in the perf tool. If we don't follow the Beyonce principle we're held hostage to Hyrum's law, "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody." Something we could do is rename the ARM dsu PMU's event from cycles to clockticks (as is conventional in existing PMUs) and then have a test to enforce that no PMU has events named cycles. Making this a driver bug doesn't satisfy the desire to shout at the person whose change exposed the latent issue, but maybe that's the right fix. > It's MUCH MUCH better to have legacy bad behavior that you never dealt > with correctly, than to introduce *new* bugs that hit something that > used to work. > > So something that "perf" has never done correctly is simply not an issue. > > You deal with that by saying "that has never worked properly before either". > > You might even document it along with (hopefully) possible workarounds. > > The whole "one step forward, two steps back" is absolutely fine if you > are line dancing. > > But we're not line dancing. > > We take it slow and steady, and if you can't fix something without > breaking something else, then that thing simply does not get fixed. > > And there are always exceptions. Sometimes something may need to be > broken because it's an acute security issue. > > And if it takes a year for somebody to find a regression, and in the > meantime others have started relying on the new behavior, then at some > point it's a "yes, that's a regression, but it wasn't reported in a > timely enough manner". > > And sometimes the use-case is basically a museum machine and we retire > support for it because such a machine should use museum software too. > > So it's not like it's some long-term absolute guarantee. But it > absolutely *is* the #1 rule in the kernel. > > A fix that breaks something else is not a fix at all. > > And in this case, when the regression was noted within days, and > within the merge window, there's simply no discussion about it. We moved development to being in perf-next automatically merging to linux-next because (as I understand it) there was a complaint of new possibly untested features coming in during the merge window. If IBM can test s390 on linux-next and post about breakages in the perf tool, why is this beyond the people who support and use Neoverse N1s? Why use linux-next if nobody is testing there? Thanks, Ian > Linus > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 21:37 ` Ian Rogers @ 2024-05-28 21:42 ` Linus Torvalds 0 siblings, 0 replies; 34+ messages in thread From: Linus Torvalds @ 2024-05-28 21:42 UTC (permalink / raw) To: Ian Rogers Cc: Leo Yan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, 28 May 2024 at 14:37, Ian Rogers <irogers@google.com> wrote: > > I think stability in the context of this problem is an abstract term > especially given the low amount of ARM support for PMUs. Just stop arguing. The 'cycles' thing is documented. It's CP:U cycles. This is just a *fact*. I have now blocked you. You will get no more replies from me, because it's not worth my time. I simply have better things to do that deal with people who cannot accept documented and existing behavior, and spend days arguing that their broken fix was a "fix". Just go play in any of the other open source projects that (sadly) don't have regression rules. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 5:36 ` Ian Rogers 2024-05-28 17:00 ` Linus Torvalds @ 2024-05-28 19:44 ` Arnaldo Carvalho de Melo 2024-05-28 19:51 ` Ian Rogers 2024-05-28 20:00 ` Linus Torvalds 1 sibling, 2 replies; 34+ messages in thread From: Arnaldo Carvalho de Melo @ 2024-05-28 19:44 UTC (permalink / raw) To: Ian Rogers Cc: Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote: > On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: > > On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: > > > On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > This makes 'perf record' work for me again. > > > Oh, wait, no it doesn't. > > > It makes just the plain "perf record" without any arguments work, > > > which was what I was testing because I was lazy. > > > So now > > > $ perf record sleep 1 > > > works fine. But > > > $ perf record -e cycles:pp sleep 1 > > > is still completely broken (with or without ":p" and ":pp"). > > Seems to me that this patch fails to check if a PMU is a core-attached > > PMU that can support common hardware events. Therefore, we should > > consider adding the following check. > > +++ b/tools/perf/util/parse-events.c > > @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, > > while ((pmu = perf_pmus__scan(pmu)) != NULL) { > > bool auto_merge_stats; > > > > + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) > > + continue; > > + > > if (parse_events__filter_pmu(parse_state, pmu)) > > continue; > > To be clear, I only compiled this change but I have no chance to test > > it. @Ian, could you confirm this? > Hi Leo, > so the code is working as intended. I believe it also agrees with what > Arnaldo thinks. > If you do: > $ perf stat -e cycles ... > and you have > /sys/devices/pmu1/events/cycles > /sys/devices/pmu2/events/cycles > The output of perf stat should contain counts for pmu1 and pmu2. Were > the event 'data_read' or 'inst_retired.any' we wouldn't be having the Sure, what is being asked is to count events and if those two events in those two PMUs can count, then do what the user asked. For 'perf record' we're asking for sampling, if the event has the name specified and can't be sampled, skip it, warn the user and even so only if verbose mode is asked, something like: root@x1:~# perf record -e cycles -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] root@x1:~# perf evlist cpu_atom/cycles/ cpu_core/cycles/ dummy:u root@x1:~# Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom', another in a 'cpu_core' one, both can be sampled, my workload may run/use resources on then, I'm interested, sample both. But if we had some other PMU, to use a name Jiri uses in tests/fake PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/' and for some reason it doesn't support sampling, skip it, then the result should be the same as above. If the user finds it strange after looking at sysfs that 'krava/cycles/' isn't being sampled, the usual workflow is to ask perf for more verbosity, using -v (or multiple 'v' letters to get increasing levels of verbosity), in which case the user would see: root@x1:~# perf record -v -e cycles -a sleep 1 WARNING: skipping 'krava/cycles/' event, it doesn't support sampling. [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] root@x1:~# perf evlist - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 19:44 ` Arnaldo Carvalho de Melo @ 2024-05-28 19:51 ` Ian Rogers 2024-05-29 14:50 ` James Clark 2024-05-28 20:00 ` Linus Torvalds 1 sibling, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-28 19:51 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote: > > On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: > > > On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: > > > > On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > This makes 'perf record' work for me again. > > > > > Oh, wait, no it doesn't. > > > > > It makes just the plain "perf record" without any arguments work, > > > > which was what I was testing because I was lazy. > > > > > So now > > > > > $ perf record sleep 1 > > > > > works fine. But > > > > > $ perf record -e cycles:pp sleep 1 > > > > > is still completely broken (with or without ":p" and ":pp"). > > > > Seems to me that this patch fails to check if a PMU is a core-attached > > > PMU that can support common hardware events. Therefore, we should > > > consider adding the following check. > > > > +++ b/tools/perf/util/parse-events.c > > > @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, > > > while ((pmu = perf_pmus__scan(pmu)) != NULL) { > > > bool auto_merge_stats; > > > > > > + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) > > > + continue; > > > + > > > if (parse_events__filter_pmu(parse_state, pmu)) > > > continue; > > > > To be clear, I only compiled this change but I have no chance to test > > > it. @Ian, could you confirm this? > > > Hi Leo, > > > so the code is working as intended. I believe it also agrees with what > > Arnaldo thinks. > > > If you do: > > > $ perf stat -e cycles ... > > > and you have > > > /sys/devices/pmu1/events/cycles > > /sys/devices/pmu2/events/cycles > > > The output of perf stat should contain counts for pmu1 and pmu2. Were > > the event 'data_read' or 'inst_retired.any' we wouldn't be having the > > Sure, what is being asked is to count events and if those two events in > those two PMUs can count, then do what the user asked. > > For 'perf record' we're asking for sampling, if the event has the name > specified and can't be sampled, skip it, warn the user and even so > only if verbose mode is asked, something like: > > root@x1:~# perf record -e cycles -a sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] > root@x1:~# perf evlist > cpu_atom/cycles/ > cpu_core/cycles/ > dummy:u > root@x1:~# > > Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom', > another in a 'cpu_core' one, both can be sampled, my workload may > run/use resources on then, I'm interested, sample both. > > But if we had some other PMU, to use a name Jiri uses in tests/fake > PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/' > and for some reason it doesn't support sampling, skip it, then the > result should be the same as above. > > If the user finds it strange after looking at sysfs that 'krava/cycles/' > isn't being sampled, the usual workflow is to ask perf for more > verbosity, using -v (or multiple 'v' letters to get increasing levels of > verbosity), in which case the user would see: > > root@x1:~# perf record -v -e cycles -a sleep 1 > WARNING: skipping 'krava/cycles/' event, it doesn't support sampling. > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] > root@x1:~# perf evlist The problem here is that we're hiding a problem rather than reporting it. Typically we report the issue and more than that we ask the user to work around the issue. That would be analogous to wanting the user to specify what PMU they want the event to apply to, which has always been perf's behavior. Thanks, Ian > - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 19:51 ` Ian Rogers @ 2024-05-29 14:50 ` James Clark 2024-05-29 17:33 ` Ian Rogers 2024-05-29 18:44 ` Arnaldo Carvalho de Melo 0 siblings, 2 replies; 34+ messages in thread From: James Clark @ 2024-05-29 14:50 UTC (permalink / raw) To: Ian Rogers, Arnaldo Carvalho de Melo Cc: Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On 28/05/2024 20:51, Ian Rogers wrote: > On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo > <acme@kernel.org> wrote: >> >> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote: >>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: >>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: >>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >>>>>> This makes 'perf record' work for me again. >> >>>>> Oh, wait, no it doesn't. >> >>>>> It makes just the plain "perf record" without any arguments work, >>>>> which was what I was testing because I was lazy. >> >>>>> So now >> >>>>> $ perf record sleep 1 >> >>>>> works fine. But >> >>>>> $ perf record -e cycles:pp sleep 1 >> >>>>> is still completely broken (with or without ":p" and ":pp"). >> >>>> Seems to me that this patch fails to check if a PMU is a core-attached >>>> PMU that can support common hardware events. Therefore, we should >>>> consider adding the following check. >> >>>> +++ b/tools/perf/util/parse-events.c >>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, >>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) { >>>> bool auto_merge_stats; >>>> >>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) >>>> + continue; >>>> + >>>> if (parse_events__filter_pmu(parse_state, pmu)) >>>> continue; >> >>>> To be clear, I only compiled this change but I have no chance to test >>>> it. @Ian, could you confirm this? >> >>> Hi Leo, >> >>> so the code is working as intended. I believe it also agrees with what >>> Arnaldo thinks. >> >>> If you do: >> >>> $ perf stat -e cycles ... >> >>> and you have >> >>> /sys/devices/pmu1/events/cycles >>> /sys/devices/pmu2/events/cycles >> >>> The output of perf stat should contain counts for pmu1 and pmu2. Were >>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the >> >> Sure, what is being asked is to count events and if those two events in >> those two PMUs can count, then do what the user asked. >> >> For 'perf record' we're asking for sampling, if the event has the name >> specified and can't be sampled, skip it, warn the user and even so >> only if verbose mode is asked, something like: >> >> root@x1:~# perf record -e cycles -a sleep 1 >> [ perf record: Woken up 1 times to write data ] >> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] >> root@x1:~# perf evlist >> cpu_atom/cycles/ >> cpu_core/cycles/ >> dummy:u >> root@x1:~# >> >> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom', >> another in a 'cpu_core' one, both can be sampled, my workload may >> run/use resources on then, I'm interested, sample both. >> >> But if we had some other PMU, to use a name Jiri uses in tests/fake >> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/' >> and for some reason it doesn't support sampling, skip it, then the >> result should be the same as above. >> >> If the user finds it strange after looking at sysfs that 'krava/cycles/' >> isn't being sampled, the usual workflow is to ask perf for more >> verbosity, using -v (or multiple 'v' letters to get increasing levels of >> verbosity), in which case the user would see: >> >> root@x1:~# perf record -v -e cycles -a sleep 1 >> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling. >> [ perf record: Woken up 1 times to write data ] >> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] >> root@x1:~# perf evlist This makes sense to me. I like keeping the old apparent behavior unless -v is used and it will feel like the tool "just works". In the context of the commit summary "perf parse-events: Prefer sysfs/JSON hardware events over legacy": I don't follow why that should be "Prefer, even if it's an event that can't be opened, sysfs/JSON...". Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then use legacy". If all events can be opened, sure go and open them all. If only core events can be opened, do that too. If only uncore events can be opened... etc. I'm happy to help with the implementation or testing of that on my big.LITTLE system. FWIW the hybrid stuff with Perf stat is already working well for me on Arm since we added PERF_PMU_CAP_EXTENDED_HW_TYPE which Ian suggested. > > The problem here is that we're hiding a problem rather than reporting > it. Typically we report the issue and more than that we ask the user > to work around the issue. That would be analogous to wanting the user > to specify what PMU they want the event to apply to, which has always > been perf's behavior. > Is the problem you are referring to that there are multiple PMUs with 'cycles' events? Surely that's only a problem in the context of the new proposed behavior, otherwise it's not really a problem. It's just something that happens to exist. Because the user could always use the defaults (no argument) or -e cycles and historically Perf correctly picked the one that could be opened. Or if they want the DSU one they could specify it. That can all still work _and_ we can support "prefer sysfs/JSON" as long as we don't prefer it when opening the event doesn't work. Thanks James > Thanks, > Ian > >> - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-29 14:50 ` James Clark @ 2024-05-29 17:33 ` Ian Rogers 2024-05-30 15:37 ` James Clark 2024-05-29 18:44 ` Arnaldo Carvalho de Melo 1 sibling, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-29 17:33 UTC (permalink / raw) To: James Clark Cc: Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, May 29, 2024 at 7:50 AM James Clark <james.clark@arm.com> wrote: > > On 28/05/2024 20:51, Ian Rogers wrote: > > On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo > > <acme@kernel.org> wrote: > >> > >> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote: > >>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: > >>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: > >>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> > >>>>>> This makes 'perf record' work for me again. > >> > >>>>> Oh, wait, no it doesn't. > >> > >>>>> It makes just the plain "perf record" without any arguments work, > >>>>> which was what I was testing because I was lazy. > >> > >>>>> So now > >> > >>>>> $ perf record sleep 1 > >> > >>>>> works fine. But > >> > >>>>> $ perf record -e cycles:pp sleep 1 > >> > >>>>> is still completely broken (with or without ":p" and ":pp"). > >> > >>>> Seems to me that this patch fails to check if a PMU is a core-attached > >>>> PMU that can support common hardware events. Therefore, we should > >>>> consider adding the following check. > >> > >>>> +++ b/tools/perf/util/parse-events.c > >>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, > >>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) { > >>>> bool auto_merge_stats; > >>>> > >>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) > >>>> + continue; > >>>> + > >>>> if (parse_events__filter_pmu(parse_state, pmu)) > >>>> continue; > >> > >>>> To be clear, I only compiled this change but I have no chance to test > >>>> it. @Ian, could you confirm this? > >> > >>> Hi Leo, > >> > >>> so the code is working as intended. I believe it also agrees with what > >>> Arnaldo thinks. > >> > >>> If you do: > >> > >>> $ perf stat -e cycles ... > >> > >>> and you have > >> > >>> /sys/devices/pmu1/events/cycles > >>> /sys/devices/pmu2/events/cycles > >> > >>> The output of perf stat should contain counts for pmu1 and pmu2. Were > >>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the > >> > >> Sure, what is being asked is to count events and if those two events in > >> those two PMUs can count, then do what the user asked. > >> > >> For 'perf record' we're asking for sampling, if the event has the name > >> specified and can't be sampled, skip it, warn the user and even so > >> only if verbose mode is asked, something like: > >> > >> root@x1:~# perf record -e cycles -a sleep 1 > >> [ perf record: Woken up 1 times to write data ] > >> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] > >> root@x1:~# perf evlist > >> cpu_atom/cycles/ > >> cpu_core/cycles/ > >> dummy:u > >> root@x1:~# > >> > >> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom', > >> another in a 'cpu_core' one, both can be sampled, my workload may > >> run/use resources on then, I'm interested, sample both. > >> > >> But if we had some other PMU, to use a name Jiri uses in tests/fake > >> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/' > >> and for some reason it doesn't support sampling, skip it, then the > >> result should be the same as above. > >> > >> If the user finds it strange after looking at sysfs that 'krava/cycles/' > >> isn't being sampled, the usual workflow is to ask perf for more > >> verbosity, using -v (or multiple 'v' letters to get increasing levels of > >> verbosity), in which case the user would see: > >> > >> root@x1:~# perf record -v -e cycles -a sleep 1 > >> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling. > >> [ perf record: Woken up 1 times to write data ] > >> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] > >> root@x1:~# perf evlist > > This makes sense to me. I like keeping the old apparent behavior unless > -v is used and it will feel like the tool "just works". > > In the context of the commit summary "perf parse-events: Prefer > sysfs/JSON hardware events over legacy": > > I don't follow why that should be "Prefer, even if it's an event that > can't be opened, sysfs/JSON...". > > Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then > use legacy". If all events can be opened, sure go and open them all. If > only core events can be opened, do that too. If only uncore events can > be opened... etc. Currently event parsing is not the same as event opening. Event parsing I hope is about generating a set of `struct perf_event_attr`, I'd like to see it migrate to libperf and it needn't pull in evsels and evlists that are a construct on top of these. There's always more work to do. This whole can't the tool do blah, blah, blah.. well fine, but that's not actually describing the tool. It is describing a different tool and the question is how to get from what the tool is to what you describe. A lot of people talk to me about what the tool should do, they are a lot more hesitant to actually do the work. A command line with an event like: $ perf record -e inst_retired.any ... seems simple enough. But how many PMUs support inst_retired.any, what should the set of 'struct perf_event_attr' look like? Oh that's obvious Ian, you're talking about an Intel event so there should be exactly one and it should be on the PMU 'cpu' (note, until recently we didn't even really track PMUs associated with an event). But then what about hybrid? Ah yes, you want two 'struct perf_event_attr' one set up for cpu_core and one set up for cpu_atom. Okay, so how in the tool do I "know" this? Well the tool should scan PMUs, look for the event and then create a 'struct perf_event_attr' for each one. Okay that's where we are today, although there are some places where core and non-core PMUs are treated differently and I'd like that not to be the case as it has caused conditions where we will program events differently, and in particular I believe this breaks ARM Apple M? PMUs. It also aligned with RISC-V plans for how to support events. When a legacy event occurs there is a whole bunch of other processing. The legacy event names are hard coded into the parse events parser: https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n332 An example of a legacy event name is cycles. The parsing of cycles used to create exactly one 'struct perf_event_attr' with the PMU type set to the special PERF_TYPE_HARDWARE and the config set to PERF_COUNT_HW_CPU_CYCLES. But then what about hybrid? Well then we need an event for each PMU type otherwise when a process gets rescheduled between CPU types it will suddenly become invisible. This is something we fixed in Linux 6.x. Ah, Ian did you know on Apple M? PMUs the legacy events are broken and we need to always use sysfs or json events? The fact that ARM's PMU didn't look like a core PMU meant we wouldn't try to program legacy events on them and now your code to fix Intel's hybrid has exposed a latent ARM PMU bug? Oh and by the way this has been broken for multiple Linux releases, go fix it by yesterday. My expectation on fixing this was that Intel would just say to go away. But no, Kan was cool with it and I went ahead inverted the priorities and now sysfs and json events have priority - the largest part of the work is making the test expectations align with what the code does. So what this means for our cycles event, well the logic to scan all PMUs for inst_retired.any is now the default logic but in the case of a legacy event we also remember a legacy encoding is possible. So if the event isn't found in sysfs or json for a core PMU then we use the legacy encoding. Note, I'm just talking about making 'struct perf_event_attr' not opening events. Great, so why did you need a patch in 6.10? The process was incomplete. Let's say on a Apple M? system you went: $ perf record -e 'armv8_pmuv3_0/cycles/' ... then you would indeed get a sysfs encoded event. But what about: $ perf record -e 'cycles' ... I may want the event on the BIG and little cores. Well in that case the sysfs and json overriding legacy event logic hadn't been applied. This means that the legacy encoding would always be used, but as I already mentioned the legacy encoding is broken on a Apple M? system. The user is supposed to somehow know the fix is to add the PMU name and specify multiple events. With the now reverted change the event parsing logic would scan for the cycles event on every PMU and create a struct perf_event_attr for it (the code makes evsels, but they are just wrappers on top of a struct perf_event_attr). But why scan all PMUs? Well that was done to make hybrid work and because perf's behavior has always been either to handle an event as legacy or handle it as a potential event on all PMUs. We're making legacy second class, given the desire to prefer sysfs and json, and we only use that encoding when the scan fails to find the event on a core PMU. So great, ignoring the revert, that fixed everything? Well no. The tool in places was hard coding 'struct perf_event_attr' which is of course broken were things to be hybrid or BIG.little. So the fix for that was to not hard code things. We need a set of 'struct perf_event_attr', ah I know a way to get that let's just use our event parsing logic. So a 'struct perf_event_attr' hard coding type to PERF_TYPE_HARDWARE, the config to PERF_COUNT_HW_CPU_CYCLES and also setting the precision to maximum was changed into parsing the string "cycles:P". Sounds good, no? Well no. Somebody decided to create an ARM event called cycles (Intel's name to avoid conflicts is clockticks) and now that event was getting added to our set. Although the patch sat for weeks (months?) on linux-next nobody had discovered a fairly obvious breakage. Oh Ian this is your crazy json Ian wants to pick a PMU and wants to break the world not caring about regressions mentality, admit you hate everybody and let me throw some expletives at you. Well I hope not. We were parsing "cycles:P" to generate our set of perf_event_attr, let's avoid the problem of getting uncore PMU events by making it so the events parsed are only on core PMUs - hey, this is at least better than the broken on hybrid hard coded alternative. Which is where this patch sits on top of everything. Well we hate it. Cycles is clearly only something that CPUs have (ignoring the fact ARM have explicitly called their L3 cache event cycles, cough, cough..) make everything go back. Go back to where? We want legacy events to have priority again? But we have legacy event names like l1-i-speculative-read-prefetches, we want to hard code that nonsense and make it modify the regular scan all PMUs logic? Clearly no, and that's where Arnaldo's proposal is. Under Arnaldo's proposal, as I understand it, the parsing will generate the set of 'struct perf_event_attr' in the list of evsels. When we go to open the evsels a failure needn't be fatal. Notice the important distinction here. You are talking about opening during parsing, the proposal is to postpone that. Great, go and code. I did - I even did it on the weekend. The events we want to ignore I reuse the idea of skippable that was introduced for when a hard coded 'struct perf_event_attr' could fail to open but we'd moved the code to use event parsing. The problem is that nothing in the perf record logic, that is split between perf and libperf, expects an evsel not to open. To add the workarounds was making a change large enough that it was unreasonable to think it could land in v6.10 as it may well break something else. So where does that leave us: 1) we have Arnaldo's proposal, it sounds good but it has clear behavior changes. My rough implementation showed it to be too big for v6.10. 2) v6.10 has reverted a patch fixing the event encodings for the PMU on ARM Apple M? CPUs. 3) as the reverted patch has undone something fundamental our tests are now out of sync with what event parsing does (hey ARM never made the event parsing tests work anyway so why care about this one, right?) 4) if we revert the revert it fixes the Apple M? issue but breaks Neoverse if the arm_dsu PMU is enabled. Well then apply this fix on top of the revert and you're good? No, what has explicitly been said is that "cycles" shouldn't follow any of the patterns of event parsing so far described. But why? Well because even though the PMUs are advertising two cycles events, cycles needs to be special and mean core only if no PMU is given. So 1 cycles event then? Well no, we still want PMU scanning because there's hybrid to care about. Okay, and is it just cycles that are special? We have potentially >200 legacy events where a similar argument could be made about making them core only, what about instructions, what about branches... The reply I got was, go away and I will no longer speak to you. I think the best plan we have for moving forward is Arnaldo's proposal, Linus has explicitly stated that it is wrong and there should be special event names. We can't land this in v6.10 which is why I've said that it doesn't seem so terrible that on a Neoverse if you are specifying events by hand and you don't specify a PMU and the event you specify happens to be cycles, make it clear you want the armv8_pmuv3_0 one by adding the PMU name. This has always been the way to clear up the ambiguity. > I'm happy to help with the implementation or testing of that on my > big.LITTLE system. > > FWIW the hybrid stuff with Perf stat is already working well for me on > Arm since we added PERF_PMU_CAP_EXTENDED_HW_TYPE which Ian suggested. > > > > > The problem here is that we're hiding a problem rather than reporting > > it. Typically we report the issue and more than that we ask the user > > to work around the issue. That would be analogous to wanting the user > > to specify what PMU they want the event to apply to, which has always > > been perf's behavior. > > > > Is the problem you are referring to that there are multiple PMUs with > 'cycles' events? Surely that's only a problem in the context of the new > proposed behavior, otherwise it's not really a problem. It's just > something that happens to exist. So perf will try to open events, when failing it will fall back, if an event fails to open then we report an error. This is why none of the code currently handles an evsel in an evlist where the event has failed to open. Failure to open could be a behavior someone relies upon. Let's say you want to probe the behavior of a PMU, you could be relying on failures to show you the event won't work. This is Hyrum's law: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. It isn't new behavior for perf to scan all PMUs, it always has, the new behavior is around legacy events. We want multiple PMU scanning for hybrid, we want all PMU scanning for uncore. The legacy changes happened because of the Apple M? PMU with me being complained at by folks at ARM who have now created this mess by their arm_dsu event name. Shouldn't it be a 1 liner fix to change "DSU_EVENT_ATTR(cycles, 0x11)," to "DSU_EVENT_ATTR(clockticks, 0x11),": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_dsu_pmu.c#n177 that's up to ARM but it would make sense to me. > Because the user could always use the defaults (no argument) or -e > cycles and historically Perf correctly picked the one that could be > opened. Or if they want the DSU one they could specify it. That can all > still work _and_ we can support "prefer sysfs/JSON" as long as we don't > prefer it when opening the event doesn't work. Hopefully this is all explained above. Thanks, Ian > Thanks > James > > > Thanks, > > Ian > > > >> - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-29 17:33 ` Ian Rogers @ 2024-05-30 15:37 ` James Clark 2024-05-30 16:14 ` Ian Rogers 0 siblings, 1 reply; 34+ messages in thread From: James Clark @ 2024-05-30 15:37 UTC (permalink / raw) To: Ian Rogers Cc: Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On 29/05/2024 18:33, Ian Rogers wrote: > On Wed, May 29, 2024 at 7:50 AM James Clark <james.clark@arm.com> wrote: >> >> On 28/05/2024 20:51, Ian Rogers wrote: >>> On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo >>> <acme@kernel.org> wrote: >>>> >>>> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote: >>>>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: >>>>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: >>>>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>>> >>>>>>>> This makes 'perf record' work for me again. >>>> >>>>>>> Oh, wait, no it doesn't. >>>> >>>>>>> It makes just the plain "perf record" without any arguments work, >>>>>>> which was what I was testing because I was lazy. >>>> >>>>>>> So now >>>> >>>>>>> $ perf record sleep 1 >>>> >>>>>>> works fine. But >>>> >>>>>>> $ perf record -e cycles:pp sleep 1 >>>> >>>>>>> is still completely broken (with or without ":p" and ":pp"). >>>> >>>>>> Seems to me that this patch fails to check if a PMU is a core-attached >>>>>> PMU that can support common hardware events. Therefore, we should >>>>>> consider adding the following check. >>>> >>>>>> +++ b/tools/perf/util/parse-events.c >>>>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, >>>>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) { >>>>>> bool auto_merge_stats; >>>>>> >>>>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) >>>>>> + continue; >>>>>> + >>>>>> if (parse_events__filter_pmu(parse_state, pmu)) >>>>>> continue; >>>> >>>>>> To be clear, I only compiled this change but I have no chance to test >>>>>> it. @Ian, could you confirm this? >>>> >>>>> Hi Leo, >>>> >>>>> so the code is working as intended. I believe it also agrees with what >>>>> Arnaldo thinks. >>>> >>>>> If you do: >>>> >>>>> $ perf stat -e cycles ... >>>> >>>>> and you have >>>> >>>>> /sys/devices/pmu1/events/cycles >>>>> /sys/devices/pmu2/events/cycles >>>> >>>>> The output of perf stat should contain counts for pmu1 and pmu2. Were >>>>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the >>>> >>>> Sure, what is being asked is to count events and if those two events in >>>> those two PMUs can count, then do what the user asked. >>>> >>>> For 'perf record' we're asking for sampling, if the event has the name >>>> specified and can't be sampled, skip it, warn the user and even so >>>> only if verbose mode is asked, something like: >>>> >>>> root@x1:~# perf record -e cycles -a sleep 1 >>>> [ perf record: Woken up 1 times to write data ] >>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] >>>> root@x1:~# perf evlist >>>> cpu_atom/cycles/ >>>> cpu_core/cycles/ >>>> dummy:u >>>> root@x1:~# >>>> >>>> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom', >>>> another in a 'cpu_core' one, both can be sampled, my workload may >>>> run/use resources on then, I'm interested, sample both. >>>> >>>> But if we had some other PMU, to use a name Jiri uses in tests/fake >>>> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/' >>>> and for some reason it doesn't support sampling, skip it, then the >>>> result should be the same as above. >>>> >>>> If the user finds it strange after looking at sysfs that 'krava/cycles/' >>>> isn't being sampled, the usual workflow is to ask perf for more >>>> verbosity, using -v (or multiple 'v' letters to get increasing levels of >>>> verbosity), in which case the user would see: >>>> >>>> root@x1:~# perf record -v -e cycles -a sleep 1 >>>> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling. >>>> [ perf record: Woken up 1 times to write data ] >>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] >>>> root@x1:~# perf evlist >> >> This makes sense to me. I like keeping the old apparent behavior unless >> -v is used and it will feel like the tool "just works". >> >> In the context of the commit summary "perf parse-events: Prefer >> sysfs/JSON hardware events over legacy": >> >> I don't follow why that should be "Prefer, even if it's an event that >> can't be opened, sysfs/JSON...". >> >> Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then >> use legacy". If all events can be opened, sure go and open them all. If >> only core events can be opened, do that too. If only uncore events can >> be opened... etc. > [...] > So great, ignoring the revert, that fixed everything? Well no. The > tool in places was hard coding 'struct perf_event_attr' which is of > course broken were things to be hybrid or BIG.little. So the fix for > that was to not hard code things. We need a set of 'struct > perf_event_attr', ah I know a way to get that let's just use our event > parsing logic. So a 'struct perf_event_attr' hard coding type to > PERF_TYPE_HARDWARE, the config to PERF_COUNT_HW_CPU_CYCLES and also > setting the precision to maximum was changed into parsing the string > "cycles:P". Sounds good, no? Well no. Somebody decided to create an > ARM event called cycles (Intel's name to avoid conflicts is > clockticks) and now that event was getting added to our set. Although > the patch sat for weeks (months?) on linux-next nobody had discovered > a fairly obvious breakage. > We did see the test failure on our Ampere test machine 7 days ago, but for some reason only on mainline (I was also on holiday at the same time). I'm checking if that machine is running all the branches and will make sure it does from now on. We are running perf-tools-next on other machines and I try to look at all the test failures. Just this one had a bit of an obscure combination of needing the DSU PMU. One thing we don't have in CI is any Apple M hardware. I can look into it but I wouldn't have high hopes for anything soon. [...] > It isn't new behavior for perf to scan all PMUs, it always has, the > new behavior is around legacy events. We want multiple PMU scanning > for hybrid, we want all PMU scanning for uncore. The legacy changes > happened because of the Apple M? PMU with me being complained at by > folks at ARM who have now created this mess by their arm_dsu event > name. Shouldn't it be a 1 liner fix to change "DSU_EVENT_ATTR(cycles, > 0x11)," to "DSU_EVENT_ATTR(clockticks, 0x11),": > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_dsu_pmu.c#n177 > that's up to ARM but it would make sense to me. > Not sure about that one, that would break anyone's scripts or tools that are looking at DSU cycles. And it wouldn't fix the issue in the future if there were other reasons the event doesn't open (like non sampling core events, or someone's brand new uncore PMU that also has a cycles event). It seems like we're converging one something that works though in the other threads, but I'm still digesting the problems a bit. >> Because the user could always use the defaults (no argument) or -e >> cycles and historically Perf correctly picked the one that could be >> opened. Or if they want the DSU one they could specify it. That can all >> still work _and_ we can support "prefer sysfs/JSON" as long as we don't >> prefer it when opening the event doesn't work. > > Hopefully this is all explained above. > > Thanks, > Ian > >> Thanks >> James >> >>> Thanks, >>> Ian >>> >>>> - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-30 15:37 ` James Clark @ 2024-05-30 16:14 ` Ian Rogers 0 siblings, 0 replies; 34+ messages in thread From: Ian Rogers @ 2024-05-30 16:14 UTC (permalink / raw) To: James Clark Cc: Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Thu, May 30, 2024 at 8:37 AM James Clark <james.clark@arm.com> wrote: > > > > On 29/05/2024 18:33, Ian Rogers wrote: > > On Wed, May 29, 2024 at 7:50 AM James Clark <james.clark@arm.com> wrote: > >> > >> On 28/05/2024 20:51, Ian Rogers wrote: > >>> On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo > >>> <acme@kernel.org> wrote: > >>>> > >>>> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote: > >>>>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <leo.yan@linux.dev> wrote: > >>>>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote: > >>>>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >>>> > >>>>>>>> This makes 'perf record' work for me again. > >>>> > >>>>>>> Oh, wait, no it doesn't. > >>>> > >>>>>>> It makes just the plain "perf record" without any arguments work, > >>>>>>> which was what I was testing because I was lazy. > >>>> > >>>>>>> So now > >>>> > >>>>>>> $ perf record sleep 1 > >>>> > >>>>>>> works fine. But > >>>> > >>>>>>> $ perf record -e cycles:pp sleep 1 > >>>> > >>>>>>> is still completely broken (with or without ":p" and ":pp"). > >>>> > >>>>>> Seems to me that this patch fails to check if a PMU is a core-attached > >>>>>> PMU that can support common hardware events. Therefore, we should > >>>>>> consider adding the following check. > >>>> > >>>>>> +++ b/tools/perf/util/parse-events.c > >>>>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, > >>>>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) { > >>>>>> bool auto_merge_stats; > >>>>>> > >>>>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core) > >>>>>> + continue; > >>>>>> + > >>>>>> if (parse_events__filter_pmu(parse_state, pmu)) > >>>>>> continue; > >>>> > >>>>>> To be clear, I only compiled this change but I have no chance to test > >>>>>> it. @Ian, could you confirm this? > >>>> > >>>>> Hi Leo, > >>>> > >>>>> so the code is working as intended. I believe it also agrees with what > >>>>> Arnaldo thinks. > >>>> > >>>>> If you do: > >>>> > >>>>> $ perf stat -e cycles ... > >>>> > >>>>> and you have > >>>> > >>>>> /sys/devices/pmu1/events/cycles > >>>>> /sys/devices/pmu2/events/cycles > >>>> > >>>>> The output of perf stat should contain counts for pmu1 and pmu2. Were > >>>>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the > >>>> > >>>> Sure, what is being asked is to count events and if those two events in > >>>> those two PMUs can count, then do what the user asked. > >>>> > >>>> For 'perf record' we're asking for sampling, if the event has the name > >>>> specified and can't be sampled, skip it, warn the user and even so > >>>> only if verbose mode is asked, something like: > >>>> > >>>> root@x1:~# perf record -e cycles -a sleep 1 > >>>> [ perf record: Woken up 1 times to write data ] > >>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] > >>>> root@x1:~# perf evlist > >>>> cpu_atom/cycles/ > >>>> cpu_core/cycles/ > >>>> dummy:u > >>>> root@x1:~# > >>>> > >>>> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom', > >>>> another in a 'cpu_core' one, both can be sampled, my workload may > >>>> run/use resources on then, I'm interested, sample both. > >>>> > >>>> But if we had some other PMU, to use a name Jiri uses in tests/fake > >>>> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/' > >>>> and for some reason it doesn't support sampling, skip it, then the > >>>> result should be the same as above. > >>>> > >>>> If the user finds it strange after looking at sysfs that 'krava/cycles/' > >>>> isn't being sampled, the usual workflow is to ask perf for more > >>>> verbosity, using -v (or multiple 'v' letters to get increasing levels of > >>>> verbosity), in which case the user would see: > >>>> > >>>> root@x1:~# perf record -v -e cycles -a sleep 1 > >>>> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling. > >>>> [ perf record: Woken up 1 times to write data ] > >>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ] > >>>> root@x1:~# perf evlist > >> > >> This makes sense to me. I like keeping the old apparent behavior unless > >> -v is used and it will feel like the tool "just works". > >> > >> In the context of the commit summary "perf parse-events: Prefer > >> sysfs/JSON hardware events over legacy": > >> > >> I don't follow why that should be "Prefer, even if it's an event that > >> can't be opened, sysfs/JSON...". > >> > >> Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then > >> use legacy". If all events can be opened, sure go and open them all. If > >> only core events can be opened, do that too. If only uncore events can > >> be opened... etc. > > > > [...] > > > So great, ignoring the revert, that fixed everything? Well no. The > > tool in places was hard coding 'struct perf_event_attr' which is of > > course broken were things to be hybrid or BIG.little. So the fix for > > that was to not hard code things. We need a set of 'struct > > perf_event_attr', ah I know a way to get that let's just use our event > > parsing logic. So a 'struct perf_event_attr' hard coding type to > > PERF_TYPE_HARDWARE, the config to PERF_COUNT_HW_CPU_CYCLES and also > > setting the precision to maximum was changed into parsing the string > > "cycles:P". Sounds good, no? Well no. Somebody decided to create an > > ARM event called cycles (Intel's name to avoid conflicts is > > clockticks) and now that event was getting added to our set. Although > > the patch sat for weeks (months?) on linux-next nobody had discovered > > a fairly obvious breakage. > > > > We did see the test failure on our Ampere test machine 7 days ago, but > for some reason only on mainline (I was also on holiday at the same > time). I'm checking if that machine is running all the branches and will > make sure it does from now on. > > We are running perf-tools-next on other machines and I try to look at > all the test failures. Just this one had a bit of an obscure combination > of needing the DSU PMU. Thanks, I'm truly appreciative of greater testing and I appreciate it doesn't happen for free. It also has ongoing costs. Thank you! > One thing we don't have in CI is any Apple M hardware. I can look into > it but I wouldn't have high hopes for anything soon. As mentioned in these threads it is also knowingly broken - what the reverted patch was trying to address. Perhaps one could be captured on the way to e-waste if the screen,... don't work. We don't need much. Apple M is the root cause of much special behavior in the perf tool and the testing situation on it is sad. Maybe the Linux Foundation could get one? > [...] > > > It isn't new behavior for perf to scan all PMUs, it always has, the > > new behavior is around legacy events. We want multiple PMU scanning > > for hybrid, we want all PMU scanning for uncore. The legacy changes > > happened because of the Apple M? PMU with me being complained at by > > folks at ARM who have now created this mess by their arm_dsu event > > name. Shouldn't it be a 1 liner fix to change "DSU_EVENT_ATTR(cycles, > > 0x11)," to "DSU_EVENT_ATTR(clockticks, 0x11),": > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_dsu_pmu.c#n177 > > that's up to ARM but it would make sense to me. > > > > Not sure about that one, that would break anyone's scripts or tools that > are looking at DSU cycles. And it wouldn't fix the issue in the future > if there were other reasons the event doesn't open (like non sampling > core events, or someone's brand new uncore PMU that also has a cycles > event). Right, but should the resolution there be to specify which PMU you want to resolve the ambiguity. The tool telling you a PMU doesn't support sampling is signal. We'd expect the tool to fail if it didn't support an event, is it really unreasonable to fail on a mode of an event? I'm further confused by the DSU driver. The names: DSU_EVENT_ATTR(cycles, 0x11), DSU_EVENT_ATTR(bus_access, 0x19), DSU_EVENT_ATTR(memory_error, 0x1a), DSU_EVENT_ATTR(bus_cycles, 0x1d), The last 3 seem unambiguous, but cycles, couldn't it be read as also possibly meaning bus_cycles? Wouldn't cpu_cycles be in keeping with ARM's other names and an objectively better name? Getting that 1 liner in v6.10 would resolve a lot of problems. I also think llc or l3 may be more "intention revealing" names for the device than dsu, but hey I don't want to start a naming war. > It seems like we're converging one something that works though in the > other threads, but I'm still digesting the problems a bit. I think the major blocker is that some people, although I can only name 1 and they've stopped listening, think event names should somehow carry special meaning and in that special case it implies use only core PMUs. The only known example of a special event is cycles, but if you look for equivalent event names in perf's code there are things as banal as branches and instructions, and as wild as dTLB-speculative-read-misses. I'd really like not to carry around notions of special event names but if we merge patches that ignore that and then they get reverted, justified through hand written tests looking to poke at things like the dsu PMU, I don't know where we are. It has also been threatened that the perf code could be removed from the kernel tree, so there's really no fun in poking a bear. Thanks, Ian > >> Because the user could always use the defaults (no argument) or -e > >> cycles and historically Perf correctly picked the one that could be > >> opened. Or if they want the DSU one they could specify it. That can all > >> still work _and_ we can support "prefer sysfs/JSON" as long as we don't > >> prefer it when opening the event doesn't work. > > > > Hopefully this is all explained above. > > > > Thanks, > > Ian > > > >> Thanks > >> James > >> > >>> Thanks, > >>> Ian > >>> > >>>> - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-29 14:50 ` James Clark 2024-05-29 17:33 ` Ian Rogers @ 2024-05-29 18:44 ` Arnaldo Carvalho de Melo 2024-05-29 19:25 ` Ian Rogers 1 sibling, 1 reply; 34+ messages in thread From: Arnaldo Carvalho de Melo @ 2024-05-29 18:44 UTC (permalink / raw) To: James Clark Cc: Ian Rogers, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, May 29, 2024 at 03:50:53PM +0100, James Clark wrote: > Is the problem you are referring to that there are multiple PMUs with > 'cycles' events? Surely that's only a problem in the context of the new > proposed behavior, otherwise it's not really a problem. It's just > something that happens to exist. > Because the user could always use the defaults (no argument) or -e > cycles and historically Perf correctly picked the one that could be See below to see if mixing up "cycles" for efficiency and performance cores is something sane or if I am missing something. > opened. Or if they want the DSU one they could specify it. That can all > still work _and_ we can support "prefer sysfs/JSON" as long as we don't > prefer it when opening the event doesn't work. Yeah, getting all the events in all the PMUs that match a string (after it is normalized to cover historical artifacts, as in the case of "cycles", "cpu_cycles" and "cpu-cycles", all of which should mean "cycles" the special, default event) and that can sample if that is what is being asked seems to be a sane outcome from this discussion. But lemme do try to show the differences from my Lenovo Intel Hybrid system (13th gen) and a Libre Computer Rockchip ARM64 hybrid system: There are some differences on how ARM64 supports hybrid that we may find interesting to fix or at least to (better) document, for instance: root@roc-rk3399-pc:~# dmidecode -H 1 # dmidecode 3.4 Getting SMBIOS data from sysfs. SMBIOS 3.0 present. 7 structures occupying 283 bytes. Table at 0xEAE7A020. Handle 0x0001, DMI type 1, 27 bytes System Information Manufacturer: libre-computer Product Name: roc-rk3399-pc Version: Not Specified Serial Number: b03c01a7179278b7 UUID: 63333062-3130-3761-3137-393237386237 Wake-up Type: Reserved SKU Number: Not Specified Family: Not Specified root@roc-rk3399-pc:~# This is a hybrid architecture: root@roc-rk3399-pc:~# ls -la /sys/devices/*/events/cpu_cycles -r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a53/events/cpu_cycles -r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a72/events/cpu_cycles root@roc-rk3399-pc:~# In an intel hybrid system we instead have: root@number:~# ls -la /sys/devices/*/events/cpu-cycles -r--r--r--. 1 root root 4096 May 29 13:59 /sys/devices/cpu_atom/events/cpu-cycles -r--r--r--. 1 root root 4096 May 29 14:00 /sys/devices/cpu_core/events/cpu-cycles root@number:~# Small difference, a - versus a _, but then both hybrid, efficiency cores (armv8_cortex_a53 vs cpu_atom) and performance ones (armv8_cortex_a72 vs cpu_core). On the Intel Hybrid system: root@number:~# perf record -e cycles -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 4.709 MB perf.data (46911 samples) ] root@number:~# perf evlist cpu_atom/cycles/ cpu_core/cycles/ dummy:u root@number:~# root@number:~# perf evlist -v cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 root@number:~# So it is recording CPU cycles in all the CPUs in the system, performance and efficiency ones and that gets clear on a per-sample base: root@number:~# perf script perf 2465078 [000] 73716.379947: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms]) perf 2465078 [001] 73716.379966: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms]) <SNIP more cpu_core/cycles/ samples> gnome-shell 2608 [018] 73716.380704: 6721618 cpu_atom/cycles/: ffffffffc0b8419c fw_domains_get_with_fallback+0xfc ([kernel.kallsyms]) podman 688107 [017] 73716.380706: 6695621 cpu_atom/cycles/: 564fc6110da0 [unknown] (/usr/bin/podman) podman 687246 [000] 73716.380842: 8844997 cpu_core/cycles/: ffffffffb515150c _raw_spin_lock_irqsave+0xc ([kernel.kallsyms]) podman 688108 [016] 73716.380913: 6737580 cpu_atom/cycles/: ffffffffb515205c native_queued_spin_lock_slowpath+0x28c ([kernel.kallsyms]) swapper 0 [004] 73716.380932: 2090132 cpu_core/cycles/: ffffffffb513ad49 poll_idle+0x59 ([kernel.kallsyms]) <SNIP> But on the ARM hybrid system, without Ian's patch, i.e. with what is in torvalds/master right now (plus some header copies updates I'm working on that are unrelated): root@roc-rk3399-pc:~# perf record -e cycles -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.135 MB perf.data (359 samples) ] root@roc-rk3399-pc:~# perf evlist cycles dummy:u root@roc-rk3399-pc:~# It records just one "event" even tho it is recording for all CPUs, both efficiency and performance: root@number:~# perf script <SNIP> kworker/2:1-eve 10124 [002] 9687.302790: 60674 cycles: ffffc4c65bdd7380 vmap_small_pages_range_noflush+0x190 ([kernel.kallsyms]) kworker/2:1-eve 10124 [002] 9687.302957: 66040 cycles: ffffc4c65bdd7438 vmap_small_pages_range_noflush+0x248 ([kernel.kallsyms]) kworker/2:1-eve 10124 [002] 9687.303139: 71011 cycles: ffffc4c65cde0210 ww_mutex_lock+0x60 ([kernel.kallsyms]) swapper 0 [002] 9687.303342: 75390 cycles: ffffc4c65bbc31c8 update_blocked_averages+0x188 ([kernel.kallsyms]) swapper 0 [000] 9687.309276: 45496 cycles: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms]) <SNIP> Everything appears as "cycles" but we're getting samples for all CPUs, again, performance and efficiency ones, different kinds of processors, right? root@roc-rk3399-pc:~# perf report --stdio --sort cpu # To display the perf.data header info, please use --header/--header-only options. # # Total Lost Samples: 0 # # Samples: 359 of event 'cycles' # Event count (approx.): 23873034 # # Overhead CPU # ........ ... # 31.34% 003 22.44% 004 19.30% 000 12.94% 002 9.14% 001 4.84% 005 root@roc-rk3399-pc:~# If we try, instead with cpu-cycles: root@roc-rk3399-pc:~# perf record -e cpu-cycles -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.135 MB perf.data (346 samples) ] root@roc-rk3399-pc:~# perf evlist cpu-cycles dummy:u root@roc-rk3399-pc:~# root@roc-rk3399-pc:~# perf evlist -v cpu-cycles: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 root@roc-rk3399-pc:~# Both 'cycles' and 'cpu-cycles' end up the same as type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES). But if we use something equivalent but with that - replaced with a _ we get a behaviour that is closer to the Intel one: root@roc-rk3399-pc:~# perf record -e cpu_cycles -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.137 MB perf.data (390 samples) ] root@roc-rk3399-pc:~# root@roc-rk3399-pc:~# perf evlist armv8_cortex_a53/cpu_cycles/ armv8_cortex_a72/cpu_cycles/ dummy:u root@roc-rk3399-pc:~# root@roc-rk3399-pc:~# perf evlist -v armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 root@roc-rk3399-pc:~# That doesn't mixes up CPU cycles for different CPU types: root@roc-rk3399-pc:~# perf script <SNIP> perf 16726 [005] 12632.206216: 3798 armv8_cortex_a72/cpu_cycles/: ffffc4c65be618d8 do_vfs_ioctl+0x424 ([kernel.kallsyms]) swapper 0 [000] 12632.206235: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms]) perf 16726 [005] 12632.206272: 20279 armv8_cortex_a72/cpu_cycles/: ffffc4c65be113b4 kmem_cache_alloc+0x44 ([kernel.kallsyms]) sugov:4 166 [004] 12632.206409: 52979 armv8_cortex_a72/cpu_cycles/: ffffc4c65cde5de8 _raw_spin_unlock_irqrestore+0x14 ([kernel.kallsyms]) perf 16726 [005] 12632.206443: 67123 armv8_cortex_a72/cpu_cycles/: ffffc4c65be26bbc arch_local_irq_restore+0x8 ([kernel.kallsyms]) perf 16726 [005] 12632.206690: 96987 armv8_cortex_a72/cpu_cycles/: ffffc4c65bdb4a84 fault_in_readable+0xe4 ([kernel.kallsyms]) perf-exec 16727 [004] 12632.206836: 84199 armv8_cortex_a72/cpu_cycles/: ffffc4c65bd6c3b4 next_uptodate_page+0x264 ([kernel.kallsyms]) perf 16726 [005] 12632.206950: 102567 armv8_cortex_a72/cpu_cycles/: ffffc4c65bbe2aa4 up_write+0xa4 ([kernel.kallsyms]) swapper 0 [000] 12632.207030: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms]) perf-exec 16727 [004] 12632.207037: 79507 armv8_cortex_a72/cpu_cycles/: ffffc4c65c48b89c strnlen_user+0x16c ([kernel.kallsyms]) <SNIP> So from the point of view of the user its important to differentiate samples for each type of CPU, so grouping everything into the same basket as ARM did in its big.LITTLE seems strange/"wrong". The way that in Intel when it "does the right thing" (I think no quotes are needed here, but I may be missing something) and at the tool level translates the special event name "cycles" into what the Intel PMU kernel drivers advertises to the world via sysfs as /sys/devices/cpu_{atom,core}/events/cpu-cycles (with that -) and ARM advertises as /sys/devices/armv8_cortex_a{53,72}/events/cpu_cycles (note the _) but gets translated in terms of 'struct perf_event_attr' as Intel: cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000 cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000 Versus ARM as: armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles) armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles) can possibly be made more consistent in a way that doesn't break any user experience using the perf command line. I would propose that 'cycles' explicitely asked or as the default, translates into armv8_cortex_a{53,72}/cpu_cycles/ on ARM and on /sys/devices/cpu_{atom,core}/events/cpu-cycles on Intel, and that whatever other architecture that comes to this party tries to learn from this botched experience. - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-29 18:44 ` Arnaldo Carvalho de Melo @ 2024-05-29 19:25 ` Ian Rogers 2024-05-30 5:35 ` Namhyung Kim 0 siblings, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-29 19:25 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: James Clark, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, May 29, 2024 at 11:44 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Wed, May 29, 2024 at 03:50:53PM +0100, James Clark wrote: > > Is the problem you are referring to that there are multiple PMUs with > > 'cycles' events? Surely that's only a problem in the context of the new > > proposed behavior, otherwise it's not really a problem. It's just > > something that happens to exist. > > > Because the user could always use the defaults (no argument) or -e > > cycles and historically Perf correctly picked the one that could be > > See below to see if mixing up "cycles" for efficiency and performance > cores is something sane or if I am missing something. > > > opened. Or if they want the DSU one they could specify it. That can all > > still work _and_ we can support "prefer sysfs/JSON" as long as we don't > > prefer it when opening the event doesn't work. > > Yeah, getting all the events in all the PMUs that match a string (after > it is normalized to cover historical artifacts, as in the case of > "cycles", "cpu_cycles" and "cpu-cycles", all of which should mean > "cycles" the special, default event) and that can sample if that is what > is being asked seems to be a sane outcome from this discussion. > > But lemme do try to show the differences from my Lenovo Intel Hybrid > system (13th gen) and a Libre Computer Rockchip ARM64 hybrid system: > > There are some differences on how ARM64 supports hybrid that we may find > interesting to fix or at least to (better) document, for instance: > > root@roc-rk3399-pc:~# dmidecode -H 1 > # dmidecode 3.4 > Getting SMBIOS data from sysfs. > SMBIOS 3.0 present. > 7 structures occupying 283 bytes. > Table at 0xEAE7A020. > > Handle 0x0001, DMI type 1, 27 bytes > System Information > Manufacturer: libre-computer > Product Name: roc-rk3399-pc > Version: Not Specified > Serial Number: b03c01a7179278b7 > UUID: 63333062-3130-3761-3137-393237386237 > Wake-up Type: Reserved > SKU Number: Not Specified > Family: Not Specified > > root@roc-rk3399-pc:~# > > This is a hybrid architecture: > > root@roc-rk3399-pc:~# ls -la /sys/devices/*/events/cpu_cycles > -r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a53/events/cpu_cycles > -r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a72/events/cpu_cycles > root@roc-rk3399-pc:~# > > In an intel hybrid system we instead have: > > root@number:~# ls -la /sys/devices/*/events/cpu-cycles > -r--r--r--. 1 root root 4096 May 29 13:59 /sys/devices/cpu_atom/events/cpu-cycles > -r--r--r--. 1 root root 4096 May 29 14:00 /sys/devices/cpu_core/events/cpu-cycles > root@number:~# > > Small difference, a - versus a _, but then both hybrid, efficiency > cores (armv8_cortex_a53 vs cpu_atom) and performance ones > (armv8_cortex_a72 vs cpu_core). > > On the Intel Hybrid system: > > root@number:~# perf record -e cycles -a sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 4.709 MB perf.data (46911 samples) ] > root@number:~# perf evlist > cpu_atom/cycles/ > cpu_core/cycles/ > dummy:u > root@number:~# > > root@number:~# perf evlist -v > cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 > cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 > dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 > root@number:~# > > So it is recording CPU cycles in all the CPUs in the system, performance > and efficiency ones and that gets clear on a per-sample base: > > root@number:~# perf script > perf 2465078 [000] 73716.379947: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms]) > perf 2465078 [001] 73716.379966: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms]) > <SNIP more cpu_core/cycles/ samples> > gnome-shell 2608 [018] 73716.380704: 6721618 cpu_atom/cycles/: ffffffffc0b8419c fw_domains_get_with_fallback+0xfc ([kernel.kallsyms]) > podman 688107 [017] 73716.380706: 6695621 cpu_atom/cycles/: 564fc6110da0 [unknown] (/usr/bin/podman) > podman 687246 [000] 73716.380842: 8844997 cpu_core/cycles/: ffffffffb515150c _raw_spin_lock_irqsave+0xc ([kernel.kallsyms]) > podman 688108 [016] 73716.380913: 6737580 cpu_atom/cycles/: ffffffffb515205c native_queued_spin_lock_slowpath+0x28c ([kernel.kallsyms]) > swapper 0 [004] 73716.380932: 2090132 cpu_core/cycles/: ffffffffb513ad49 poll_idle+0x59 ([kernel.kallsyms]) > <SNIP> > > But on the ARM hybrid system, without Ian's patch, i.e. with what is in > torvalds/master right now (plus some header copies updates I'm working > on that are unrelated): > > root@roc-rk3399-pc:~# perf record -e cycles -a sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.135 MB perf.data (359 samples) ] > root@roc-rk3399-pc:~# perf evlist > cycles > dummy:u > root@roc-rk3399-pc:~# > > It records just one "event" even tho it is recording for all CPUs, both > efficiency and performance: > > root@number:~# perf script > <SNIP> > kworker/2:1-eve 10124 [002] 9687.302790: 60674 cycles: ffffc4c65bdd7380 vmap_small_pages_range_noflush+0x190 ([kernel.kallsyms]) > kworker/2:1-eve 10124 [002] 9687.302957: 66040 cycles: ffffc4c65bdd7438 vmap_small_pages_range_noflush+0x248 ([kernel.kallsyms]) > kworker/2:1-eve 10124 [002] 9687.303139: 71011 cycles: ffffc4c65cde0210 ww_mutex_lock+0x60 ([kernel.kallsyms]) > swapper 0 [002] 9687.303342: 75390 cycles: ffffc4c65bbc31c8 update_blocked_averages+0x188 ([kernel.kallsyms]) > swapper 0 [000] 9687.309276: 45496 cycles: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms]) > <SNIP> > > Everything appears as "cycles" but we're getting samples for all CPUs, > again, performance and efficiency ones, different kinds of processors, > right? > > root@roc-rk3399-pc:~# perf report --stdio --sort cpu > # To display the perf.data header info, please use --header/--header-only options. > # > # Total Lost Samples: 0 > # > # Samples: 359 of event 'cycles' > # Event count (approx.): 23873034 > # > # Overhead CPU > # ........ ... > # > 31.34% 003 > 22.44% 004 > 19.30% 000 > 12.94% 002 > 9.14% 001 > 4.84% 005 > > root@roc-rk3399-pc:~# > > If we try, instead with cpu-cycles: > > root@roc-rk3399-pc:~# perf record -e cpu-cycles -a sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.135 MB perf.data (346 samples) ] > root@roc-rk3399-pc:~# perf evlist > cpu-cycles > dummy:u > root@roc-rk3399-pc:~# > root@roc-rk3399-pc:~# perf evlist -v > cpu-cycles: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 > dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 > root@roc-rk3399-pc:~# > > Both 'cycles' and 'cpu-cycles' end up the same as type: 0 > (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES). > > But if we use something equivalent but with that - replaced with a _ we > get a behaviour that is closer to the Intel one: > > root@roc-rk3399-pc:~# perf record -e cpu_cycles -a sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.137 MB perf.data (390 samples) ] > root@roc-rk3399-pc:~# > root@roc-rk3399-pc:~# perf evlist > armv8_cortex_a53/cpu_cycles/ > armv8_cortex_a72/cpu_cycles/ > dummy:u > root@roc-rk3399-pc:~# > > root@roc-rk3399-pc:~# perf evlist -v > armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 > armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1 > dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 > root@roc-rk3399-pc:~# > > That doesn't mixes up CPU cycles for different CPU types: > > root@roc-rk3399-pc:~# perf script > <SNIP> > perf 16726 [005] 12632.206216: 3798 armv8_cortex_a72/cpu_cycles/: ffffc4c65be618d8 do_vfs_ioctl+0x424 ([kernel.kallsyms]) > swapper 0 [000] 12632.206235: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms]) > perf 16726 [005] 12632.206272: 20279 armv8_cortex_a72/cpu_cycles/: ffffc4c65be113b4 kmem_cache_alloc+0x44 ([kernel.kallsyms]) > sugov:4 166 [004] 12632.206409: 52979 armv8_cortex_a72/cpu_cycles/: ffffc4c65cde5de8 _raw_spin_unlock_irqrestore+0x14 ([kernel.kallsyms]) > perf 16726 [005] 12632.206443: 67123 armv8_cortex_a72/cpu_cycles/: ffffc4c65be26bbc arch_local_irq_restore+0x8 ([kernel.kallsyms]) > perf 16726 [005] 12632.206690: 96987 armv8_cortex_a72/cpu_cycles/: ffffc4c65bdb4a84 fault_in_readable+0xe4 ([kernel.kallsyms]) > perf-exec 16727 [004] 12632.206836: 84199 armv8_cortex_a72/cpu_cycles/: ffffc4c65bd6c3b4 next_uptodate_page+0x264 ([kernel.kallsyms]) > perf 16726 [005] 12632.206950: 102567 armv8_cortex_a72/cpu_cycles/: ffffc4c65bbe2aa4 up_write+0xa4 ([kernel.kallsyms]) > swapper 0 [000] 12632.207030: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms]) > perf-exec 16727 [004] 12632.207037: 79507 armv8_cortex_a72/cpu_cycles/: ffffc4c65c48b89c strnlen_user+0x16c ([kernel.kallsyms]) > <SNIP> > > So from the point of view of the user its important to differentiate > samples for each type of CPU, so grouping everything into the same > basket as ARM did in its big.LITTLE seems strange/"wrong". > > The way that in Intel when it "does the right thing" (I think no quotes > are needed here, but I may be missing something) and at the tool level > translates the special event name "cycles" into what the Intel PMU > kernel drivers advertises to the world via sysfs as > /sys/devices/cpu_{atom,core}/events/cpu-cycles (with that -) and ARM > advertises as /sys/devices/armv8_cortex_a{53,72}/events/cpu_cycles (note > the _) but gets translated in terms of 'struct perf_event_attr' as > > Intel: > > cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000 > cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000 Here there is a legacy encoding of config 0 meaning PERF_COUNT_HW_CPU_CYCLES the high bits are (from perf_event.h): ``` * PERF_TYPE_HARDWARE: 0xEEEEEEEE000000AA * AA: hardware event ID * EEEEEEEE: PMU type ID ``` Intel has a cpu-cycles sysfs event and depending on whether you do or don't have the reverted patch the priority of that over a legacy event is going to vary. > Versus ARM as: > > armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles) > armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles) Here cpu_cycles isn't a legacy event which use hyphens and not underscores: https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l#n335 The two sysfs events were programmed. Were another PMU advertising cpu_cycles we'd also try to program that. > can possibly be made more consistent in a way that doesn't break any > user experience using the perf command line. > > I would propose that 'cycles' explicitely asked or as the default, > translates into armv8_cortex_a{53,72}/cpu_cycles/ on ARM and on > /sys/devices/cpu_{atom,core}/events/cpu-cycles on Intel, and that > whatever other architecture that comes to this party tries to learn from > this botched experience. I don't understand this paragraph. My expectation is that sysfs and json are priority over legacy, we transitioned to this to fix ARM, this does mean reverting the revert. We use legacy for core PMUs when the sysfs/json lookup fails and there is a legacy name. We may or may not have a notion of special events where special events are only programmed on core PMUs. My preference is no notion, as what flavor of cycles, .. go in it is just madness. Linus disagrees and won't talk about it to clear up the ambiguity. If we do have special events then can we keep scanning all PMUs by having test logic that complains when a PMU uses a special name for one of their events or must we put special case logic in the event parser? We can fix the arm_dsu bug by renaming cycles there. If that's too hard to land, clearing up ambiguity by adding a PMU name has always been the way to do this. My preference for v6.10 is revert the revert, then add either a rename of the arm_dsu event and/or the change here. We can make perf record tolerant and ignore opening events on PMUs that don't support sampling, but I think it is too big a thing to do for v6.10. I don't know what state perf test is in, and I expect the answer varies greatly between perf-next and linus/master. There is stuff queued up to go in perf-next further building on the expectations of the reverted patch. Could someone at ARM please set up a build/test server on linux-next for whatever machine is Linus' flavor of the day so that we don't see this happen in a PR again. I build test ARM on a Raspberry Pi and the BIG.little devices are similarly at the low end of the price/performance spectrum. In all likelihood they all lack precise samples, Linus' command lines show he cares about this. Thanks, Ian ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-29 19:25 ` Ian Rogers @ 2024-05-30 5:35 ` Namhyung Kim 2024-05-30 12:48 ` James Clark 0 siblings, 1 reply; 34+ messages in thread From: Namhyung Kim @ 2024-05-30 5:35 UTC (permalink / raw) To: Ian Rogers Cc: Arnaldo Carvalho de Melo, James Clark, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, May 29, 2024 at 12:25 PM Ian Rogers <irogers@google.com> wrote: > We can fix the arm_dsu bug by renaming cycles there. If that's too > hard to land, clearing up ambiguity by adding a PMU name has always > been the way to do this. My preference for v6.10 is revert the revert, > then add either a rename of the arm_dsu event and/or the change here. > > We can make perf record tolerant and ignore opening events on PMUs > that don't support sampling, but I think it is too big a thing to do > for v6.10. How about adding a flag to parse_event_option_args so that we can check if it's for sampling events. And then we might skip uncore PMUs easily (assuming arm_dsu PMU is uncore). It might not be a perfect solution but it could be a simple one. Ideally I think it'd be nice if the kernel exports more information about the PMUs like sampling and exclude capabilities. Thanks, Namhyung ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-30 5:35 ` Namhyung Kim @ 2024-05-30 12:48 ` James Clark 2024-05-30 13:46 ` Ian Rogers 0 siblings, 1 reply; 34+ messages in thread From: James Clark @ 2024-05-30 12:48 UTC (permalink / raw) To: Namhyung Kim, Ian Rogers Cc: Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On 30/05/2024 06:35, Namhyung Kim wrote: > On Wed, May 29, 2024 at 12:25 PM Ian Rogers <irogers@google.com> wrote: >> We can fix the arm_dsu bug by renaming cycles there. If that's too >> hard to land, clearing up ambiguity by adding a PMU name has always >> been the way to do this. My preference for v6.10 is revert the revert, >> then add either a rename of the arm_dsu event and/or the change here. >> >> We can make perf record tolerant and ignore opening events on PMUs >> that don't support sampling, but I think it is too big a thing to do >> for v6.10. > > How about adding a flag to parse_event_option_args so that we > can check if it's for sampling events. And then we might skip > uncore PMUs easily (assuming arm_dsu PMU is uncore). It's uncore yes. Couldn't we theoretically have a core PMU that still doesn't support sampling though? And then we'd end up in the same situation. Attempting to open the event is the only sure way of knowing, rather than trying to guess with some heuristic in userspace? Maybe a bit too hypothetical but still worth considering. > > It might not be a perfect solution but it could be a simple one. > Ideally I think it'd be nice if the kernel exports more information > about the PMUs like sampling and exclude capabilities. > > Thanks, > Namhyung That seems like a much better suggestion. Especially with the ever expanding retry/fallback mechanism that can never really take into account every combination of event attributes that can fail. James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-30 12:48 ` James Clark @ 2024-05-30 13:46 ` Ian Rogers 2024-05-30 22:51 ` Namhyung Kim 0 siblings, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-05-30 13:46 UTC (permalink / raw) To: James Clark Cc: Namhyung Kim, Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > > On 30/05/2024 06:35, Namhyung Kim wrote: > > On Wed, May 29, 2024 at 12:25 PM Ian Rogers <irogers@google.com> wrote: > >> We can fix the arm_dsu bug by renaming cycles there. If that's too > >> hard to land, clearing up ambiguity by adding a PMU name has always > >> been the way to do this. My preference for v6.10 is revert the revert, > >> then add either a rename of the arm_dsu event and/or the change here. > >> > >> We can make perf record tolerant and ignore opening events on PMUs > >> that don't support sampling, but I think it is too big a thing to do > >> for v6.10. > > > > How about adding a flag to parse_event_option_args so that we > > can check if it's for sampling events. And then we might skip > > uncore PMUs easily (assuming arm_dsu PMU is uncore). > > It's uncore yes. > > Couldn't we theoretically have a core PMU that still doesn't support > sampling though? And then we'd end up in the same situation. Attempting > to open the event is the only sure way of knowing, rather than trying to > guess with some heuristic in userspace? > > Maybe a bit too hypothetical but still worth considering. > > > > > It might not be a perfect solution but it could be a simple one. > > Ideally I think it'd be nice if the kernel exports more information > > about the PMUs like sampling and exclude capabilities. > > > Thanks, > > Namhyung > > That seems like a much better suggestion. Especially with the ever > expanding retry/fallback mechanism that can never really take into > account every combination of event attributes that can fail. I think this approach can work but we may break PMUs. Rather than use `is_core` on `struct pmu` we could have say a `supports_sampling` and we pass to parse_events an option to exclude any PMU that doesn't have that flag. Now obviously more than just core PMUs support sampling. All software PMUs, tracepoints, probes. We have an imprecise list of these in perf_pmu__is_software. So we can set supports_sampling for perf_pmu__is_software and is_core. I think the problem comes for things like the AMD IBS PMUs, intel_bts and intel_pt. Often these only support sampling but aren't core. There may be IBM S390 PMUs or other vendor PMUs that are similar. If we can make a list of all these PMU names then we can use that to set supports_sampling and not break event parsing for these PMUs. The name list sounds somewhat impractical, let's say we lazily compute the supports_sampling on a PMU. We need the sampling equivalent of is_event_supported: https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 is_event_supported has had bugs, look at the exclude_guest workaround for Apple PMUs. It also isn't clear to me how we choose the event config that we're going to probe to determine whether sampling works. The perf_event_open may reject the test because of a bad config and not because sampling isn't supported. So I think we can make the approach work if we had either: 1) a list of PMUs that support sampling, 2) a reliable "is_sampling_supported" test. I'm not sure of the advantages of doing (2) rather than just creating the set of evsels and ignoring those that fail to open. Ignoring evsels that fail to open seems more unlikely to break anything as the user is giving the events/config values for the PMUs they care about. Thanks, Ian ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-30 13:46 ` Ian Rogers @ 2024-05-30 22:51 ` Namhyung Kim 2024-06-05 20:29 ` Namhyung Kim 0 siblings, 1 reply; 34+ messages in thread From: Namhyung Kim @ 2024-05-30 22:51 UTC (permalink / raw) To: Ian Rogers Cc: James Clark, Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: > On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > > > > On 30/05/2024 06:35, Namhyung Kim wrote: > > > On Wed, May 29, 2024 at 12:25 PM Ian Rogers <irogers@google.com> wrote: > > >> We can fix the arm_dsu bug by renaming cycles there. If that's too > > >> hard to land, clearing up ambiguity by adding a PMU name has always > > >> been the way to do this. My preference for v6.10 is revert the revert, > > >> then add either a rename of the arm_dsu event and/or the change here. > > >> > > >> We can make perf record tolerant and ignore opening events on PMUs > > >> that don't support sampling, but I think it is too big a thing to do > > >> for v6.10. > > > > > > How about adding a flag to parse_event_option_args so that we > > > can check if it's for sampling events. And then we might skip > > > uncore PMUs easily (assuming arm_dsu PMU is uncore). > > > > It's uncore yes. > > > > Couldn't we theoretically have a core PMU that still doesn't support > > sampling though? And then we'd end up in the same situation. Attempting > > to open the event is the only sure way of knowing, rather than trying to > > guess with some heuristic in userspace? > > > > Maybe a bit too hypothetical but still worth considering. Then I think it's a real problem and perf should report it like we do now. > > > > > > > > It might not be a perfect solution but it could be a simple one. > > > Ideally I think it'd be nice if the kernel exports more information > > > about the PMUs like sampling and exclude capabilities. > > > > Thanks, > > > Namhyung > > > > That seems like a much better suggestion. Especially with the ever > > expanding retry/fallback mechanism that can never really take into > > account every combination of event attributes that can fail. > > I think this approach can work but we may break PMUs. > > Rather than use `is_core` on `struct pmu` we could have say a > `supports_sampling` and we pass to parse_events an option to exclude > any PMU that doesn't have that flag. Now obviously more than just core > PMUs support sampling. All software PMUs, tracepoints, probes. We have > an imprecise list of these in perf_pmu__is_software. So we can set > supports_sampling for perf_pmu__is_software and is_core. Yep, we can do that if the kernel provides the info. But before that I think it's practical to skip uncore PMUs and hope other PMUs don't have event aliases clashing with the legacy names. :) > > I think the problem comes for things like the AMD IBS PMUs, intel_bts > and intel_pt. Often these only support sampling but aren't core. There > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can > make a list of all these PMU names then we can use that to set > supports_sampling and not break event parsing for these PMUs. > > The name list sounds somewhat impractical, let's say we lazily compute > the supports_sampling on a PMU. We need the sampling equivalent of > is_event_supported: > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 > is_event_supported has had bugs, look at the exclude_guest workaround > for Apple PMUs. It also isn't clear to me how we choose the event > config that we're going to probe to determine whether sampling works. > The perf_event_open may reject the test because of a bad config and > not because sampling isn't supported. > > So I think we can make the approach work if we had either: > 1) a list of PMUs that support sampling, > 2) a reliable "is_sampling_supported" test. > > I'm not sure of the advantages of doing (2) rather than just creating > the set of evsels and ignoring those that fail to open. Ignoring > evsels that fail to open seems more unlikely to break anything as the > user is giving the events/config values for the PMUs they care about. Yep, that's also possible. I'm ok if you want to go that direction. Thanks, Namhyung ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-30 22:51 ` Namhyung Kim @ 2024-06-05 20:29 ` Namhyung Kim 2024-06-05 23:02 ` Ian Rogers 2024-06-06 13:47 ` Arnaldo Carvalho de Melo 0 siblings, 2 replies; 34+ messages in thread From: Namhyung Kim @ 2024-06-05 20:29 UTC (permalink / raw) To: Ian Rogers Cc: James Clark, Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <namhyung@kernel.org> wrote: > > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: > > On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > > > > > > On 30/05/2024 06:35, Namhyung Kim wrote: > > > > It might not be a perfect solution but it could be a simple one. > > > > Ideally I think it'd be nice if the kernel exports more information > > > > about the PMUs like sampling and exclude capabilities. > > > > > Thanks, > > > > Namhyung > > > > > > That seems like a much better suggestion. Especially with the ever > > > expanding retry/fallback mechanism that can never really take into > > > account every combination of event attributes that can fail. > > > > I think this approach can work but we may break PMUs. > > > > Rather than use `is_core` on `struct pmu` we could have say a > > `supports_sampling` and we pass to parse_events an option to exclude > > any PMU that doesn't have that flag. Now obviously more than just core > > PMUs support sampling. All software PMUs, tracepoints, probes. We have > > an imprecise list of these in perf_pmu__is_software. So we can set > > supports_sampling for perf_pmu__is_software and is_core. > > Yep, we can do that if the kernel provides the info. But before that > I think it's practical to skip uncore PMUs and hope other PMUs don't > have event aliases clashing with the legacy names. :) > > > > > I think the problem comes for things like the AMD IBS PMUs, intel_bts > > and intel_pt. Often these only support sampling but aren't core. There > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can > > make a list of all these PMU names then we can use that to set > > supports_sampling and not break event parsing for these PMUs. > > > > The name list sounds somewhat impractical, let's say we lazily compute > > the supports_sampling on a PMU. We need the sampling equivalent of > > is_event_supported: > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 > > is_event_supported has had bugs, look at the exclude_guest workaround > > for Apple PMUs. It also isn't clear to me how we choose the event > > config that we're going to probe to determine whether sampling works. > > The perf_event_open may reject the test because of a bad config and > > not because sampling isn't supported. > > > > So I think we can make the approach work if we had either: > > 1) a list of PMUs that support sampling, > > 2) a reliable "is_sampling_supported" test. > > > > I'm not sure of the advantages of doing (2) rather than just creating > > the set of evsels and ignoring those that fail to open. Ignoring > > evsels that fail to open seems more unlikely to break anything as the > > user is giving the events/config values for the PMUs they care about. > > Yep, that's also possible. I'm ok if you want to go that direction. Hmm.. I thought about this again. But it can be a problem if we ignore any failures as it can be a real error due to other reason - e.g. not supported configuration or other user mistakes. Thanks, Namhyung ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-06-05 20:29 ` Namhyung Kim @ 2024-06-05 23:02 ` Ian Rogers 2024-06-06 7:09 ` Namhyung Kim 2024-06-06 13:47 ` Arnaldo Carvalho de Melo 1 sibling, 1 reply; 34+ messages in thread From: Ian Rogers @ 2024-06-05 23:02 UTC (permalink / raw) To: Namhyung Kim Cc: James Clark, Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <namhyung@kernel.org> wrote: > > On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <namhyung@kernel.org> wrote: > > > > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: > > > On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > > > > > > > > On 30/05/2024 06:35, Namhyung Kim wrote: > > > > > It might not be a perfect solution but it could be a simple one. > > > > > Ideally I think it'd be nice if the kernel exports more information > > > > > about the PMUs like sampling and exclude capabilities. > > > > > > Thanks, > > > > > Namhyung > > > > > > > > That seems like a much better suggestion. Especially with the ever > > > > expanding retry/fallback mechanism that can never really take into > > > > account every combination of event attributes that can fail. > > > > > > I think this approach can work but we may break PMUs. > > > > > > Rather than use `is_core` on `struct pmu` we could have say a > > > `supports_sampling` and we pass to parse_events an option to exclude > > > any PMU that doesn't have that flag. Now obviously more than just core > > > PMUs support sampling. All software PMUs, tracepoints, probes. We have > > > an imprecise list of these in perf_pmu__is_software. So we can set > > > supports_sampling for perf_pmu__is_software and is_core. > > > > Yep, we can do that if the kernel provides the info. But before that > > I think it's practical to skip uncore PMUs and hope other PMUs don't > > have event aliases clashing with the legacy names. :) > > > > > > > > I think the problem comes for things like the AMD IBS PMUs, intel_bts > > > and intel_pt. Often these only support sampling but aren't core. There > > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can > > > make a list of all these PMU names then we can use that to set > > > supports_sampling and not break event parsing for these PMUs. > > > > > > The name list sounds somewhat impractical, let's say we lazily compute > > > the supports_sampling on a PMU. We need the sampling equivalent of > > > is_event_supported: > > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 > > > is_event_supported has had bugs, look at the exclude_guest workaround > > > for Apple PMUs. It also isn't clear to me how we choose the event > > > config that we're going to probe to determine whether sampling works. > > > The perf_event_open may reject the test because of a bad config and > > > not because sampling isn't supported. > > > > > > So I think we can make the approach work if we had either: > > > 1) a list of PMUs that support sampling, > > > 2) a reliable "is_sampling_supported" test. > > > > > > I'm not sure of the advantages of doing (2) rather than just creating > > > the set of evsels and ignoring those that fail to open. Ignoring > > > evsels that fail to open seems more unlikely to break anything as the > > > user is giving the events/config values for the PMUs they care about. > > > > Yep, that's also possible. I'm ok if you want to go that direction. > > Hmm.. I thought about this again. But it can be a problem if we ignore > any failures as it can be a real error due to other reason - e.g. not > supported configuration or other user mistakes. Right, we have two not good choices: 1) Try to detect whether sampling is supported, but any test doing this needs to guess at a configuration and we'll need to deflake this on off platforms like those that don't allow things like exclude guest. 2) Ignore failures, possibly hiding user errors. I would prefer for (2) the errors were pr_err rather than pr_debug, something the user can clean up by getting rid of warned about PMUs. This will avoid hiding the error, but then on Neoverse cycles will warn about the arm_dsu PMU's cycles event for exactly Linus' test case. My understanding is that this is deemed a regression, hence Arnaldo proposing pr_debug to hide it. Thanks, Ian ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-06-05 23:02 ` Ian Rogers @ 2024-06-06 7:09 ` Namhyung Kim 2024-06-06 9:42 ` James Clark 0 siblings, 1 reply; 34+ messages in thread From: Namhyung Kim @ 2024-06-06 7:09 UTC (permalink / raw) To: Ian Rogers Cc: James Clark, Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <irogers@google.com> wrote: > > On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <namhyung@kernel.org> wrote: > > > > On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <namhyung@kernel.org> wrote: > > > > > > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: > > > > On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > > > > > > > > > > On 30/05/2024 06:35, Namhyung Kim wrote: > > > > > > It might not be a perfect solution but it could be a simple one. > > > > > > Ideally I think it'd be nice if the kernel exports more information > > > > > > about the PMUs like sampling and exclude capabilities. > > > > > > > Thanks, > > > > > > Namhyung > > > > > > > > > > That seems like a much better suggestion. Especially with the ever > > > > > expanding retry/fallback mechanism that can never really take into > > > > > account every combination of event attributes that can fail. > > > > > > > > I think this approach can work but we may break PMUs. > > > > > > > > Rather than use `is_core` on `struct pmu` we could have say a > > > > `supports_sampling` and we pass to parse_events an option to exclude > > > > any PMU that doesn't have that flag. Now obviously more than just core > > > > PMUs support sampling. All software PMUs, tracepoints, probes. We have > > > > an imprecise list of these in perf_pmu__is_software. So we can set > > > > supports_sampling for perf_pmu__is_software and is_core. > > > > > > Yep, we can do that if the kernel provides the info. But before that > > > I think it's practical to skip uncore PMUs and hope other PMUs don't > > > have event aliases clashing with the legacy names. :) > > > > > > > > > > > I think the problem comes for things like the AMD IBS PMUs, intel_bts > > > > and intel_pt. Often these only support sampling but aren't core. There > > > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can > > > > make a list of all these PMU names then we can use that to set > > > > supports_sampling and not break event parsing for these PMUs. > > > > > > > > The name list sounds somewhat impractical, let's say we lazily compute > > > > the supports_sampling on a PMU. We need the sampling equivalent of > > > > is_event_supported: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 > > > > is_event_supported has had bugs, look at the exclude_guest workaround > > > > for Apple PMUs. It also isn't clear to me how we choose the event > > > > config that we're going to probe to determine whether sampling works. > > > > The perf_event_open may reject the test because of a bad config and > > > > not because sampling isn't supported. > > > > > > > > So I think we can make the approach work if we had either: > > > > 1) a list of PMUs that support sampling, > > > > 2) a reliable "is_sampling_supported" test. > > > > > > > > I'm not sure of the advantages of doing (2) rather than just creating > > > > the set of evsels and ignoring those that fail to open. Ignoring > > > > evsels that fail to open seems more unlikely to break anything as the > > > > user is giving the events/config values for the PMUs they care about. > > > > > > Yep, that's also possible. I'm ok if you want to go that direction. > > > > Hmm.. I thought about this again. But it can be a problem if we ignore > > any failures as it can be a real error due to other reason - e.g. not > > supported configuration or other user mistakes. > > Right, we have two not good choices: > > 1) Try to detect whether sampling is supported, but any test doing > this needs to guess at a configuration and we'll need to deflake this > on off platforms like those that don't allow things like exclude > guest. I believe we don't need to try so hard to detect if sampling is supported or not. I hope we will eventually add that to the kernel. Also this is just an additional defense line, it should work without it in most cases. It'll just protect from a few edge cases like uncore PMUs having events of legacy name. For other events or PMUs, I think it's ok to fail. > 2) Ignore failures, possibly hiding user errors. > > I would prefer for (2) the errors were pr_err rather than pr_debug, > something the user can clean up by getting rid of warned about PMUs. > This will avoid hiding the error, but then on Neoverse cycles will > warn about the arm_dsu PMU's cycles event for exactly Linus' test > case. My understanding is that this is deemed a regression, hence > Arnaldo proposing pr_debug to hide it. Right, if we use pr_err() then users will complain. If we use pr_debug() then errors will be hidden silently. Thanks, Namhyung ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-06-06 7:09 ` Namhyung Kim @ 2024-06-06 9:42 ` James Clark 2024-06-06 13:51 ` Arnaldo Carvalho de Melo 2024-06-07 6:10 ` Namhyung Kim 0 siblings, 2 replies; 34+ messages in thread From: James Clark @ 2024-06-06 9:42 UTC (permalink / raw) To: Namhyung Kim, Ian Rogers Cc: Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On 06/06/2024 08:09, Namhyung Kim wrote: > On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <irogers@google.com> wrote: >> >> On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <namhyung@kernel.org> wrote: >>> >>> On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <namhyung@kernel.org> wrote: >>>> >>>> On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: >>>>> On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: >>>>>> >>>>>> On 30/05/2024 06:35, Namhyung Kim wrote: >>>>>>> It might not be a perfect solution but it could be a simple one. >>>>>>> Ideally I think it'd be nice if the kernel exports more information >>>>>>> about the PMUs like sampling and exclude capabilities. >>>>>>>> Thanks, >>>>>>> Namhyung >>>>>> >>>>>> That seems like a much better suggestion. Especially with the ever >>>>>> expanding retry/fallback mechanism that can never really take into >>>>>> account every combination of event attributes that can fail. >>>>> >>>>> I think this approach can work but we may break PMUs. >>>>> >>>>> Rather than use `is_core` on `struct pmu` we could have say a >>>>> `supports_sampling` and we pass to parse_events an option to exclude >>>>> any PMU that doesn't have that flag. Now obviously more than just core >>>>> PMUs support sampling. All software PMUs, tracepoints, probes. We have >>>>> an imprecise list of these in perf_pmu__is_software. So we can set >>>>> supports_sampling for perf_pmu__is_software and is_core. >>>> >>>> Yep, we can do that if the kernel provides the info. But before that >>>> I think it's practical to skip uncore PMUs and hope other PMUs don't >>>> have event aliases clashing with the legacy names. :) >>>> >>>>> >>>>> I think the problem comes for things like the AMD IBS PMUs, intel_bts >>>>> and intel_pt. Often these only support sampling but aren't core. There >>>>> may be IBM S390 PMUs or other vendor PMUs that are similar. If we can >>>>> make a list of all these PMU names then we can use that to set >>>>> supports_sampling and not break event parsing for these PMUs. >>>>> >>>>> The name list sounds somewhat impractical, let's say we lazily compute >>>>> the supports_sampling on a PMU. We need the sampling equivalent of >>>>> is_event_supported: >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 >>>>> is_event_supported has had bugs, look at the exclude_guest workaround >>>>> for Apple PMUs. It also isn't clear to me how we choose the event >>>>> config that we're going to probe to determine whether sampling works. >>>>> The perf_event_open may reject the test because of a bad config and >>>>> not because sampling isn't supported. >>>>> >>>>> So I think we can make the approach work if we had either: >>>>> 1) a list of PMUs that support sampling, >>>>> 2) a reliable "is_sampling_supported" test. >>>>> >>>>> I'm not sure of the advantages of doing (2) rather than just creating >>>>> the set of evsels and ignoring those that fail to open. Ignoring >>>>> evsels that fail to open seems more unlikely to break anything as the >>>>> user is giving the events/config values for the PMUs they care about. >>>> >>>> Yep, that's also possible. I'm ok if you want to go that direction. >>> >>> Hmm.. I thought about this again. But it can be a problem if we ignore >>> any failures as it can be a real error due to other reason - e.g. not >>> supported configuration or other user mistakes. >> >> Right, we have two not good choices: >> >> 1) Try to detect whether sampling is supported, but any test doing >> this needs to guess at a configuration and we'll need to deflake this >> on off platforms like those that don't allow things like exclude >> guest. > > I believe we don't need to try so hard to detect if sampling is > supported or not. I hope we will eventually add that to the > kernel. Also this is just an additional defense line, it should > work without it in most cases. It'll just protect from a few edge > cases like uncore PMUs having events of legacy name. For > other events or PMUs, I think it's ok to fail. > > >> 2) Ignore failures, possibly hiding user errors. >> >> I would prefer for (2) the errors were pr_err rather than pr_debug, >> something the user can clean up by getting rid of warned about PMUs. >> This will avoid hiding the error, but then on Neoverse cycles will >> warn about the arm_dsu PMU's cycles event for exactly Linus' test >> case. My understanding is that this is deemed a regression, hence >> Arnaldo proposing pr_debug to hide it. > > Right, if we use pr_err() then users will complain. If we use > pr_debug() then errors will be hidden silently. > > Thanks, > Namhyung I'm not sure if anyone would really complain about warnings for attempting to open but not succeeding, as long as the event that they really wanted is there. I'm imagining output like this: $ perf record -e cycles -- ls Warning: skipped arm_dsu/cycles/ event(s), recording on armv8_pmuv3_0/cycles/, armv8_pmuv3_1/cycles/ [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.008 MB perf.data (30 samples) ] You only really need to worry when no events can be opened, but presumably that was a warning anyway. And in stat mode I wouldn't expect any warnings. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-06-06 9:42 ` James Clark @ 2024-06-06 13:51 ` Arnaldo Carvalho de Melo 2024-06-07 6:10 ` Namhyung Kim 1 sibling, 0 replies; 34+ messages in thread From: Arnaldo Carvalho de Melo @ 2024-06-06 13:51 UTC (permalink / raw) To: James Clark Cc: Namhyung Kim, Ian Rogers, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Thu, Jun 06, 2024 at 10:42:33AM +0100, James Clark wrote: > On 06/06/2024 08:09, Namhyung Kim wrote: > > On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <irogers@google.com> wrote: > >> 2) Ignore failures, possibly hiding user errors. > >> I would prefer for (2) the errors were pr_err rather than pr_debug, > >> something the user can clean up by getting rid of warned about PMUs. > >> This will avoid hiding the error, but then on Neoverse cycles will > >> warn about the arm_dsu PMU's cycles event for exactly Linus' test > >> case. My understanding is that this is deemed a regression, hence > >> Arnaldo proposing pr_debug to hide it. > > Right, if we use pr_err() then users will complain. If we use > > pr_debug() then errors will be hidden silently. > I'm not sure if anyone would really complain about warnings for > attempting to open but not succeeding, as long as the event that they > really wanted is there. I'm imagining output like this: > $ perf record -e cycles -- ls > Warning: skipped arm_dsu/cycles/ event(s), recording on > armv8_pmuv3_0/cycles/, armv8_pmuv3_1/cycles/ > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.008 MB perf.data (30 samples) ] > You only really need to worry when no events can be opened, but > presumably that was a warning anyway. Agreed, while we don't find a way, old or new to autoritatively skip the event, when that pr_warning() gets turned into a pr_debug() so that people expecting that those skipped events were included get a message telling why they were not. > And in stat mode I wouldn't expect any warnings. Right. - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-06-06 9:42 ` James Clark 2024-06-06 13:51 ` Arnaldo Carvalho de Melo @ 2024-06-07 6:10 ` Namhyung Kim 1 sibling, 0 replies; 34+ messages in thread From: Namhyung Kim @ 2024-06-07 6:10 UTC (permalink / raw) To: James Clark Cc: Ian Rogers, Arnaldo Carvalho de Melo, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel Hello, On Thu, Jun 06, 2024 at 10:42:33AM +0100, James Clark wrote: > > > On 06/06/2024 08:09, Namhyung Kim wrote: > > On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <irogers@google.com> wrote: > >> > >> On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <namhyung@kernel.org> wrote: > >>> > >>> On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <namhyung@kernel.org> wrote: > >>>> > >>>> On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: > >>>>> On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > >>>>>> > >>>>>> On 30/05/2024 06:35, Namhyung Kim wrote: > >>>>>>> It might not be a perfect solution but it could be a simple one. > >>>>>>> Ideally I think it'd be nice if the kernel exports more information > >>>>>>> about the PMUs like sampling and exclude capabilities. > >>>>>>>> Thanks, > >>>>>>> Namhyung > >>>>>> > >>>>>> That seems like a much better suggestion. Especially with the ever > >>>>>> expanding retry/fallback mechanism that can never really take into > >>>>>> account every combination of event attributes that can fail. > >>>>> > >>>>> I think this approach can work but we may break PMUs. > >>>>> > >>>>> Rather than use `is_core` on `struct pmu` we could have say a > >>>>> `supports_sampling` and we pass to parse_events an option to exclude > >>>>> any PMU that doesn't have that flag. Now obviously more than just core > >>>>> PMUs support sampling. All software PMUs, tracepoints, probes. We have > >>>>> an imprecise list of these in perf_pmu__is_software. So we can set > >>>>> supports_sampling for perf_pmu__is_software and is_core. > >>>> > >>>> Yep, we can do that if the kernel provides the info. But before that > >>>> I think it's practical to skip uncore PMUs and hope other PMUs don't > >>>> have event aliases clashing with the legacy names. :) > >>>> > >>>>> > >>>>> I think the problem comes for things like the AMD IBS PMUs, intel_bts > >>>>> and intel_pt. Often these only support sampling but aren't core. There > >>>>> may be IBM S390 PMUs or other vendor PMUs that are similar. If we can > >>>>> make a list of all these PMU names then we can use that to set > >>>>> supports_sampling and not break event parsing for these PMUs. > >>>>> > >>>>> The name list sounds somewhat impractical, let's say we lazily compute > >>>>> the supports_sampling on a PMU. We need the sampling equivalent of > >>>>> is_event_supported: > >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 > >>>>> is_event_supported has had bugs, look at the exclude_guest workaround > >>>>> for Apple PMUs. It also isn't clear to me how we choose the event > >>>>> config that we're going to probe to determine whether sampling works. > >>>>> The perf_event_open may reject the test because of a bad config and > >>>>> not because sampling isn't supported. > >>>>> > >>>>> So I think we can make the approach work if we had either: > >>>>> 1) a list of PMUs that support sampling, > >>>>> 2) a reliable "is_sampling_supported" test. > >>>>> > >>>>> I'm not sure of the advantages of doing (2) rather than just creating > >>>>> the set of evsels and ignoring those that fail to open. Ignoring > >>>>> evsels that fail to open seems more unlikely to break anything as the > >>>>> user is giving the events/config values for the PMUs they care about. > >>>> > >>>> Yep, that's also possible. I'm ok if you want to go that direction. > >>> > >>> Hmm.. I thought about this again. But it can be a problem if we ignore > >>> any failures as it can be a real error due to other reason - e.g. not > >>> supported configuration or other user mistakes. > >> > >> Right, we have two not good choices: > >> > >> 1) Try to detect whether sampling is supported, but any test doing > >> this needs to guess at a configuration and we'll need to deflake this > >> on off platforms like those that don't allow things like exclude > >> guest. > > > > I believe we don't need to try so hard to detect if sampling is > > supported or not. I hope we will eventually add that to the > > kernel. Also this is just an additional defense line, it should > > work without it in most cases. It'll just protect from a few edge > > cases like uncore PMUs having events of legacy name. For > > other events or PMUs, I think it's ok to fail. > > > > > >> 2) Ignore failures, possibly hiding user errors. > >> > >> I would prefer for (2) the errors were pr_err rather than pr_debug, > >> something the user can clean up by getting rid of warned about PMUs. > >> This will avoid hiding the error, but then on Neoverse cycles will > >> warn about the arm_dsu PMU's cycles event for exactly Linus' test > >> case. My understanding is that this is deemed a regression, hence > >> Arnaldo proposing pr_debug to hide it. > > > > Right, if we use pr_err() then users will complain. If we use > > pr_debug() then errors will be hidden silently. > > > > Thanks, > > Namhyung > > I'm not sure if anyone would really complain about warnings for > attempting to open but not succeeding, as long as the event that they > really wanted is there. I'm imagining output like this: > > $ perf record -e cycles -- ls > > Warning: skipped arm_dsu/cycles/ event(s), recording on > armv8_pmuv3_0/cycles/, armv8_pmuv3_1/cycles/ This looks good, but I guess arm_dsu (or others maybe..) has multiple instances like arm_dsu_0, arm_dsu_1, and so on. Then it should merge the similar PMUs and print once. Same thing for armv8_pmuv3. But I think it's better to skip the events if we know the PMU doesn't support sampling for sure. > > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.008 MB perf.data (30 samples) ] > > You only really need to worry when no events can be opened, but > presumably that was a warning anyway. Right, this is a problem but I'm not sure it handles the case specifically as it just reported warning on any failures first. Thanks, Namhyung > > And in stat mode I wouldn't expect any warnings. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-06-05 20:29 ` Namhyung Kim 2024-06-05 23:02 ` Ian Rogers @ 2024-06-06 13:47 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 34+ messages in thread From: Arnaldo Carvalho de Melo @ 2024-06-06 13:47 UTC (permalink / raw) To: Namhyung Kim Cc: Ian Rogers, James Clark, Leo Yan, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, Dominique Martinet, linux-perf-users, linux-kernel On Wed, Jun 05, 2024 at 01:29:02PM -0700, Namhyung Kim wrote: > On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <namhyung@kernel.org> wrote: > > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote: > > > On Thu, May 30, 2024 at 5:48 AM James Clark <james.clark@arm.com> wrote: > > > > On 30/05/2024 06:35, Namhyung Kim wrote: > > > > > It might not be a perfect solution but it could be a simple one. > > > > > Ideally I think it'd be nice if the kernel exports more information > > > > > about the PMUs like sampling and exclude capabilities. > > > > That seems like a much better suggestion. Especially with the ever > > > > expanding retry/fallback mechanism that can never really take into > > > > account every combination of event attributes that can fail. > > > I think this approach can work but we may break PMUs. > > > Rather than use `is_core` on `struct pmu` we could have say a > > > `supports_sampling` and we pass to parse_events an option to exclude > > > any PMU that doesn't have that flag. Now obviously more than just core > > > PMUs support sampling. All software PMUs, tracepoints, probes. We have > > > an imprecise list of these in perf_pmu__is_software. So we can set > > > supports_sampling for perf_pmu__is_software and is_core. > > Yep, we can do that if the kernel provides the info. But before that > > I think it's practical to skip uncore PMUs and hope other PMUs don't > > have event aliases clashing with the legacy names. :) > > > I think the problem comes for things like the AMD IBS PMUs, intel_bts > > > and intel_pt. Often these only support sampling but aren't core. There > > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can > > > make a list of all these PMU names then we can use that to set > > > supports_sampling and not break event parsing for these PMUs. > > > The name list sounds somewhat impractical, let's say we lazily compute > > > the supports_sampling on a PMU. We need the sampling equivalent of > > > is_event_supported: > > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242 > > > is_event_supported has had bugs, look at the exclude_guest workaround > > > for Apple PMUs. It also isn't clear to me how we choose the event > > > config that we're going to probe to determine whether sampling works. > > > The perf_event_open may reject the test because of a bad config and > > > not because sampling isn't supported. > > > So I think we can make the approach work if we had either: > > > 1) a list of PMUs that support sampling, > > > 2) a reliable "is_sampling_supported" test. > > > I'm not sure of the advantages of doing (2) rather than just creating > > > the set of evsels and ignoring those that fail to open. Ignoring > > > evsels that fail to open seems more unlikely to break anything as the > > > user is giving the events/config values for the PMUs they care about. > > Yep, that's also possible. I'm ok if you want to go that direction. > Hmm.. I thought about this again. But it can be a problem if we ignore > any failures as it can be a real error due to other reason - e.g. not > supported configuration or other user mistakes. Well, we should not ignore any failures, just look at evsel__open_strerror(), we get some error from the kernel, we know we're doing something, we put it into context, and then we try to provide the most helpful message to the user. The question here is how to figure out if what we want to do (sample) makes sense for this event that has the name we used, since we're using a "wildcard", a "well known event name" that should be translated to events in all PMUs where it "matches", if those actual per-PMU events can provide what the user is asking. My suggestion about using pr_debug() is a compromise, one that will not show warnings all the time for such a common case (cycles) while allowing the use to follow a common, in tools/perf, way of diagnosing when things don't look sane, which is to ask for verbose mode. The best thing would be for us to be able to query, using existing facilities, if what we want is possible, i.e. interpret the error coming from the kernel in the context of what we are asking. If this is not possible because the error is too generic (EINVAL) and can map to multiple other things, then we can try to have a new kernel facility for us to get this info in a authoritative way and use a pr_warning() knowing that that warning will go away as the kernel is updated. Or we could use a big hammer, special handling some PMUs/events we know can't sample and avoiding those when they fail, ugly, but should work, and this is just for this "well known event names" people have been using since forever. Those wanting specific events to be used should know better and specify it precisely in the command line. Does this sum up all the discussion so far? Cheers, - Arnaldo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs 2024-05-28 19:44 ` Arnaldo Carvalho de Melo 2024-05-28 19:51 ` Ian Rogers @ 2024-05-28 20:00 ` Linus Torvalds 1 sibling, 0 replies; 34+ messages in thread From: Linus Torvalds @ 2024-05-28 20:00 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: Ian Rogers, Leo Yan, Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Kan Liang, James Clark, Dominique Martinet, linux-perf-users, linux-kernel On Tue, 28 May 2024 at 12:44, Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > For 'perf record' we're asking for sampling, if the event has the name > specified and can't be sampled, skip it, warn the user and even so > only if verbose mode is asked, something like: Yes. I think that's the right rule in general. However, the more I have looked at this case, the more I am also convinced that "cycles" as a name is special. It's literally documented to be an alias for cpu-cycles, both in examples and in "perf list" output, and that's what the usage is. So even if you were to have some other PMU in the system that had a "cycles" thing, if it's not a core event but some other cycles ("uncore", bus cycles, bicycles, whatever), it shouldn't be used even if it could be used for profiling. You'd have to use the full PMU name and actually list it out if you want to use a non-core counter named "cycles". And yes, we even have some documentation that says exactly that: "e.g usage:: perf stat -a -e arm_dsu_0/cycles/" So this isn't even anything new or ambiguous. This is just how things *ARE*, and absolutely have to be. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2024-06-07 6:10 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-05-25 15:29 [PATCH v1] perf evlist: Force adding default events only to core PMUs Ian Rogers 2024-05-25 16:43 ` Linus Torvalds 2024-05-25 21:14 ` Linus Torvalds 2024-05-27 10:58 ` Leo Yan 2024-05-28 5:36 ` Ian Rogers 2024-05-28 17:00 ` Linus Torvalds 2024-05-28 17:39 ` Ian Rogers 2024-05-28 18:12 ` Linus Torvalds 2024-05-28 18:58 ` Ian Rogers 2024-05-28 19:42 ` Linus Torvalds 2024-05-28 20:03 ` Ian Rogers 2024-05-28 20:33 ` Linus Torvalds 2024-05-28 21:37 ` Ian Rogers 2024-05-28 21:42 ` Linus Torvalds 2024-05-28 19:44 ` Arnaldo Carvalho de Melo 2024-05-28 19:51 ` Ian Rogers 2024-05-29 14:50 ` James Clark 2024-05-29 17:33 ` Ian Rogers 2024-05-30 15:37 ` James Clark 2024-05-30 16:14 ` Ian Rogers 2024-05-29 18:44 ` Arnaldo Carvalho de Melo 2024-05-29 19:25 ` Ian Rogers 2024-05-30 5:35 ` Namhyung Kim 2024-05-30 12:48 ` James Clark 2024-05-30 13:46 ` Ian Rogers 2024-05-30 22:51 ` Namhyung Kim 2024-06-05 20:29 ` Namhyung Kim 2024-06-05 23:02 ` Ian Rogers 2024-06-06 7:09 ` Namhyung Kim 2024-06-06 9:42 ` James Clark 2024-06-06 13:51 ` Arnaldo Carvalho de Melo 2024-06-07 6:10 ` Namhyung Kim 2024-06-06 13:47 ` Arnaldo Carvalho de Melo 2024-05-28 20:00 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).