linux-perf-users.vger.kernel.org archive mirror
* [PATCH v4 00/10] perf stat fixes and improvements
@ 2025-11-13 18:05 Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 01/10] libperf cpumap: Reduce allocations and sorting in intersect Ian Rogers
                   ` (10 more replies)
  0 siblings, 11 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

A collection of fixes aiming to stabilize measurements and make
metrics such as memory bandwidth more reasonable.

Tool events are changed from always getting a PMU cpu map of all
online CPUs to getting either just CPU 0 or all online CPUs, depending
on the event. This avoids iterating over useless CPUs for events such
as `duration_time`. Fix a bug where duration_time didn't correctly use
the previous raw counts and would skip values in interval mode.

Change how json metrics handle tool events: use the counter value
rather than shared state with perf stat. A later patch changes it so
that tool events are read last, so that when reading, say, memory
bandwidth counters you don't divide by an earlier-read time and exceed
the theoretical maximum memory bandwidth.
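
As an illustration (numbers invented here, not from a real run): over
a 1s interval where the uncore counters report 6.4GB transferred,
dividing by a duration_time value read 50ms earlier gives
6.4GB / 0.95s ~= 6.7GB/s, above the true 6.4GB/s and potentially above
the theoretical peak. Reading time last keeps the divisor at least as
large as the span the counters cover.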

Do some cleanup around the shared state in stat-shadow that's no
longer used.

Change how affinities work with evlist__for_each_cpu. Move the
affinity code into the iterator to simplify setting it up. Detect when
affinities will and won't be profitable; for example, a tool event
plus a regular perf event (or read group) may face less delay from a
single IPI for the event read than from a call to sched_setaffinity.
Add a --no-affinity flag to perf stat to allow affinities to be
disabled.
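
For example, to compare runs with affinities disabled (the flag is
added by the last patch in this series):
```
$ perf stat --no-affinity -e cycles -a -I 1000
```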

v4: Rebase. Add patch to reduce scope of walltime_nsec_stats now that
    the legacy metric code is no more. Minor tweak to the ru_stats
    clean up.

v3: Add affinity clean ups and read tool events last.
https://lore.kernel.org/lkml/20251106071241.141234-1-irogers@google.com/

v2: Fixed an aggregation index issue:
https://lore.kernel.org/lkml/20251104234148.3103176-2-irogers@google.com/

v1:
https://lore.kernel.org/lkml/20251104053449.1208800-1-irogers@google.com/

Ian Rogers (10):
  libperf cpumap: Reduce allocations and sorting in intersect
  perf pmu: perf_cpu_map__new_int to avoid parsing a string
  perf tool_pmu: Use old_count when computing count values for time
    events
  perf stat-shadow: Read tool events directly
  perf stat: Reduce scope of ru_stats
  perf stat: Reduce scope of walltime_nsecs_stats
  perf tool_pmu: More accurately set the cpus for tool events
  perf evlist: Reduce affinity use and move into iterator, fix no
    affinity
  perf stat: Read tool events last
  perf stat: Add no-affinity flag

 tools/lib/perf/cpumap.c                |  29 ++--
 tools/perf/Documentation/perf-stat.txt |   4 +
 tools/perf/builtin-stat.c              | 210 +++++++++++++++----------
 tools/perf/tests/parse-metric.c        |   2 -
 tools/perf/tests/pmu-events.c          |   2 -
 tools/perf/util/config.c               |   3 +-
 tools/perf/util/drm_pmu.c              |   2 +-
 tools/perf/util/evlist.c               | 156 +++++++++++-------
 tools/perf/util/evlist.h               |  27 +++-
 tools/perf/util/hwmon_pmu.c            |   2 +-
 tools/perf/util/parse-events.c         |   9 +-
 tools/perf/util/pmu.c                  |  12 ++
 tools/perf/util/pmu.h                  |   1 +
 tools/perf/util/stat-shadow.c          | 152 ++++++++----------
 tools/perf/util/stat.h                 |  23 ---
 tools/perf/util/tool_pmu.c             |  78 ++++++---
 tools/perf/util/tool_pmu.h             |   1 +
 17 files changed, 407 insertions(+), 306 deletions(-)

-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 01/10] libperf cpumap: Reduce allocations and sorting in intersect
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 02/10] perf pmu: perf_cpu_map__new_int to avoid parsing a string Ian Rogers
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

On hybrid platforms the CPU maps are often disjoint. Rather than
copying CPUs and trimming, first compute the number of common CPUs;
if there are none, exit early, otherwise copy in sorted order. This
avoids memory allocation in the disjoint case, and avoids the second
malloc and useless sort of the previous trim approach.
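
As a standalone sketch of the scheme (a simplification, not the
libperf code itself), the first pass only counts matches so the result
can be allocated exactly once, and the second pass copies entries that
arrive already sorted:
```c
/* Pass 1 of the intersection: count common entries of two sorted arrays. */
static int intersect_count(const int *a, int na, const int *b, int nb)
{
	int i = 0, j = 0, k = 0;

	while (i < na && j < nb) {
		if (a[i] < b[j])
			i++;
		else if (a[i] > b[j])
			j++;
		else {	/* match */
			i++;
			j++;
			k++;
		}
	}
	return k;	/* 0 means disjoint: no allocation, early NULL return */
}
```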

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/lib/perf/cpumap.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/tools/lib/perf/cpumap.c b/tools/lib/perf/cpumap.c
index b20a5280f2b3..7e88417ba84d 100644
--- a/tools/lib/perf/cpumap.c
+++ b/tools/lib/perf/cpumap.c
@@ -453,21 +453,33 @@ int perf_cpu_map__merge(struct perf_cpu_map **orig, struct perf_cpu_map *other)
 struct perf_cpu_map *perf_cpu_map__intersect(struct perf_cpu_map *orig,
 					     struct perf_cpu_map *other)
 {
-	struct perf_cpu *tmp_cpus;
-	int tmp_len;
 	int i, j, k;
-	struct perf_cpu_map *merged = NULL;
+	struct perf_cpu_map *merged;
 
 	if (perf_cpu_map__is_subset(other, orig))
 		return perf_cpu_map__get(orig);
 	if (perf_cpu_map__is_subset(orig, other))
 		return perf_cpu_map__get(other);
 
-	tmp_len = max(__perf_cpu_map__nr(orig), __perf_cpu_map__nr(other));
-	tmp_cpus = malloc(tmp_len * sizeof(struct perf_cpu));
-	if (!tmp_cpus)
+	i = j = k = 0;
+	while (i < __perf_cpu_map__nr(orig) && j < __perf_cpu_map__nr(other)) {
+		if (__perf_cpu_map__cpu(orig, i).cpu < __perf_cpu_map__cpu(other, j).cpu)
+			i++;
+		else if (__perf_cpu_map__cpu(orig, i).cpu > __perf_cpu_map__cpu(other, j).cpu)
+			j++;
+		else { /* CPUs match. */
+			i++;
+			j++;
+			k++;
+		}
+	}
+	if (k == 0) /* Maps are completely disjoint. */
 		return NULL;
 
+	merged = perf_cpu_map__alloc(k);
+	if (!merged)
+		return NULL;
+	/* Entries are added to merged in sorted order, so no need to sort again. */
 	i = j = k = 0;
 	while (i < __perf_cpu_map__nr(orig) && j < __perf_cpu_map__nr(other)) {
 		if (__perf_cpu_map__cpu(orig, i).cpu < __perf_cpu_map__cpu(other, j).cpu)
@@ -476,11 +488,8 @@ struct perf_cpu_map *perf_cpu_map__intersect(struct perf_cpu_map *orig,
 			j++;
 		else {
 			j++;
-			tmp_cpus[k++] = __perf_cpu_map__cpu(orig, i++);
+			RC_CHK_ACCESS(merged)->map[k++] = __perf_cpu_map__cpu(orig, i++);
 		}
 	}
-	if (k)
-		merged = cpu_map__trim_new(k, tmp_cpus);
-	free(tmp_cpus);
 	return merged;
 }
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 02/10] perf pmu: perf_cpu_map__new_int to avoid parsing a string
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 01/10] libperf cpumap: Reduce allocations and sorting in intersect Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 03/10] perf tool_pmu: Use old_count when computing count values for time events Ian Rogers
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

Prefer perf_cpu_map__new_int(0) to perf_cpu_map__new("0") as it avoids
string parsing.
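
For reference, the libperf declaration this presumes (the exact
signature is an assumption of this note, as declared in
tools/lib/perf):
```c
LIBPERF_API struct perf_cpu_map *perf_cpu_map__new_int(int cpu);
```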

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/drm_pmu.c   | 2 +-
 tools/perf/util/hwmon_pmu.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/drm_pmu.c b/tools/perf/util/drm_pmu.c
index 98d4d2b556d4..fc864bde4cb2 100644
--- a/tools/perf/util/drm_pmu.c
+++ b/tools/perf/util/drm_pmu.c
@@ -119,7 +119,7 @@ static struct drm_pmu *add_drm_pmu(struct list_head *pmus, char *line, size_t li
 		return NULL;
 	}
 
-	drm->pmu.cpus = perf_cpu_map__new("0");
+	drm->pmu.cpus = perf_cpu_map__new_int(0);
 	if (!drm->pmu.cpus) {
 		perf_pmu__delete(&drm->pmu);
 		return NULL;
diff --git a/tools/perf/util/hwmon_pmu.c b/tools/perf/util/hwmon_pmu.c
index 5c27256a220a..279d6b1a47f0 100644
--- a/tools/perf/util/hwmon_pmu.c
+++ b/tools/perf/util/hwmon_pmu.c
@@ -376,7 +376,7 @@ struct perf_pmu *hwmon_pmu__new(struct list_head *pmus, const char *hwmon_dir,
 		perf_pmu__delete(&hwm->pmu);
 		return NULL;
 	}
-	hwm->pmu.cpus = perf_cpu_map__new("0");
+	hwm->pmu.cpus = perf_cpu_map__new_int(0);
 	if (!hwm->pmu.cpus) {
 		perf_pmu__delete(&hwm->pmu);
 		return NULL;
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 03/10] perf tool_pmu: Use old_count when computing count values for time events
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 01/10] libperf cpumap: Reduce allocations and sorting in intersect Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 02/10] perf pmu: perf_cpu_map__new_int to avoid parsing a string Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 04/10] perf stat-shadow: Read tool events directly Ian Rogers
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

When running in interval mode, every third count of a time event isn't
shown properly:
```
$ perf stat -e duration_time -a -I 1000
     1.001082862      1,002,290,425      duration_time
     2.004264262      1,003,183,516      duration_time
     3.007381401      <not counted>      duration_time
     4.011160141      1,003,705,631      duration_time
     5.014515385      1,003,290,110      duration_time
     6.018539680      <not counted>      duration_time
     7.022065321      1,003,591,720      duration_time
```
The regression, found through bisection, came in with a different fix:
commit 68cb1567439f ("perf tool_pmu: Fix aggregation on
duration_time"). The issue is caused by the enabled and running times
of the event matching the old_count's, creating a delta of 0, which is
treated as indicative of an error.
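
A minimal sketch of the failure mode (illustrative, not the perf
source): in interval mode deltas are taken against prev_raw_counts, so
a read returning the same enabled/running values as the previous read
yields a 0 enabled-time delta, which the display treats like an event
that never ran:
```c
struct counts { unsigned long long val, ena, run; };

/* Hypothetical check mirroring the intent of evsel__count_has_error. */
static int interval_looks_not_counted(const struct counts *cur,
				      const struct counts *prev)
{
	return (cur->ena - prev->ena) == 0; /* no enabled time this interval */
}
```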

Fixes: 68cb1567439f ("perf tool_pmu: Fix aggregation on duration_time")
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c  | 16 +++++++++--
 tools/perf/util/tool_pmu.c | 59 +++++++++++++++++++++-----------------
 2 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 095016b2209e..654f840f7a2f 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -283,17 +283,27 @@ static int read_single_counter(struct evsel *counter, int cpu_map_idx, int threa
 	if (err && cpu_map_idx == 0 &&
 	    (evsel__tool_event(counter) == TOOL_PMU__EVENT_USER_TIME ||
 	     evsel__tool_event(counter) == TOOL_PMU__EVENT_SYSTEM_TIME)) {
-		u64 val, *start_time;
 		struct perf_counts_values *count =
 			perf_counts(counter->counts, cpu_map_idx, thread);
+		struct perf_counts_values *old_count = NULL;
+		u64 val;
+
+		if (counter->prev_raw_counts)
+			old_count = perf_counts(counter->prev_raw_counts, cpu_map_idx, thread);
 
-		start_time = xyarray__entry(counter->start_times, cpu_map_idx, thread);
 		if (evsel__tool_event(counter) == TOOL_PMU__EVENT_USER_TIME)
 			val = ru_stats.ru_utime_usec_stat.mean;
 		else
 			val = ru_stats.ru_stime_usec_stat.mean;
-		count->ena = count->run = *start_time + val;
+
 		count->val = val;
+		if (old_count) {
+			count->run = old_count->run + 1;
+			count->ena = old_count->ena + 1;
+		} else {
+			count->run++;
+			count->ena++;
+		}
 		return 0;
 	}
 	return err;
diff --git a/tools/perf/util/tool_pmu.c b/tools/perf/util/tool_pmu.c
index a72c665ee644..6a9df3dc0e07 100644
--- a/tools/perf/util/tool_pmu.c
+++ b/tools/perf/util/tool_pmu.c
@@ -446,16 +446,39 @@ bool tool_pmu__read_event(enum tool_pmu_event ev,
 	}
 }
 
+static void perf_counts__update(struct perf_counts_values *count,
+				const struct perf_counts_values *old_count,
+				bool raw, u64 val)
+{
+	/*
+	 * The values of enabled and running must make a ratio of 100%. The
+	 * exact values don't matter as long as they are non-zero to avoid
+	 * issues with evsel__count_has_error.
+	 */
+	if (old_count) {
+		count->val = raw ? val : old_count->val + val;
+		count->run = old_count->run + 1;
+		count->ena = old_count->ena + 1;
+		count->lost = old_count->lost;
+	} else {
+		count->val = val;
+		count->run++;
+		count->ena++;
+		count->lost = 0;
+	}
+}
+
 int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 {
 	__u64 *start_time, cur_time, delta_start;
-	u64 val;
-	int fd, err = 0;
+	int err = 0;
 	struct perf_counts_values *count, *old_count = NULL;
 	bool adjust = false;
 	enum tool_pmu_event ev = evsel__tool_event(evsel);
 
 	count = perf_counts(evsel->counts, cpu_map_idx, thread);
+	if (evsel->prev_raw_counts)
+		old_count = perf_counts(evsel->prev_raw_counts, cpu_map_idx, thread);
 
 	switch (ev) {
 	case TOOL_PMU__EVENT_HAS_PMEM:
@@ -466,12 +489,11 @@ int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 	case TOOL_PMU__EVENT_NUM_PACKAGES:
 	case TOOL_PMU__EVENT_SLOTS:
 	case TOOL_PMU__EVENT_SMT_ON:
-	case TOOL_PMU__EVENT_SYSTEM_TSC_FREQ:
 	case TOOL_PMU__EVENT_CORE_WIDE:
 	case TOOL_PMU__EVENT_TARGET_CPU:
-		if (evsel->prev_raw_counts)
-			old_count = perf_counts(evsel->prev_raw_counts, cpu_map_idx, thread);
-		val = 0;
+	case TOOL_PMU__EVENT_SYSTEM_TSC_FREQ: {
+		u64 val = 0;
+
 		if (cpu_map_idx == 0 && thread == 0) {
 			if (!tool_pmu__read_event(ev, evsel,
 						  stat_config.system_wide,
@@ -481,16 +503,9 @@ int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 				val = 0;
 			}
 		}
-		if (old_count) {
-			count->val = old_count->val + val;
-			count->run = old_count->run + 1;
-			count->ena = old_count->ena + 1;
-		} else {
-			count->val = val;
-			count->run++;
-			count->ena++;
-		}
+		perf_counts__update(count, old_count, /*raw=*/false, val);
 		return 0;
+	}
 	case TOOL_PMU__EVENT_DURATION_TIME:
 		/*
 		 * Pretend duration_time is only on the first CPU and thread, or
@@ -506,9 +521,9 @@ int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 	case TOOL_PMU__EVENT_USER_TIME:
 	case TOOL_PMU__EVENT_SYSTEM_TIME: {
 		bool system = evsel__tool_event(evsel) == TOOL_PMU__EVENT_SYSTEM_TIME;
+		int fd = FD(evsel, cpu_map_idx, thread);
 
 		start_time = xyarray__entry(evsel->start_times, cpu_map_idx, thread);
-		fd = FD(evsel, cpu_map_idx, thread);
 		lseek(fd, SEEK_SET, 0);
 		if (evsel->pid_stat) {
 			/* The event exists solely on 1 CPU. */
@@ -542,17 +557,9 @@ int evsel__tool_pmu_read(struct evsel *evsel, int cpu_map_idx, int thread)
 	if (adjust) {
 		__u64 ticks_per_sec = sysconf(_SC_CLK_TCK);
 
-		delta_start *= 1000000000 / ticks_per_sec;
+		delta_start *= 1e9 / ticks_per_sec;
 	}
-	count->val    = delta_start;
-	count->lost   = 0;
-	/*
-	 * The values of enabled and running must make a ratio of 100%. The
-	 * exact values don't matter as long as they are non-zero to avoid
-	 * issues with evsel__count_has_error.
-	 */
-	count->ena++;
-	count->run++;
+	perf_counts__update(count, old_count, /*raw=*/true, delta_start);
 	return 0;
 }
 
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 04/10] perf stat-shadow: Read tool events directly
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (2 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 03/10] perf tool_pmu: Use old_count when computing count values for time events Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-18  2:30   ` Namhyung Kim
  2025-11-13 18:05 ` [PATCH v4 05/10] perf stat: Reduce scope of ru_stats Ian Rogers
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

When reading time values for metrics, don't use the globals updated in
builtin-stat; just read the events as regular events. The only
exception is time events, where nanoseconds need converting to seconds
as metrics assume times are in seconds.
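
For example, a bytes/second memory bandwidth metric divides a byte
count by duration_time; duration_time reads roughly 1,000,000,000 over
a 1s interval, so scaling by 1e-9 yields the 1.0s the metric
expression expects.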

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/stat-shadow.c | 149 +++++++++++++++-------------------
 1 file changed, 66 insertions(+), 83 deletions(-)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index b3b482e1808f..6c1ad78604e1 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -26,7 +26,32 @@ void perf_stat__reset_shadow_stats(void)
 	memset(&ru_stats, 0, sizeof(ru_stats));
 }
 
-static int prepare_metric(const struct metric_expr *mexp,
+static bool tool_pmu__is_time_event(const struct perf_stat_config *config,
+				   const struct evsel *evsel, int *tool_aggr_idx)
+{
+	enum tool_pmu_event event = evsel__tool_event(evsel);
+	int aggr_idx;
+
+	if (event != TOOL_PMU__EVENT_DURATION_TIME &&
+	    event != TOOL_PMU__EVENT_USER_TIME &&
+	    event != TOOL_PMU__EVENT_SYSTEM_TIME)
+		return false;
+
+	if (config) {
+		cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) {
+			if (config->aggr_map->map[aggr_idx].cpu.cpu == 0) {
+				*tool_aggr_idx = aggr_idx;
+				return true;
+			}
+		}
+		pr_debug("Unexpected CPU0 missing in aggregation for tool event.\n");
+	}
+	*tool_aggr_idx = 0; /* Assume the first aggregation index works. */
+	return true;
+}
+
+static int prepare_metric(struct perf_stat_config *config,
+			  const struct metric_expr *mexp,
 			  const struct evsel *evsel,
 			  struct expr_parse_ctx *pctx,
 			  int aggr_idx)
@@ -36,93 +61,51 @@ static int prepare_metric(const struct metric_expr *mexp,
 	int i;
 
 	for (i = 0; metric_events[i]; i++) {
+		int source_count = 0, tool_aggr_idx;
+		bool is_tool_time =
+			tool_pmu__is_time_event(config, metric_events[i], &tool_aggr_idx);
+		struct perf_stat_evsel *ps = metric_events[i]->stats;
+		struct perf_stat_aggr *aggr;
 		char *n;
 		double val;
-		int source_count = 0;
 
-		if (evsel__is_tool(metric_events[i])) {
-			struct stats *stats;
-			double scale;
-
-			switch (evsel__tool_event(metric_events[i])) {
-			case TOOL_PMU__EVENT_DURATION_TIME:
-				stats = &walltime_nsecs_stats;
-				scale = 1e-9;
-				break;
-			case TOOL_PMU__EVENT_USER_TIME:
-				stats = &ru_stats.ru_utime_usec_stat;
-				scale = 1e-6;
+		/*
+		 * If there are multiple uncore PMUs and we're not reading the
+		 * leader's stats, determine the stats for the appropriate
+		 * uncore PMU.
+		 */
+		if (evsel && evsel->metric_leader &&
+		    evsel->pmu != evsel->metric_leader->pmu &&
+		    mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
+			struct evsel *pos;
+
+			evlist__for_each_entry(evsel->evlist, pos) {
+				if (pos->pmu != evsel->pmu)
+					continue;
+				if (pos->metric_leader != mexp->metric_events[i])
+					continue;
+				ps = pos->stats;
+				source_count = 1;
 				break;
-			case TOOL_PMU__EVENT_SYSTEM_TIME:
-				stats = &ru_stats.ru_stime_usec_stat;
-				scale = 1e-6;
-				break;
-			case TOOL_PMU__EVENT_NONE:
-				pr_err("Invalid tool event 'none'");
-				abort();
-			case TOOL_PMU__EVENT_MAX:
-				pr_err("Invalid tool event 'max'");
-				abort();
-			case TOOL_PMU__EVENT_HAS_PMEM:
-			case TOOL_PMU__EVENT_NUM_CORES:
-			case TOOL_PMU__EVENT_NUM_CPUS:
-			case TOOL_PMU__EVENT_NUM_CPUS_ONLINE:
-			case TOOL_PMU__EVENT_NUM_DIES:
-			case TOOL_PMU__EVENT_NUM_PACKAGES:
-			case TOOL_PMU__EVENT_SLOTS:
-			case TOOL_PMU__EVENT_SMT_ON:
-			case TOOL_PMU__EVENT_SYSTEM_TSC_FREQ:
-			case TOOL_PMU__EVENT_CORE_WIDE:
-			case TOOL_PMU__EVENT_TARGET_CPU:
-			default:
-				pr_err("Unexpected tool event '%s'", evsel__name(metric_events[i]));
-				abort();
 			}
-			val = avg_stats(stats) * scale;
-			source_count = 1;
-		} else {
-			struct perf_stat_evsel *ps = metric_events[i]->stats;
-			struct perf_stat_aggr *aggr;
-
+		}
+		/* Time events are always on CPU0, the first aggregation index. */
+		aggr = &ps->aggr[is_tool_time ? tool_aggr_idx : aggr_idx];
+		if (!aggr || !metric_events[i]->supported) {
 			/*
-			 * If there are multiple uncore PMUs and we're not
-			 * reading the leader's stats, determine the stats for
-			 * the appropriate uncore PMU.
+			 * Not supported events will have a count of 0, which
+			 * can be confusing in a metric. Explicitly set the
+			 * value to NAN. Not counted events (enable time of 0)
+			 * are read as 0.
 			 */
-			if (evsel && evsel->metric_leader &&
-			    evsel->pmu != evsel->metric_leader->pmu &&
-			    mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
-				struct evsel *pos;
-
-				evlist__for_each_entry(evsel->evlist, pos) {
-					if (pos->pmu != evsel->pmu)
-						continue;
-					if (pos->metric_leader != mexp->metric_events[i])
-						continue;
-					ps = pos->stats;
-					source_count = 1;
-					break;
-				}
-			}
-			aggr = &ps->aggr[aggr_idx];
-			if (!aggr)
-				break;
-
-			if (!metric_events[i]->supported) {
-				/*
-				 * Not supported events will have a count of 0,
-				 * which can be confusing in a
-				 * metric. Explicitly set the value to NAN. Not
-				 * counted events (enable time of 0) are read as
-				 * 0.
-				 */
-				val = NAN;
-				source_count = 0;
-			} else {
-				val = aggr->counts.val;
-				if (!source_count)
-					source_count = evsel__source_count(metric_events[i]);
-			}
+			val = NAN;
+			source_count = 0;
+		} else {
+			val = aggr->counts.val;
+			if (is_tool_time)
+				val *= 1e-9; /* Convert time event nanoseconds to seconds. */
+			if (!source_count)
+				source_count = evsel__source_count(metric_events[i]);
 		}
 		n = strdup(evsel__metric_id(metric_events[i]));
 		if (!n)
@@ -168,7 +151,7 @@ static void generic_metric(struct perf_stat_config *config,
 		pctx->sctx.user_requested_cpu_list = strdup(config->user_requested_cpu_list);
 	pctx->sctx.runtime = runtime;
 	pctx->sctx.system_wide = config->system_wide;
-	i = prepare_metric(mexp, evsel, pctx, aggr_idx);
+	i = prepare_metric(config, mexp, evsel, pctx, aggr_idx);
 	if (i < 0) {
 		expr__ctx_free(pctx);
 		return;
@@ -229,7 +212,7 @@ double test_generic_metric(struct metric_expr *mexp, int aggr_idx)
 	if (!pctx)
 		return NAN;
 
-	if (prepare_metric(mexp, /*evsel=*/NULL, pctx, aggr_idx) < 0)
+	if (prepare_metric(/*config=*/NULL, mexp, /*evsel=*/NULL, pctx, aggr_idx) < 0)
 		goto out;
 
 	if (expr__parse(&ratio, pctx, mexp->metric_expr))
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 05/10] perf stat: Reduce scope of ru_stats
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (3 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 04/10] perf stat-shadow: Read tool events directly Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-18  2:31   ` Namhyung Kim
  2025-11-13 18:05 ` [PATCH v4 06/10] perf stat: Reduce scope of walltime_nsecs_stats Ian Rogers
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

The ru_stats are used to capture user and system time stats when a
process exits. These are then applied to user and system time tool
events if their reads fail due to the process terminating. Reduce the
scope now that the metric code no longer reads these values.
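
For context, a sketch of where the rusage values come from
(simplified; the reaping site and pid handling are perf internals):
```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap the workload and capture its user/system CPU time. */
static void reap_workload(pid_t child_pid, struct rusage *ru)
{
	int status;

	wait4(child_pid, &status, 0, ru);
	/* ru->ru_utime and ru->ru_stime then seed the user_time and
	 * system_time tool events once the exited process can no
	 * longer be read via /proc. */
}
```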

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c     | 19 ++++++++++++++++++-
 tools/perf/util/config.c      |  1 -
 tools/perf/util/stat-shadow.c |  2 --
 tools/perf/util/stat.h        | 21 ---------------------
 4 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 654f840f7a2f..d6f4c84f7d7e 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -104,6 +104,11 @@
 #define DEFAULT_SEPARATOR	" "
 #define FREEZE_ON_SMI_PATH	"bus/event_source/devices/cpu/freeze_on_smi"
 
+struct rusage_stats {
+	struct stats ru_utime_usec_stat;
+	struct stats ru_stime_usec_stat;
+};
+
 static void print_counters(struct timespec *ts, int argc, const char **argv);
 
 static struct evlist	*evsel_list;
@@ -133,6 +138,7 @@ static bool			interval_count;
 static const char		*output_name;
 static int			output_fd;
 static char			*metrics;
+static struct rusage_stats	ru_stats;
 
 struct perf_stat {
 	bool			 record;
@@ -730,6 +736,17 @@ static int create_perf_stat_counter(struct evsel *evsel,
 					      evsel->core.threads);
 }
 
+static void update_rusage_stats(const struct rusage *rusage)
+{
+	const u64 us_to_ns = 1000;
+	const u64 s_to_ns = 1000000000;
+
+	update_stats(&ru_stats.ru_utime_usec_stat,
+		(rusage->ru_utime.tv_usec * us_to_ns + rusage->ru_utime.tv_sec * s_to_ns));
+	update_stats(&ru_stats.ru_stime_usec_stat,
+		(rusage->ru_stime.tv_usec * us_to_ns + rusage->ru_stime.tv_sec * s_to_ns));
+}
+
 static int __run_perf_stat(int argc, const char **argv, int run_idx)
 {
 	int interval = stat_config.interval;
@@ -979,7 +996,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		evlist__reset_aggr_stats(evsel_list);
 	} else {
 		update_stats(&walltime_nsecs_stats, t1 - t0);
-		update_rusage_stats(&ru_stats, &stat_config.ru_data);
+		update_rusage_stats(&stat_config.ru_data);
 	}
 
 	/*
diff --git a/tools/perf/util/config.c b/tools/perf/util/config.c
index 6f914620c6ff..cc0746f494f4 100644
--- a/tools/perf/util/config.c
+++ b/tools/perf/util/config.c
@@ -45,7 +45,6 @@ struct perf_stat_config stat_config = {
 	.run_count		= 1,
 	.metric_only_len	= METRIC_ONLY_LEN,
 	.walltime_nsecs_stats	= &walltime_nsecs_stats,
-	.ru_stats		= &ru_stats,
 	.big_num		= true,
 	.ctl_fd			= -1,
 	.ctl_fd_ack		= -1,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 6c1ad78604e1..cb7c741a1ebb 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -18,12 +18,10 @@
 #include "tool_pmu.h"
 
 struct stats walltime_nsecs_stats;
-struct rusage_stats ru_stats;
 
 void perf_stat__reset_shadow_stats(void)
 {
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
-	memset(&ru_stats, 0, sizeof(ru_stats));
 }
 
 static bool tool_pmu__is_time_event(const struct perf_stat_config *config,
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index b42da4a29c44..055b95d18106 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -56,11 +56,6 @@ enum aggr_mode {
 	AGGR_MAX
 };
 
-struct rusage_stats {
-	struct stats ru_utime_usec_stat;
-	struct stats ru_stime_usec_stat;
-};
-
 typedef struct aggr_cpu_id (*aggr_get_id_t)(struct perf_stat_config *config, struct perf_cpu cpu);
 
 struct perf_stat_config {
@@ -102,7 +97,6 @@ struct perf_stat_config {
 	const char		*csv_sep;
 	struct stats		*walltime_nsecs_stats;
 	struct rusage		 ru_data;
-	struct rusage_stats		 *ru_stats;
 	struct cpu_aggr_map	*aggr_map;
 	aggr_get_id_t		 aggr_get_id;
 	struct cpu_aggr_map	*cpus_aggr_map;
@@ -132,25 +126,10 @@ static inline void init_stats(struct stats *stats)
 	stats->max  = 0;
 }
 
-static inline void init_rusage_stats(struct rusage_stats *ru_stats) {
-	init_stats(&ru_stats->ru_utime_usec_stat);
-	init_stats(&ru_stats->ru_stime_usec_stat);
-}
-
-static inline void update_rusage_stats(struct rusage_stats *ru_stats, struct rusage* rusage) {
-	const u64 us_to_ns = 1000;
-	const u64 s_to_ns = 1000000000;
-	update_stats(&ru_stats->ru_utime_usec_stat,
-	             (rusage->ru_utime.tv_usec * us_to_ns + rusage->ru_utime.tv_sec * s_to_ns));
-	update_stats(&ru_stats->ru_stime_usec_stat,
-	             (rusage->ru_stime.tv_usec * us_to_ns + rusage->ru_stime.tv_sec * s_to_ns));
-}
-
 struct evsel;
 struct evlist;
 
 extern struct stats walltime_nsecs_stats;
-extern struct rusage_stats ru_stats;
 
 enum metric_threshold_classify {
 	METRIC_THRESHOLD_UNKNOWN,
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 06/10] perf stat: Reduce scope of walltime_nsecs_stats
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (4 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 05/10] perf stat: Reduce scope of ru_stats Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 07/10] perf tool_pmu: More accurately set the cpus for tool events Ian Rogers
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

walltime_nsecs_stats is no longer used for counter values, so move it
into stat_config where it controls certain things like noise
measurement.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c       | 16 ++++++++--------
 tools/perf/tests/parse-metric.c |  2 --
 tools/perf/tests/pmu-events.c   |  2 --
 tools/perf/util/config.c        |  2 ++
 tools/perf/util/stat-shadow.c   |  7 -------
 tools/perf/util/stat.h          |  2 --
 6 files changed, 10 insertions(+), 21 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index d6f4c84f7d7e..ca1c80c141b6 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -239,7 +239,7 @@ static inline void diff_timespec(struct timespec *r, struct timespec *a,
 static void perf_stat__reset_stats(void)
 {
 	evlist__reset_stats(evsel_list);
-	perf_stat__reset_shadow_stats();
+	memset(stat_config.walltime_nsecs_stats, 0, sizeof(*stat_config.walltime_nsecs_stats));
 }
 
 static int process_synthesized_event(const struct perf_tool *tool __maybe_unused,
@@ -455,8 +455,8 @@ static void process_interval(void)
 			pr_err("failed to write stat round event\n");
 	}
 
-	init_stats(&walltime_nsecs_stats);
-	update_stats(&walltime_nsecs_stats, stat_config.interval * 1000000ULL);
+	init_stats(stat_config.walltime_nsecs_stats);
+	update_stats(stat_config.walltime_nsecs_stats, stat_config.interval * 1000000ULL);
 	print_counters(&rs, 0, NULL);
 }
 
@@ -988,14 +988,14 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	if (interval && stat_config.summary) {
 		stat_config.interval = 0;
 		stat_config.stop_read_counter = true;
-		init_stats(&walltime_nsecs_stats);
-		update_stats(&walltime_nsecs_stats, t1 - t0);
+		init_stats(stat_config.walltime_nsecs_stats);
+		update_stats(stat_config.walltime_nsecs_stats, t1 - t0);
 
 		evlist__copy_prev_raw_counts(evsel_list);
 		evlist__reset_prev_raw_counts(evsel_list);
 		evlist__reset_aggr_stats(evsel_list);
 	} else {
-		update_stats(&walltime_nsecs_stats, t1 - t0);
+		update_stats(stat_config.walltime_nsecs_stats, t1 - t0);
 		update_rusage_stats(&stat_config.ru_data);
 	}
 
@@ -2167,7 +2167,7 @@ static int process_stat_round_event(const struct perf_tool *tool __maybe_unused,
 	process_counters();
 
 	if (stat_round->type == PERF_STAT_ROUND_TYPE__FINAL)
-		update_stats(&walltime_nsecs_stats, stat_round->time);
+		update_stats(stat_config.walltime_nsecs_stats, stat_round->time);
 
 	if (stat_config.interval && stat_round->time) {
 		tsh.tv_sec  = stat_round->time / NSEC_PER_SEC;
@@ -2975,7 +2975,7 @@ int cmd_stat(int argc, const char **argv)
 		}
 
 		if (!interval) {
-			if (WRITE_STAT_ROUND_EVENT(walltime_nsecs_stats.max, FINAL))
+			if (WRITE_STAT_ROUND_EVENT(stat_config.walltime_nsecs_stats->max, FINAL))
 				pr_err("failed to write stat round event\n");
 		}
 
diff --git a/tools/perf/tests/parse-metric.c b/tools/perf/tests/parse-metric.c
index 66a5275917e2..7ca0b0791677 100644
--- a/tools/perf/tests/parse-metric.c
+++ b/tools/perf/tests/parse-metric.c
@@ -40,8 +40,6 @@ static void load_runtime_stat(struct evlist *evlist, struct value *vals)
 		count = find_value(evsel->name, vals);
 		evsel->supported = true;
 		evsel->stats->aggr->counts.val = count;
-		if (evsel__name_is(evsel, "duration_time"))
-			update_stats(&walltime_nsecs_stats, count);
 	}
 }
 
diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index f40a828c9861..a99716862168 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -872,8 +872,6 @@ static int test__parsing_callback(const struct pmu_metric *pm,
 	evlist__alloc_aggr_stats(evlist, 1);
 	evlist__for_each_entry(evlist, evsel) {
 		evsel->stats->aggr->counts.val = k;
-		if (evsel__name_is(evsel, "duration_time"))
-			update_stats(&walltime_nsecs_stats, k);
 		k++;
 	}
 	evlist__for_each_entry(evlist, evsel) {
diff --git a/tools/perf/util/config.c b/tools/perf/util/config.c
index cc0746f494f4..e0219bc6330a 100644
--- a/tools/perf/util/config.c
+++ b/tools/perf/util/config.c
@@ -37,6 +37,8 @@
 
 #define METRIC_ONLY_LEN 20
 
+static struct stats walltime_nsecs_stats;
+
 struct perf_stat_config stat_config = {
 	.aggr_mode		= AGGR_GLOBAL,
 	.aggr_level		= MAX_CACHE_LVL + 1,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index cb7c741a1ebb..69f09db0c945 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -17,13 +17,6 @@
 #include "util/hashmap.h"
 #include "tool_pmu.h"
 
-struct stats walltime_nsecs_stats;
-
-void perf_stat__reset_shadow_stats(void)
-{
-	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
-}
-
 static bool tool_pmu__is_time_event(const struct perf_stat_config *config,
 				   const struct evsel *evsel, int *tool_aggr_idx)
 {
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 055b95d18106..f986911c9296 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -129,8 +129,6 @@ static inline void init_stats(struct stats *stats)
 struct evsel;
 struct evlist;
 
-extern struct stats walltime_nsecs_stats;
-
 enum metric_threshold_classify {
 	METRIC_THRESHOLD_UNKNOWN,
 	METRIC_THRESHOLD_BAD,
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 07/10] perf tool_pmu: More accurately set the cpus for tool events
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (5 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 06/10] perf stat: Reduce scope of walltime_nsecs_stats Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 08/10] perf evlist: Reduce affinity use and move into iterator, fix no affinity Ian Rogers
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

The user and system time events can record on different CPUs, but for
all other tool events a single CPU map of just CPU 0 makes sense. In
parse-events, detect a tool PMU and then pass the perf_event_attr so
that the tool_pmu can return CPUs specific to the event. This avoids a
CPU map of all online CPUs being used for events like duration_time,
which in turn avoids the evlist CPUs containing CPUs for which
duration_time just gives 0. Minimizing the evlist CPUs can remove
unnecessary sched_setaffinity syscalls that delay metric calculations.
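
A sketch of the resulting behavior, using identifiers from this patch
(perf-internal, not a standalone program):
```c
struct perf_event_attr attr = { .config = TOOL_PMU__EVENT_DURATION_TIME };
struct perf_cpu_map *dur_cpus = tool_pmu__cpus(&attr);	/* just CPU 0 */

attr.config = TOOL_PMU__EVENT_USER_TIME;
struct perf_cpu_map *ut_cpus = tool_pmu__cpus(&attr);	/* all online CPUs */
```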

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/parse-events.c |  9 +++++++--
 tools/perf/util/tool_pmu.c     | 19 +++++++++++++++++++
 tools/perf/util/tool_pmu.h     |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 0c0dc20b1c13..7b2422ccb554 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -30,6 +30,7 @@
 #include "util/event.h"
 #include "util/bpf-filter.h"
 #include "util/stat.h"
+#include "util/tool_pmu.h"
 #include "util/util.h"
 #include "tracepoint.h"
 #include <api/fs/tracing_path.h>
@@ -227,8 +228,12 @@ __add_event(struct list_head *list, int *idx,
 	if (pmu) {
 		is_pmu_core = pmu->is_core;
 		pmu_cpus = perf_cpu_map__get(pmu->cpus);
-		if (perf_cpu_map__is_empty(pmu_cpus))
-			pmu_cpus = cpu_map__online();
+		if (perf_cpu_map__is_empty(pmu_cpus)) {
+			if (perf_pmu__is_tool(pmu))
+				pmu_cpus = tool_pmu__cpus(attr);
+			else
+				pmu_cpus = cpu_map__online();
+		}
 	} else {
 		is_pmu_core = (attr->type == PERF_TYPE_HARDWARE ||
 			       attr->type == PERF_TYPE_HW_CACHE);
diff --git a/tools/perf/util/tool_pmu.c b/tools/perf/util/tool_pmu.c
index 6a9df3dc0e07..37c4eae0bef1 100644
--- a/tools/perf/util/tool_pmu.c
+++ b/tools/perf/util/tool_pmu.c
@@ -2,6 +2,7 @@
 #include "cgroup.h"
 #include "counts.h"
 #include "cputopo.h"
+#include "debug.h"
 #include "evsel.h"
 #include "pmu.h"
 #include "print-events.h"
@@ -13,6 +14,7 @@
 #include <api/fs/fs.h>
 #include <api/io.h>
 #include <internal/threadmap.h>
+#include <perf/cpumap.h>
 #include <perf/threadmap.h>
 #include <fcntl.h>
 #include <strings.h>
@@ -109,6 +111,23 @@ const char *evsel__tool_pmu_event_name(const struct evsel *evsel)
 	return tool_pmu__event_to_str(evsel->core.attr.config);
 }
 
+struct perf_cpu_map *tool_pmu__cpus(struct perf_event_attr *attr)
+{
+	static struct perf_cpu_map *cpu0_map;
+	enum tool_pmu_event event = (enum tool_pmu_event)attr->config;
+
+	if (event <= TOOL_PMU__EVENT_NONE || event >= TOOL_PMU__EVENT_MAX) {
+		pr_err("Invalid tool PMU event config %llx\n", attr->config);
+		return NULL;
+	}
+	if (event == TOOL_PMU__EVENT_USER_TIME || event == TOOL_PMU__EVENT_SYSTEM_TIME)
+		return cpu_map__online();
+
+	if (!cpu0_map)
+		cpu0_map = perf_cpu_map__new_int(0);
+	return perf_cpu_map__get(cpu0_map);
+}
+
 static bool read_until_char(struct io *io, char e)
 {
 	int c;
diff --git a/tools/perf/util/tool_pmu.h b/tools/perf/util/tool_pmu.h
index f1714001bc1d..ea343d1983d3 100644
--- a/tools/perf/util/tool_pmu.h
+++ b/tools/perf/util/tool_pmu.h
@@ -46,6 +46,7 @@ bool tool_pmu__read_event(enum tool_pmu_event ev,
 u64 tool_pmu__cpu_slots_per_cycle(void);
 
 bool perf_pmu__is_tool(const struct perf_pmu *pmu);
+struct perf_cpu_map *tool_pmu__cpus(struct perf_event_attr *attr);
 
 bool evsel__is_tool(const struct evsel *evsel);
 enum tool_pmu_event evsel__tool_event(const struct evsel *evsel);
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 08/10] perf evlist: Reduce affinity use and move into iterator, fix no affinity
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (6 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 07/10] perf tool_pmu: More accurately set the cpus for tool events Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-13 18:05 ` [PATCH v4 09/10] perf stat: Read tool events last Ian Rogers
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

The evlist__for_each_cpu iterator will call sched_setaffinity when
moving between CPUs to avoid IPIs. If only 1 IPI is saved then this
may be unprofitable, as the delay to get scheduled may be
considerable. This may be particularly true when reading an event
group in `perf stat` in interval mode.

Move the affinity handling completely into the iterator so that a
single evlist__use_affinity can determine whether CPU affinities will
be used. For `perf record` the change is minimal as the dummy event
plus the real event always make using affinities worthwhile. In `perf
stat`, tool events are ignored and affinities are only used if more
than one event occurs on the same CPU. Determining if affinities are
useful is done per-event in a new 'PMU benefits from affinity'
function.
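
As a concrete example (with hypothetical costs): `perf stat -a -e
duration_time,cycles -I 100` has at most one perf event per CPU, so a
read costs one IPI per CPU; migrating the perf process to each CPU via
sched_setaffinity to save that single IPI can cost far more in
scheduling delay, so the iterator skips affinities in this case.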

Fix a bug where, when affinities aren't used, the CPU map iterator
may reference a CPU not present in the initial evsel's CPU map. Fix by
making the iterator and non-iterator code common.

Fix a bug where closing events on an evlist wasn't closing TPEBS
events.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c | 128 ++++++++++++++----------------
 tools/perf/util/evlist.c  | 160 ++++++++++++++++++++++++--------------
 tools/perf/util/evlist.h  |  26 +++++--
 tools/perf/util/pmu.c     |  12 +++
 tools/perf/util/pmu.h     |   1 +
 5 files changed, 189 insertions(+), 138 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index ca1c80c141b6..947f11b8b106 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -366,22 +366,14 @@ static int read_counter_cpu(struct evsel *counter, int cpu_map_idx)
 	return 0;
 }
 
-static int read_affinity_counters(void)
+static int read_counters_with_affinity(void)
 {
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity;
 
 	if (all_counters_use_bpf)
 		return 0;
 
-	if (!target__has_cpu(&target) || target__has_per_thread(&target))
-		affinity = NULL;
-	else if (affinity__setup(&saved_affinity) < 0)
-		return -1;
-	else
-		affinity = &saved_affinity;
-
-	evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
+	evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
 		struct evsel *counter = evlist_cpu_itr.evsel;
 
 		if (evsel__is_bpf(counter))
@@ -390,8 +382,6 @@ static int read_affinity_counters(void)
 		if (!counter->err)
 			counter->err = read_counter_cpu(counter, evlist_cpu_itr.cpu_map_idx);
 	}
-	if (affinity)
-		affinity__cleanup(&saved_affinity);
 
 	return 0;
 }
@@ -414,12 +404,18 @@ static int read_bpf_map_counters(void)
 
 static int read_counters(void)
 {
-	if (!stat_config.stop_read_counter) {
-		if (read_bpf_map_counters() ||
-		    read_affinity_counters())
-			return -1;
-	}
-	return 0;
+	int ret;
+
+	if (stat_config.stop_read_counter)
+		return 0;
+
+	// Read all BPF counters first.
+	ret = read_bpf_map_counters();
+	if (ret)
+		return ret;
+
+	// Read non-BPF and non-tool counters next.
+	return read_counters_with_affinity();
 }
 
 static void process_counters(void)
@@ -760,7 +756,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	const bool forks = (argc > 0);
 	bool is_pipe = STAT_RECORD ? perf_stat.data.is_pipe : false;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity = NULL;
 	int err, open_err = 0;
 	bool second_pass = false, has_supported_counters;
 
@@ -772,14 +767,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		child_pid = evsel_list->workload.pid;
 	}
 
-	if (!cpu_map__is_dummy(evsel_list->core.user_requested_cpus)) {
-		if (affinity__setup(&saved_affinity) < 0) {
-			err = -1;
-			goto err_out;
-		}
-		affinity = &saved_affinity;
-	}
-
 	evlist__for_each_entry(evsel_list, counter) {
 		counter->reset_group = false;
 		if (bpf_counter__load(counter, &target)) {
@@ -792,49 +779,48 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 
 	evlist__reset_aggr_stats(evsel_list);
 
-	evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
-		counter = evlist_cpu_itr.evsel;
+	/*
+	 * bperf calls evsel__open_per_cpu() in bperf__load(), so
+	 * no need to call it again here.
+	 */
+	if (!target.use_bpf) {
+		evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
+			counter = evlist_cpu_itr.evsel;
 
-		/*
-		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
-		 * no need to call it again here.
-		 */
-		if (target.use_bpf)
-			break;
+			if (counter->reset_group || !counter->supported)
+				continue;
+			if (evsel__is_bperf(counter))
+				continue;
 
-		if (counter->reset_group || !counter->supported)
-			continue;
-		if (evsel__is_bperf(counter))
-			continue;
+			while (true) {
+				if (create_perf_stat_counter(counter, &stat_config,
+							      evlist_cpu_itr.cpu_map_idx) == 0)
+					break;
 
-		while (true) {
-			if (create_perf_stat_counter(counter, &stat_config,
-						     evlist_cpu_itr.cpu_map_idx) == 0)
-				break;
+				open_err = errno;
+				/*
+				 * Weak group failed. We cannot just undo this
+				 * here because earlier CPUs might be in group
+				 * mode, and the kernel doesn't support mixing
+				 * group and non group reads. Defer it to later.
+				 * Don't close here because we're in the wrong
+				 * affinity.
+				 */
+				if ((open_err == EINVAL || open_err == EBADF) &&
+					evsel__leader(counter) != counter &&
+					counter->weak_group) {
+					evlist__reset_weak_group(evsel_list, counter, false);
+					assert(counter->reset_group);
+					counter->supported = true;
+					second_pass = true;
+					break;
+				}
 
-			open_err = errno;
-			/*
-			 * Weak group failed. We cannot just undo this here
-			 * because earlier CPUs might be in group mode, and the kernel
-			 * doesn't support mixing group and non group reads. Defer
-			 * it to later.
-			 * Don't close here because we're in the wrong affinity.
-			 */
-			if ((open_err == EINVAL || open_err == EBADF) &&
-				evsel__leader(counter) != counter &&
-				counter->weak_group) {
-				evlist__reset_weak_group(evsel_list, counter, false);
-				assert(counter->reset_group);
-				counter->supported = true;
-				second_pass = true;
-				break;
+				if (stat_handle_error(counter, open_err) != COUNTER_RETRY)
+					break;
 			}
-
-			if (stat_handle_error(counter, open_err) != COUNTER_RETRY)
-				break;
 		}
 	}
-
 	if (second_pass) {
 		/*
 		 * Now redo all the weak group after closing them,
@@ -842,7 +828,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		 */
 
 		/* First close errored or weak retry */
-		evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
+		evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
 			counter = evlist_cpu_itr.evsel;
 
 			if (!counter->reset_group && counter->supported)
@@ -851,7 +837,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 			perf_evsel__close_cpu(&counter->core, evlist_cpu_itr.cpu_map_idx);
 		}
 		/* Now reopen weak */
-		evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
+		evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
 			counter = evlist_cpu_itr.evsel;
 
 			if (!counter->reset_group)
@@ -860,17 +846,18 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 			while (true) {
 				pr_debug2("reopening weak %s\n", evsel__name(counter));
 				if (create_perf_stat_counter(counter, &stat_config,
-							     evlist_cpu_itr.cpu_map_idx) == 0)
+							     evlist_cpu_itr.cpu_map_idx) == 0) {
+					evlist_cpu_iterator__exit(&evlist_cpu_itr);
 					break;
-
+				}
 				open_err = errno;
-				if (stat_handle_error(counter, open_err) != COUNTER_RETRY)
+				if (stat_handle_error(counter, open_err) != COUNTER_RETRY) {
+					evlist_cpu_iterator__exit(&evlist_cpu_itr);
 					break;
+				}
 			}
 		}
 	}
-	affinity__cleanup(affinity);
-	affinity = NULL;
 
 	has_supported_counters = false;
 	evlist__for_each_entry(evsel_list, counter) {
@@ -1021,7 +1008,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	if (forks)
 		evlist__cancel_workload(evsel_list);
 
-	affinity__cleanup(affinity);
 	return err;
 }
 
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e8217efdda53..b6df81b8a236 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -358,36 +358,111 @@ int evlist__add_newtp(struct evlist *evlist, const char *sys, const char *name,
 }
 #endif
 
-struct evlist_cpu_iterator evlist__cpu_begin(struct evlist *evlist, struct affinity *affinity)
+/*
+ * Should sched_setaffinity be used with evlist__for_each_cpu? Determine if
+ * migrating the thread will avoid possibly numerous IPIs.
+ */
+static bool evlist__use_affinity(struct evlist *evlist)
+{
+	struct evsel *pos;
+	struct perf_cpu_map *used_cpus = NULL;
+	bool ret = false;
+
+	/*
+	 * With perf record core.user_requested_cpus is usually NULL.
+	 * Use the old method to handle this for now.
+	 */
+	if (!evlist->core.user_requested_cpus ||
+	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
+		return false;
+
+	evlist__for_each_entry(evlist, pos) {
+		struct perf_cpu_map *intersect;
+
+		if (!perf_pmu__benefits_from_affinity(pos->pmu))
+			continue;
+
+		if (evsel__is_dummy_event(pos)) {
+			/*
+			 * The dummy event is opened on all CPUs so assume >1
+			 * event with shared CPUs.
+			 */
+			ret = true;
+			break;
+		}
+		if (evsel__is_retire_lat(pos)) {
+			/*
+			 * Retirement latency events are similar to tool ones in
+			 * their implementation, and so don't require affinity.
+			 */
+			continue;
+		}
+		if (perf_cpu_map__is_empty(used_cpus)) {
+			/* First benefitting event, we want >1 on a common CPU. */
+			used_cpus = perf_cpu_map__get(pos->core.cpus);
+			continue;
+		}
+		if ((pos->core.attr.read_format & PERF_FORMAT_GROUP) &&
+		    evsel__leader(pos) != pos) {
+			/* Skip members of the same sample group. */
+			continue;
+		}
+		intersect = perf_cpu_map__intersect(used_cpus, pos->core.cpus);
+		if (!perf_cpu_map__is_empty(intersect)) {
+			/* >1 event with shared CPUs. */
+			perf_cpu_map__put(intersect);
+			ret = true;
+			break;
+		}
+		perf_cpu_map__put(intersect);
+		perf_cpu_map__merge(&used_cpus, pos->core.cpus);
+	}
+	perf_cpu_map__put(used_cpus);
+	return ret;
+}
+
+void evlist_cpu_iterator__init(struct evlist_cpu_iterator *itr, struct evlist *evlist)
 {
-	struct evlist_cpu_iterator itr = {
+	*itr = (struct evlist_cpu_iterator){
 		.container = evlist,
 		.evsel = NULL,
 		.cpu_map_idx = 0,
 		.evlist_cpu_map_idx = 0,
 		.evlist_cpu_map_nr = perf_cpu_map__nr(evlist->core.all_cpus),
 		.cpu = (struct perf_cpu){ .cpu = -1},
-		.affinity = affinity,
+		.affinity = NULL,
 	};
 
 	if (evlist__empty(evlist)) {
 		/* Ensure the empty list doesn't iterate. */
-		itr.evlist_cpu_map_idx = itr.evlist_cpu_map_nr;
-	} else {
-		itr.evsel = evlist__first(evlist);
-		if (itr.affinity) {
-			itr.cpu = perf_cpu_map__cpu(evlist->core.all_cpus, 0);
-			affinity__set(itr.affinity, itr.cpu.cpu);
-			itr.cpu_map_idx = perf_cpu_map__idx(itr.evsel->core.cpus, itr.cpu);
-			/*
-			 * If this CPU isn't in the evsel's cpu map then advance
-			 * through the list.
-			 */
-			if (itr.cpu_map_idx == -1)
-				evlist_cpu_iterator__next(&itr);
-		}
+		itr->evlist_cpu_map_idx = itr->evlist_cpu_map_nr;
+		return;
 	}
-	return itr;
+
+	if (evlist__use_affinity(evlist)) {
+		if (affinity__setup(&itr->saved_affinity) == 0)
+			itr->affinity = &itr->saved_affinity;
+	}
+	itr->evsel = evlist__first(evlist);
+	itr->cpu = perf_cpu_map__cpu(evlist->core.all_cpus, 0);
+	if (itr->affinity)
+		affinity__set(itr->affinity, itr->cpu.cpu);
+	itr->cpu_map_idx = perf_cpu_map__idx(itr->evsel->core.cpus, itr->cpu);
+	/*
+	 * If this CPU isn't in the evsel's cpu map then advance
+	 * through the list.
+	 */
+	if (itr->cpu_map_idx == -1)
+		evlist_cpu_iterator__next(itr);
+}
+
+void evlist_cpu_iterator__exit(struct evlist_cpu_iterator *itr)
+{
+	if (!itr->affinity)
+		return;
+
+	affinity__cleanup(itr->affinity);
+	itr->affinity = NULL;
 }
 
 void evlist_cpu_iterator__next(struct evlist_cpu_iterator *evlist_cpu_itr)
@@ -417,14 +492,11 @@ void evlist_cpu_iterator__next(struct evlist_cpu_iterator *evlist_cpu_itr)
 		 */
 		if (evlist_cpu_itr->cpu_map_idx == -1)
 			evlist_cpu_iterator__next(evlist_cpu_itr);
+	} else {
+		evlist_cpu_iterator__exit(evlist_cpu_itr);
 	}
 }
 
-bool evlist_cpu_iterator__end(const struct evlist_cpu_iterator *evlist_cpu_itr)
-{
-	return evlist_cpu_itr->evlist_cpu_map_idx >= evlist_cpu_itr->evlist_cpu_map_nr;
-}
-
 static int evsel__strcmp(struct evsel *pos, char *evsel_name)
 {
 	if (!evsel_name)
@@ -452,19 +524,11 @@ static void __evlist__disable(struct evlist *evlist, char *evsel_name, bool excl
 {
 	struct evsel *pos;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity = NULL;
 	bool has_imm = false;
 
-	// See explanation in evlist__close()
-	if (!cpu_map__is_dummy(evlist->core.user_requested_cpus)) {
-		if (affinity__setup(&saved_affinity) < 0)
-			return;
-		affinity = &saved_affinity;
-	}
-
 	/* Disable 'immediate' events last */
 	for (int imm = 0; imm <= 1; imm++) {
-		evlist__for_each_cpu(evlist_cpu_itr, evlist, affinity) {
+		evlist__for_each_cpu(evlist_cpu_itr, evlist) {
 			pos = evlist_cpu_itr.evsel;
 			if (evsel__strcmp(pos, evsel_name))
 				continue;
@@ -482,7 +546,6 @@ static void __evlist__disable(struct evlist *evlist, char *evsel_name, bool excl
 			break;
 	}
 
-	affinity__cleanup(affinity);
 	evlist__for_each_entry(evlist, pos) {
 		if (evsel__strcmp(pos, evsel_name))
 			continue;
@@ -522,16 +585,8 @@ static void __evlist__enable(struct evlist *evlist, char *evsel_name, bool excl_
 {
 	struct evsel *pos;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity = NULL;
 
-	// See explanation in evlist__close()
-	if (!cpu_map__is_dummy(evlist->core.user_requested_cpus)) {
-		if (affinity__setup(&saved_affinity) < 0)
-			return;
-		affinity = &saved_affinity;
-	}
-
-	evlist__for_each_cpu(evlist_cpu_itr, evlist, affinity) {
+	evlist__for_each_cpu(evlist_cpu_itr, evlist) {
 		pos = evlist_cpu_itr.evsel;
 		if (evsel__strcmp(pos, evsel_name))
 			continue;
@@ -541,7 +596,6 @@ static void __evlist__enable(struct evlist *evlist, char *evsel_name, bool excl_
 			continue;
 		evsel__enable_cpu(pos, evlist_cpu_itr.cpu_map_idx);
 	}
-	affinity__cleanup(affinity);
 	evlist__for_each_entry(evlist, pos) {
 		if (evsel__strcmp(pos, evsel_name))
 			continue;
@@ -1338,28 +1392,14 @@ void evlist__close(struct evlist *evlist)
 {
 	struct evsel *evsel;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity affinity;
-
-	/*
-	 * With perf record core.user_requested_cpus is usually NULL.
-	 * Use the old method to handle this for now.
-	 */
-	if (!evlist->core.user_requested_cpus ||
-	    cpu_map__is_dummy(evlist->core.user_requested_cpus)) {
-		evlist__for_each_entry_reverse(evlist, evsel)
-			evsel__close(evsel);
-		return;
-	}
-
-	if (affinity__setup(&affinity) < 0)
-		return;
 
-	evlist__for_each_cpu(evlist_cpu_itr, evlist, &affinity) {
+	evlist__for_each_cpu(evlist_cpu_itr, evlist) {
+		if (evlist_cpu_itr.cpu_map_idx == 0 && evsel__is_retire_lat(evlist_cpu_itr.evsel))
+			evsel__tpebs_close(evlist_cpu_itr.evsel);
 		perf_evsel__close_cpu(&evlist_cpu_itr.evsel->core,
 				      evlist_cpu_itr.cpu_map_idx);
 	}
 
-	affinity__cleanup(&affinity);
 	evlist__for_each_entry_reverse(evlist, evsel) {
 		perf_evsel__free_fd(&evsel->core);
 		perf_evsel__free_id(&evsel->core);
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 5e71e3dc6042..b4604c3f03d6 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -10,6 +10,7 @@
 #include <internal/evlist.h>
 #include <internal/evsel.h>
 #include <perf/evlist.h>
+#include "affinity.h"
 #include "events_stats.h"
 #include "evsel.h"
 #include "rblist.h"
@@ -361,6 +362,8 @@ struct evlist_cpu_iterator {
 	struct perf_cpu cpu;
 	/** If present, used to set the affinity when switching between CPUs. */
 	struct affinity *affinity;
+	/** May be used to hold affinity state prior to iterating. */
+	struct affinity saved_affinity;
 };
 
 /**
@@ -368,22 +371,31 @@ struct evlist_cpu_iterator {
  *                        affinity, iterate over all CPUs and then the evlist
  *                        for each evsel on that CPU. When switching between
  *                        CPUs the affinity is set to the CPU to avoid IPIs
- *                        during syscalls.
+ *                        during syscalls. The affinity is set up and removed
+ *                        automatically; if the loop is broken, a call to
+ *                        evlist_cpu_iterator__exit is necessary.
  * @evlist_cpu_itr: the iterator instance.
  * @evlist: evlist instance to iterate.
- * @affinity: NULL or used to set the affinity to the current CPU.
  */
-#define evlist__for_each_cpu(evlist_cpu_itr, evlist, affinity)		\
-	for ((evlist_cpu_itr) = evlist__cpu_begin(evlist, affinity);	\
+#define evlist__for_each_cpu(evlist_cpu_itr, evlist)			\
+	for (evlist_cpu_iterator__init(&(evlist_cpu_itr), evlist);	\
 	     !evlist_cpu_iterator__end(&evlist_cpu_itr);		\
 	     evlist_cpu_iterator__next(&evlist_cpu_itr))
 
-/** Returns an iterator set to the first CPU/evsel of evlist. */
-struct evlist_cpu_iterator evlist__cpu_begin(struct evlist *evlist, struct affinity *affinity);
+/** Set up an iterator pointing to the first CPU/evsel of evlist. */
+void evlist_cpu_iterator__init(struct evlist_cpu_iterator *itr, struct evlist *evlist);
+/**
+ * Cleans up the iterator, automatically done by evlist_cpu_iterator__next when
+ * the end of the list is reached. Multiple calls are safe.
+ */
+void evlist_cpu_iterator__exit(struct evlist_cpu_iterator *itr);
 /** Move to next element in iterator, updating CPU, evsel and the affinity. */
 void evlist_cpu_iterator__next(struct evlist_cpu_iterator *evlist_cpu_itr);
 /** Returns true when iterator is at the end of the CPUs and evlist. */
-bool evlist_cpu_iterator__end(const struct evlist_cpu_iterator *evlist_cpu_itr);
+static inline bool evlist_cpu_iterator__end(const struct evlist_cpu_iterator *evlist_cpu_itr)
+{
+	return evlist_cpu_itr->evlist_cpu_map_idx >= evlist_cpu_itr->evlist_cpu_map_nr;
+}
 
 struct evsel *evlist__get_tracking_event(struct evlist *evlist);
 void evlist__set_tracking_event(struct evlist *evlist, struct evsel *tracking_evsel);
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index f14f2a12d061..e300a3b71bd6 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -2410,6 +2410,18 @@ bool perf_pmu__is_software(const struct perf_pmu *pmu)
 	return false;
 }
 
+bool perf_pmu__benefits_from_affinity(struct perf_pmu *pmu)
+{
+	if (!pmu)
+		return true; /* Assume a core PMU. */
+
+	/*
+	 * All perf event PMUs should benefit from accessing the perf event
+	 * contexts on the local CPU.
+	 */
+	return pmu->type <= PERF_PMU_TYPE_PE_END;
+}
+
 FILE *perf_pmu__open_file(const struct perf_pmu *pmu, const char *name)
 {
 	char path[PATH_MAX];
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 1ebcf0242af8..87e12a9a0e67 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -259,6 +259,7 @@ bool perf_pmu__name_no_suffix_match(const struct perf_pmu *pmu, const char *to_m
  *                        perf_sw_context in the kernel?
  */
 bool perf_pmu__is_software(const struct perf_pmu *pmu);
+bool perf_pmu__benefits_from_affinity(struct perf_pmu *pmu);
 
 FILE *perf_pmu__open_file(const struct perf_pmu *pmu, const char *name);
 FILE *perf_pmu__open_file_at(const struct perf_pmu *pmu, int dirfd, const char *name);
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 09/10] perf stat: Read tool events last
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (7 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 08/10] perf evlist: Reduce affinity use and move into iterator, fix no affinity Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-18  2:35   ` Namhyung Kim
  2025-11-13 18:05 ` [PATCH v4 10/10] perf stat: Add no-affinity flag Ian Rogers
  2025-11-18 18:00 ` [PATCH v4 00/10] perf stat fixes and improvements Namhyung Kim
  10 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

When reading a metric like memory bandwidth on multiple sockets, the
additional sockets will be on CPUs > 0. Because of the affinity-based
reading, the counters are read on CPU 0 along with the time, then the
later sockets are read. This can make the later sockets appear to have
a bandwidth larger than is possible for the period of time. To avoid
this, move the reading of tool events to occur after all other events
are read.
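
An illustration with assumed numbers (not from a real run): in a 100ms
interval, duration_time is read on CPU 0 at the start of the read pass
and socket 1's counters ~5ms later. Socket 1's byte count then covers
~105ms of traffic but is divided by the 100ms duration, inflating the
reported bandwidth by ~5%, potentially past the theoretical peak.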

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c | 29 ++++++++++++++++++++++++++++-
 tools/perf/util/evlist.c  |  4 ----
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 947f11b8b106..aec93b91fd11 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -379,6 +379,9 @@ static int read_counters_with_affinity(void)
 		if (evsel__is_bpf(counter))
 			continue;
 
+		if (evsel__is_tool(counter))
+			continue;
+
 		if (!counter->err)
 			counter->err = read_counter_cpu(counter, evlist_cpu_itr.cpu_map_idx);
 	}
@@ -402,6 +405,24 @@ static int read_bpf_map_counters(void)
 	return 0;
 }
 
+static int read_tool_counters(void)
+{
+	struct evsel *counter;
+
+	evlist__for_each_entry(evsel_list, counter) {
+		int idx;
+
+		if (!evsel__is_tool(counter))
+			continue;
+
+		perf_cpu_map__for_each_idx(idx, counter->core.cpus) {
+			if (!counter->err)
+				counter->err = read_counter_cpu(counter, idx);
+		}
+	}
+	return 0;
+}
+
 static int read_counters(void)
 {
 	int ret;
@@ -415,7 +436,13 @@ static int read_counters(void)
 		return ret;
 
 	// Read non-BPF and non-tool counters next.
-	return read_counters_with_affinity();
+	ret = read_counters_with_affinity();
+	if (ret)
+		return ret;
+
+	// Read the tool counters last. This way the duration_time counter
+	// should always be greater than any other counter's enabled time.
+	return read_tool_counters();
 }
 
 static void process_counters(void)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index b6df81b8a236..fc3dae7cdfca 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -368,10 +368,6 @@ static bool evlist__use_affinity(struct evlist *evlist)
 	struct perf_cpu_map *used_cpus = NULL;
 	bool ret = false;
 
-	/*
-	 * With perf record core.user_requested_cpus is usually NULL.
-	 * Use the old method to handle this for now.
-	 */
 	if (!evlist->core.user_requested_cpus ||
 	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
 		return false;
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 10/10] perf stat: Add no-affinity flag
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (8 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 09/10] perf stat: Read tool events last Ian Rogers
@ 2025-11-13 18:05 ` Ian Rogers
  2025-11-18  2:40   ` Namhyung Kim
  2025-11-18 18:00 ` [PATCH v4 00/10] perf stat fixes and improvements Namhyung Kim
  10 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-13 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Dr. David Alan Gilbert, Yang Li, James Clark,
	Thomas Falcon, Thomas Richter, linux-perf-users, linux-kernel,
	Andi Kleen, Dapeng Mi

Add a command line flag that disables the affinity behavior. Using
sched_setaffinity to place a perf thread on a CPU can avoid certain
interprocessor interrupts, but may introduce a delay due to the
scheduling, particularly on loaded machines. The optimization is less
relevant in other tools like `perf record`, which uses a ring buffer
and doesn't make repeated system calls.
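
A minimal sketch of the pattern being made optional (illustrative only,
using the raw syscalls rather than perf's affinity helpers; needs
_GNU_SOURCE and <sched.h>):

	int cpu = 0;              /* CPU whose events are being read */
	cpu_set_t saved, target;

	sched_getaffinity(0, sizeof(saved), &saved);   /* remember the mask */
	CPU_ZERO(&target);
	CPU_SET(cpu, &target);
	sched_setaffinity(0, sizeof(target), &target); /* may wait on the scheduler */
	/* read(2) of the event fd now runs on the local CPU, avoiding an IPI */
	sched_setaffinity(0, sizeof(saved), &saved);   /* restore the old mask */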

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/Documentation/perf-stat.txt | 4 ++++
 tools/perf/builtin-stat.c              | 6 ++++++
 tools/perf/util/evlist.c               | 2 +-
 tools/perf/util/evlist.h               | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 1a766d4a2233..1ffb510606af 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -382,6 +382,10 @@ color the metric's computed value.
 Don't print output, warnings or messages. This is useful with perf stat
 record below to only write data to the perf.data file.
 
+--no-affinity::
+Don't change scheduler affinities when iterating over CPUs. Disables
+an optimization aimed at minimizing interprocessor interrupts.
+
 STAT RECORD
 -----------
 Stores stat data into perf data file.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index aec93b91fd11..fa42b08bd1df 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2415,6 +2415,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
 int cmd_stat(int argc, const char **argv)
 {
 	struct opt_aggr_mode opt_mode = {};
+	bool no_affinity = false;
 	struct option stat_options[] = {
 		OPT_BOOLEAN('T', "transaction", &transaction_run,
 			"hardware transaction statistics"),
@@ -2543,6 +2544,8 @@ int cmd_stat(int argc, const char **argv)
 			"don't print 'summary' for CSV summary output"),
 		OPT_BOOLEAN(0, "quiet", &quiet,
 			"don't print any output, messages or warnings (useful with record)"),
+		OPT_BOOLEAN(0, "no-affinity", &no_affinity,
+			"don't allow affinity optimizations aimed at reducing IPIs"),
 		OPT_CALLBACK(0, "cputype", &evsel_list, "hybrid cpu type",
 			"Only enable events on applying cpu with this type "
 			"for hybrid platform (e.g. core or atom)",
@@ -2600,6 +2603,9 @@ int cmd_stat(int argc, const char **argv)
 	} else
 		stat_config.csv_sep = DEFAULT_SEPARATOR;
 
+	if (no_affinity)
+		evsel_list->no_affinity = true;
+
 	if (argc && strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
 		argc = __cmd_record(stat_options, &opt_mode, argc, argv);
 		if (argc < 0)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index fc3dae7cdfca..53c8e974de8b 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -368,7 +368,7 @@ static bool evlist__use_affinity(struct evlist *evlist)
 	struct perf_cpu_map *used_cpus = NULL;
 	bool ret = false;
 
-	if (!evlist->core.user_requested_cpus ||
+	if (evlist->no_affinity || !evlist->core.user_requested_cpus ||
 	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
 		return false;
 
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index b4604c3f03d6..c7ba0e0b2219 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -59,6 +59,7 @@ struct event_enable_timer;
 struct evlist {
 	struct perf_evlist core;
 	bool		 enabled;
+	bool		 no_affinity;
 	int		 id_pos;
 	int		 is_pos;
 	int		 nr_br_cntr;
-- 
2.51.2.1041.gc1ab5b90ca-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 04/10] perf stat-shadow: Read tool events directly
  2025-11-13 18:05 ` [PATCH v4 04/10] perf stat-shadow: Read tool events directly Ian Rogers
@ 2025-11-18  2:30   ` Namhyung Kim
  2025-11-18  4:36     ` Ian Rogers
  0 siblings, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  2:30 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Thu, Nov 13, 2025 at 10:05:10AM -0800, Ian Rogers wrote:
> When reading time values for metrics don't use the globals updated in
> builtin-stat, just read the events as regular events. The only
> exception is for time events where nanoseconds need converting to
> seconds as metrics assume time metrics are in seconds.
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/stat-shadow.c | 149 +++++++++++++++-------------------
>  1 file changed, 66 insertions(+), 83 deletions(-)
> 
> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> index b3b482e1808f..6c1ad78604e1 100644
> --- a/tools/perf/util/stat-shadow.c
> +++ b/tools/perf/util/stat-shadow.c
> @@ -26,7 +26,32 @@ void perf_stat__reset_shadow_stats(void)
>  	memset(&ru_stats, 0, sizeof(ru_stats));
>  }
>  
> -static int prepare_metric(const struct metric_expr *mexp,
> +static bool tool_pmu__is_time_event(const struct perf_stat_config *config,
> +				   const struct evsel *evsel, int *tool_aggr_idx)
> +{
> +	enum tool_pmu_event event = evsel__tool_event(evsel);
> +	int aggr_idx;
> +
> +	if (event != TOOL_PMU__EVENT_DURATION_TIME &&
> +	    event != TOOL_PMU__EVENT_USER_TIME &&
> +	    event != TOOL_PMU__EVENT_SYSTEM_TIME)
> +		return false;
> +
> +	if (config) {
> +		cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) {
> +			if (config->aggr_map->map[aggr_idx].cpu.cpu == 0) {
> +				*tool_aggr_idx = aggr_idx;
> +				return true;
> +			}
> +		}
> +		pr_debug("Unexpected CPU0 missing in aggregation for tool event.\n");
> +	}
> +	*tool_aggr_idx = 0; /* Assume the first aggregation index works. */
> +	return true;
> +}
> +
> +static int prepare_metric(struct perf_stat_config *config,
> +			  const struct metric_expr *mexp,
>  			  const struct evsel *evsel,
>  			  struct expr_parse_ctx *pctx,
>  			  int aggr_idx)
> @@ -36,93 +61,51 @@ static int prepare_metric(const struct metric_expr *mexp,
>  	int i;
>  
>  	for (i = 0; metric_events[i]; i++) {
> +		int source_count = 0, tool_aggr_idx;
> +		bool is_tool_time =
> +			tool_pmu__is_time_event(config, metric_events[i], &tool_aggr_idx);
> +		struct perf_stat_evsel *ps = metric_events[i]->stats;
> +		struct perf_stat_aggr *aggr;
>  		char *n;
>  		double val;
> -		int source_count = 0;
>  
> -		if (evsel__is_tool(metric_events[i])) {
> -			struct stats *stats;
> -			double scale;
> -
> -			switch (evsel__tool_event(metric_events[i])) {
> -			case TOOL_PMU__EVENT_DURATION_TIME:
> -				stats = &walltime_nsecs_stats;
> -				scale = 1e-9;
> -				break;
> -			case TOOL_PMU__EVENT_USER_TIME:
> -				stats = &ru_stats.ru_utime_usec_stat;
> -				scale = 1e-6;
> +		/*
> +		 * If there are multiple uncore PMUs and we're not reading the
> +		 * leader's stats, determine the stats for the appropriate
> +		 * uncore PMU.
> +		 */
> +		if (evsel && evsel->metric_leader &&
> +		    evsel->pmu != evsel->metric_leader->pmu &&
> +		    mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
> +			struct evsel *pos;
> +
> +			evlist__for_each_entry(evsel->evlist, pos) {
> +				if (pos->pmu != evsel->pmu)
> +					continue;
> +				if (pos->metric_leader != mexp->metric_events[i])
> +					continue;
> +				ps = pos->stats;
> +				source_count = 1;
>  				break;
> -			case TOOL_PMU__EVENT_SYSTEM_TIME:
> -				stats = &ru_stats.ru_stime_usec_stat;
> -				scale = 1e-6;
> -				break;

I think this was broken, as the value seems to be converted to nsec in
update_rusage_stats().
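
(For reference, the update_rusage_stats() quoted elsewhere in this
thread computes tv_usec * 1000 + tv_sec * 1e9, i.e. a value in
nanoseconds despite the *_usec_stat field names.)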


> -			case TOOL_PMU__EVENT_NONE:
> -				pr_err("Invalid tool event 'none'");
> -				abort();
> -			case TOOL_PMU__EVENT_MAX:
> -				pr_err("Invalid tool event 'max'");
> -				abort();
> -			case TOOL_PMU__EVENT_HAS_PMEM:
> -			case TOOL_PMU__EVENT_NUM_CORES:
> -			case TOOL_PMU__EVENT_NUM_CPUS:
> -			case TOOL_PMU__EVENT_NUM_CPUS_ONLINE:
> -			case TOOL_PMU__EVENT_NUM_DIES:
> -			case TOOL_PMU__EVENT_NUM_PACKAGES:
> -			case TOOL_PMU__EVENT_SLOTS:
> -			case TOOL_PMU__EVENT_SMT_ON:
> -			case TOOL_PMU__EVENT_SYSTEM_TSC_FREQ:
> -			case TOOL_PMU__EVENT_CORE_WIDE:
> -			case TOOL_PMU__EVENT_TARGET_CPU:
> -			default:
> -				pr_err("Unexpected tool event '%s'", evsel__name(metric_events[i]));
> -				abort();
>  			}
> -			val = avg_stats(stats) * scale;
> -			source_count = 1;
> -		} else {
> -			struct perf_stat_evsel *ps = metric_events[i]->stats;
> -			struct perf_stat_aggr *aggr;
> -
> +		}
> +		/* Time events are always on CPU0, the first aggregation index. */
> +		aggr = &ps->aggr[is_tool_time ? tool_aggr_idx : aggr_idx];
> +		if (!aggr || !metric_events[i]->supported) {
>  			/*
> -			 * If there are multiple uncore PMUs and we're not
> -			 * reading the leader's stats, determine the stats for
> -			 * the appropriate uncore PMU.
> +			 * Not supported events will have a count of 0, which
> +			 * can be confusing in a metric. Explicitly set the
> +			 * value to NAN. Not counted events (enable time of 0)
> +			 * are read as 0.
>  			 */
> -			if (evsel && evsel->metric_leader &&
> -			    evsel->pmu != evsel->metric_leader->pmu &&
> -			    mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
> -				struct evsel *pos;
> -
> -				evlist__for_each_entry(evsel->evlist, pos) {
> -					if (pos->pmu != evsel->pmu)
> -						continue;
> -					if (pos->metric_leader != mexp->metric_events[i])
> -						continue;
> -					ps = pos->stats;
> -					source_count = 1;
> -					break;
> -				}
> -			}
> -			aggr = &ps->aggr[aggr_idx];
> -			if (!aggr)
> -				break;
> -
> -			if (!metric_events[i]->supported) {
> -				/*
> -				 * Not supported events will have a count of 0,
> -				 * which can be confusing in a
> -				 * metric. Explicitly set the value to NAN. Not
> -				 * counted events (enable time of 0) are read as
> -				 * 0.
> -				 */
> -				val = NAN;
> -				source_count = 0;
> -			} else {
> -				val = aggr->counts.val;
> -				if (!source_count)
> -					source_count = evsel__source_count(metric_events[i]);
> -			}
> +			val = NAN;
> +			source_count = 0;
> +		} else {
> +			val = aggr->counts.val;
> +			if (is_tool_time)
> +				val *= 1e-9; /* Convert time event nanoseconds to seconds. */

And this code treats all time events as being in nsec now.

Thanks,
Namhyung


> +			if (!source_count)
> +				source_count = evsel__source_count(metric_events[i]);
>  		}
>  		n = strdup(evsel__metric_id(metric_events[i]));
>  		if (!n)
> @@ -168,7 +151,7 @@ static void generic_metric(struct perf_stat_config *config,
>  		pctx->sctx.user_requested_cpu_list = strdup(config->user_requested_cpu_list);
>  	pctx->sctx.runtime = runtime;
>  	pctx->sctx.system_wide = config->system_wide;
> -	i = prepare_metric(mexp, evsel, pctx, aggr_idx);
> +	i = prepare_metric(config, mexp, evsel, pctx, aggr_idx);
>  	if (i < 0) {
>  		expr__ctx_free(pctx);
>  		return;
> @@ -229,7 +212,7 @@ double test_generic_metric(struct metric_expr *mexp, int aggr_idx)
>  	if (!pctx)
>  		return NAN;
>  
> -	if (prepare_metric(mexp, /*evsel=*/NULL, pctx, aggr_idx) < 0)
> +	if (prepare_metric(/*config=*/NULL, mexp, /*evsel=*/NULL, pctx, aggr_idx) < 0)
>  		goto out;
>  
>  	if (expr__parse(&ratio, pctx, mexp->metric_expr))
> -- 
> 2.51.2.1041.gc1ab5b90ca-goog
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 05/10] perf stat: Reduce scope of ru_stats
  2025-11-13 18:05 ` [PATCH v4 05/10] perf stat: Reduce scope of ru_stats Ian Rogers
@ 2025-11-18  2:31   ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  2:31 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Thu, Nov 13, 2025 at 10:05:11AM -0800, Ian Rogers wrote:
> The ru_stats are used to capture user and system time stats when a
> process exits. These are then applied to user and system time tool
> events if their reads fail due to the process terminating. Reduce the
> scope now the metric code no longer reads these values.
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-stat.c     | 19 ++++++++++++++++++-
>  tools/perf/util/config.c      |  1 -
>  tools/perf/util/stat-shadow.c |  2 --
>  tools/perf/util/stat.h        | 21 ---------------------
>  4 files changed, 18 insertions(+), 25 deletions(-)
> 
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 654f840f7a2f..d6f4c84f7d7e 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -104,6 +104,11 @@
>  #define DEFAULT_SEPARATOR	" "
>  #define FREEZE_ON_SMI_PATH	"bus/event_source/devices/cpu/freeze_on_smi"
>  
> +struct rusage_stats {
> +	struct stats ru_utime_usec_stat;
> +	struct stats ru_stime_usec_stat;

Maybe we need to rename it to ru_[us]time_nsec_stat now?
But it can be a separate change.

Thanks,
Namhyung


> +};
> +
>  static void print_counters(struct timespec *ts, int argc, const char **argv);
>  
>  static struct evlist	*evsel_list;
> @@ -133,6 +138,7 @@ static bool			interval_count;
>  static const char		*output_name;
>  static int			output_fd;
>  static char			*metrics;
> +static struct rusage_stats	ru_stats;
>  
>  struct perf_stat {
>  	bool			 record;
> @@ -730,6 +736,17 @@ static int create_perf_stat_counter(struct evsel *evsel,
>  					      evsel->core.threads);
>  }
>  
> +static void update_rusage_stats(const struct rusage *rusage)
> +{
> +	const u64 us_to_ns = 1000;
> +	const u64 s_to_ns = 1000000000;
> +
> +	update_stats(&ru_stats.ru_utime_usec_stat,
> +		(rusage->ru_utime.tv_usec * us_to_ns + rusage->ru_utime.tv_sec * s_to_ns));
> +	update_stats(&ru_stats.ru_stime_usec_stat,
> +		(rusage->ru_stime.tv_usec * us_to_ns + rusage->ru_stime.tv_sec * s_to_ns));
> +}
> +
>  static int __run_perf_stat(int argc, const char **argv, int run_idx)
>  {
>  	int interval = stat_config.interval;
> @@ -979,7 +996,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>  		evlist__reset_aggr_stats(evsel_list);
>  	} else {
>  		update_stats(&walltime_nsecs_stats, t1 - t0);
> -		update_rusage_stats(&ru_stats, &stat_config.ru_data);
> +		update_rusage_stats(&stat_config.ru_data);
>  	}
>  
>  	/*
> diff --git a/tools/perf/util/config.c b/tools/perf/util/config.c
> index 6f914620c6ff..cc0746f494f4 100644
> --- a/tools/perf/util/config.c
> +++ b/tools/perf/util/config.c
> @@ -45,7 +45,6 @@ struct perf_stat_config stat_config = {
>  	.run_count		= 1,
>  	.metric_only_len	= METRIC_ONLY_LEN,
>  	.walltime_nsecs_stats	= &walltime_nsecs_stats,
> -	.ru_stats		= &ru_stats,
>  	.big_num		= true,
>  	.ctl_fd			= -1,
>  	.ctl_fd_ack		= -1,
> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> index 6c1ad78604e1..cb7c741a1ebb 100644
> --- a/tools/perf/util/stat-shadow.c
> +++ b/tools/perf/util/stat-shadow.c
> @@ -18,12 +18,10 @@
>  #include "tool_pmu.h"
>  
>  struct stats walltime_nsecs_stats;
> -struct rusage_stats ru_stats;
>  
>  void perf_stat__reset_shadow_stats(void)
>  {
>  	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
> -	memset(&ru_stats, 0, sizeof(ru_stats));
>  }
>  
>  static bool tool_pmu__is_time_event(const struct perf_stat_config *config,
> diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
> index b42da4a29c44..055b95d18106 100644
> --- a/tools/perf/util/stat.h
> +++ b/tools/perf/util/stat.h
> @@ -56,11 +56,6 @@ enum aggr_mode {
>  	AGGR_MAX
>  };
>  
> -struct rusage_stats {
> -	struct stats ru_utime_usec_stat;
> -	struct stats ru_stime_usec_stat;
> -};
> -
>  typedef struct aggr_cpu_id (*aggr_get_id_t)(struct perf_stat_config *config, struct perf_cpu cpu);
>  
>  struct perf_stat_config {
> @@ -102,7 +97,6 @@ struct perf_stat_config {
>  	const char		*csv_sep;
>  	struct stats		*walltime_nsecs_stats;
>  	struct rusage		 ru_data;
> -	struct rusage_stats		 *ru_stats;
>  	struct cpu_aggr_map	*aggr_map;
>  	aggr_get_id_t		 aggr_get_id;
>  	struct cpu_aggr_map	*cpus_aggr_map;
> @@ -132,25 +126,10 @@ static inline void init_stats(struct stats *stats)
>  	stats->max  = 0;
>  }
>  
> -static inline void init_rusage_stats(struct rusage_stats *ru_stats) {
> -	init_stats(&ru_stats->ru_utime_usec_stat);
> -	init_stats(&ru_stats->ru_stime_usec_stat);
> -}
> -
> -static inline void update_rusage_stats(struct rusage_stats *ru_stats, struct rusage* rusage) {
> -	const u64 us_to_ns = 1000;
> -	const u64 s_to_ns = 1000000000;
> -	update_stats(&ru_stats->ru_utime_usec_stat,
> -	             (rusage->ru_utime.tv_usec * us_to_ns + rusage->ru_utime.tv_sec * s_to_ns));
> -	update_stats(&ru_stats->ru_stime_usec_stat,
> -	             (rusage->ru_stime.tv_usec * us_to_ns + rusage->ru_stime.tv_sec * s_to_ns));
> -}
> -
>  struct evsel;
>  struct evlist;
>  
>  extern struct stats walltime_nsecs_stats;
> -extern struct rusage_stats ru_stats;
>  
>  enum metric_threshold_classify {
>  	METRIC_THRESHOLD_UNKNOWN,
> -- 
> 2.51.2.1041.gc1ab5b90ca-goog
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 09/10] perf stat: Read tool events last
  2025-11-13 18:05 ` [PATCH v4 09/10] perf stat: Read tool events last Ian Rogers
@ 2025-11-18  2:35   ` Namhyung Kim
  2025-11-18  4:38     ` Ian Rogers
  0 siblings, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  2:35 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Thu, Nov 13, 2025 at 10:05:15AM -0800, Ian Rogers wrote:
> When reading a metric like memory bandwidth on multiple sockets, the
> additional sockets will be on CPUS > 0. Because of the affinity
> reading, the counters are read on CPU 0 along with the time, then the
> later sockets are read. This can lead to the later sockets having a
> bandwidth larger than is possible for the period of time. To avoid
> this moving the reading of tool events to occur after all other events
> are read.

Can you move this change before the affinity updates?  I think it's
straightforward and can be applied independently.

Thanks,
Namhyung

> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-stat.c | 29 ++++++++++++++++++++++++++++-
>  tools/perf/util/evlist.c  |  4 ----
>  2 files changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 947f11b8b106..aec93b91fd11 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -379,6 +379,9 @@ static int read_counters_with_affinity(void)
>  		if (evsel__is_bpf(counter))
>  			continue;
>  
> +		if (evsel__is_tool(counter))
> +			continue;
> +
>  		if (!counter->err)
>  			counter->err = read_counter_cpu(counter, evlist_cpu_itr.cpu_map_idx);
>  	}
> @@ -402,6 +405,24 @@ static int read_bpf_map_counters(void)
>  	return 0;
>  }
>  
> +static int read_tool_counters(void)
> +{
> +	struct evsel *counter;
> +
> +	evlist__for_each_entry(evsel_list, counter) {
> +		int idx;
> +
> +		if (!evsel__is_tool(counter))
> +			continue;
> +
> +		perf_cpu_map__for_each_idx(idx, counter->core.cpus) {
> +			if (!counter->err)
> +				counter->err = read_counter_cpu(counter, idx);
> +		}
> +	}
> +	return 0;
> +}
> +
>  static int read_counters(void)
>  {
>  	int ret;
> @@ -415,7 +436,13 @@ static int read_counters(void)
>  		return ret;
>  
>  	// Read non-BPF and non-tool counters next.
> -	return read_counters_with_affinity();
> +	ret = read_counters_with_affinity();
> +	if (ret)
> +		return ret;
> +
> +	// Read the tool counters last. This way the duration_time counter
> +	// should always be greater than any other counter's enabled time.
> +	return read_tool_counters();
>  }
>  
>  static void process_counters(void)
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index b6df81b8a236..fc3dae7cdfca 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -368,10 +368,6 @@ static bool evlist__use_affinity(struct evlist *evlist)
>  	struct perf_cpu_map *used_cpus = NULL;
>  	bool ret = false;
>  
> -	/*
> -	 * With perf record core.user_requested_cpus is usually NULL.
> -	 * Use the old method to handle this for now.
> -	 */
>  	if (!evlist->core.user_requested_cpus ||
>  	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
>  		return false;
> -- 
> 2.51.2.1041.gc1ab5b90ca-goog
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 10/10] perf stat: Add no-affinity flag
  2025-11-13 18:05 ` [PATCH v4 10/10] perf stat: Add no-affinity flag Ian Rogers
@ 2025-11-18  2:40   ` Namhyung Kim
  2025-11-18  4:32     ` Ian Rogers
  0 siblings, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  2:40 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Thu, Nov 13, 2025 at 10:05:16AM -0800, Ian Rogers wrote:
> Add flag that disables affinity behavior. Using sched_setaffinity to
> place a perf thread on a CPU can avoid certain interprocessor
> interrupts but may introduce a delay due to the scheduling,
> particularly on loaded machines. Add a command line option to disable
> the behavior. This behavior is less present in other tools like `perf
> record`, as it uses a ring buffer and doesn't make repeated system
> calls.
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/Documentation/perf-stat.txt | 4 ++++
>  tools/perf/builtin-stat.c              | 6 ++++++
>  tools/perf/util/evlist.c               | 2 +-
>  tools/perf/util/evlist.h               | 1 +
>  4 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
> index 1a766d4a2233..1ffb510606af 100644
> --- a/tools/perf/Documentation/perf-stat.txt
> +++ b/tools/perf/Documentation/perf-stat.txt
> @@ -382,6 +382,10 @@ color the metric's computed value.
>  Don't print output, warnings or messages. This is useful with perf stat
>  record below to only write data to the perf.data file.
>  
> +--no-affinity::
> +Don't change scheduler affinities when iterating over CPUs. Disables
> +an optimization aimed at minimizing interprocessor interrupts.
> +
>  STAT RECORD
>  -----------
>  Stores stat data into perf data file.
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index aec93b91fd11..fa42b08bd1df 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -2415,6 +2415,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
>  int cmd_stat(int argc, const char **argv)
>  {
>  	struct opt_aggr_mode opt_mode = {};
> +	bool no_affinity = false;
>  	struct option stat_options[] = {
>  		OPT_BOOLEAN('T', "transaction", &transaction_run,
>  			"hardware transaction statistics"),
> @@ -2543,6 +2544,8 @@ int cmd_stat(int argc, const char **argv)
>  			"don't print 'summary' for CSV summary output"),
>  		OPT_BOOLEAN(0, "quiet", &quiet,
>  			"don't print any output, messages or warnings (useful with record)"),
> +		OPT_BOOLEAN(0, "no-affinity", &no_affinity,
> +			"don't allow affinity optimizations aimed at reducing IPIs"),

I know you want to add an option to disable the behavior, but I think
it'd be better to have a positive option like just '--affinity'.  Then
we will have '--no-affinity' for free. :)  The current form would
allow '--no-no-affinity'.

Then the variable also can be 'enable_affinity' or so.

You can mention --no-affinity in the help message and the man page
document so that users can discover the intention.
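
For example, a sketch of how that could look (hypothetical wording,
leaning on parse-options generating the 'no-' negation of boolean
flags automatically):

	bool affinity = true;
	...
	OPT_BOOLEAN(0, "affinity", &affinity,
		"use affinity optimizations to reduce IPIs (--no-affinity disables)"),
	...
	if (!affinity)
		evsel_list->no_affinity = true;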

Thanks,
Namhyung


>  		OPT_CALLBACK(0, "cputype", &evsel_list, "hybrid cpu type",
>  			"Only enable events on applying cpu with this type "
>  			"for hybrid platform (e.g. core or atom)",
> @@ -2600,6 +2603,9 @@ int cmd_stat(int argc, const char **argv)
>  	} else
>  		stat_config.csv_sep = DEFAULT_SEPARATOR;
>  
> +	if (no_affinity)
> +		evsel_list->no_affinity = true;
> +
>  	if (argc && strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
>  		argc = __cmd_record(stat_options, &opt_mode, argc, argv);
>  		if (argc < 0)
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index fc3dae7cdfca..53c8e974de8b 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -368,7 +368,7 @@ static bool evlist__use_affinity(struct evlist *evlist)
>  	struct perf_cpu_map *used_cpus = NULL;
>  	bool ret = false;
>  
> -	if (!evlist->core.user_requested_cpus ||
> +	if (evlist->no_affinity || !evlist->core.user_requested_cpus ||
>  	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
>  		return false;
>  
> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> index b4604c3f03d6..c7ba0e0b2219 100644
> --- a/tools/perf/util/evlist.h
> +++ b/tools/perf/util/evlist.h
> @@ -59,6 +59,7 @@ struct event_enable_timer;
>  struct evlist {
>  	struct perf_evlist core;
>  	bool		 enabled;
> +	bool		 no_affinity;
>  	int		 id_pos;
>  	int		 is_pos;
>  	int		 nr_br_cntr;
> -- 
> 2.51.2.1041.gc1ab5b90ca-goog
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 10/10] perf stat: Add no-affinity flag
  2025-11-18  2:40   ` Namhyung Kim
@ 2025-11-18  4:32     ` Ian Rogers
  2025-11-18  6:50       ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-18  4:32 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Mon, Nov 17, 2025 at 6:41 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Thu, Nov 13, 2025 at 10:05:16AM -0800, Ian Rogers wrote:
> > Add flag that disables affinity behavior. Using sched_setaffinity to
> > place a perf thread on a CPU can avoid certain interprocessor
> > interrupts but may introduce a delay due to the scheduling,
> > particularly on loaded machines. Add a command line option to disable
> > the behavior. This behavior is less present in other tools like `perf
> > record`, as it uses a ring buffer and doesn't make repeated system
> > calls.
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/Documentation/perf-stat.txt | 4 ++++
> >  tools/perf/builtin-stat.c              | 6 ++++++
> >  tools/perf/util/evlist.c               | 2 +-
> >  tools/perf/util/evlist.h               | 1 +
> >  4 files changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
> > index 1a766d4a2233..1ffb510606af 100644
> > --- a/tools/perf/Documentation/perf-stat.txt
> > +++ b/tools/perf/Documentation/perf-stat.txt
> > @@ -382,6 +382,10 @@ color the metric's computed value.
> >  Don't print output, warnings or messages. This is useful with perf stat
> >  record below to only write data to the perf.data file.
> >
> > +--no-affinity::
> > +Don't change scheduler affinities when iterating over CPUs. Disables
> > +an optimization aimed at minimizing interprocessor interrupts.
> > +
> >  STAT RECORD
> >  -----------
> >  Stores stat data into perf data file.
> > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > index aec93b91fd11..fa42b08bd1df 100644
> > --- a/tools/perf/builtin-stat.c
> > +++ b/tools/perf/builtin-stat.c
> > @@ -2415,6 +2415,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
> >  int cmd_stat(int argc, const char **argv)
> >  {
> >       struct opt_aggr_mode opt_mode = {};
> > +     bool no_affinity = false;
> >       struct option stat_options[] = {
> >               OPT_BOOLEAN('T', "transaction", &transaction_run,
> >                       "hardware transaction statistics"),
> > @@ -2543,6 +2544,8 @@ int cmd_stat(int argc, const char **argv)
> >                       "don't print 'summary' for CSV summary output"),
> >               OPT_BOOLEAN(0, "quiet", &quiet,
> >                       "don't print any output, messages or warnings (useful with record)"),
> > +             OPT_BOOLEAN(0, "no-affinity", &no_affinity,
> > +                     "don't allow affinity optimizations aimed at reducing IPIs"),
>
> I know you want to add an option to disable the behaivor, but I think
> it'd better to have a positive option like just '--affinity'.  Then we
> will have '--no-affinity' for free. :)  The current form will allow
> '--no-no-affinity'.
>
> Then the variable also can be 'enable_affinity' or so.
>
> You can mention --no-affinity in the help message and the man page
> document so that users can discover the intention.

I was trying to keep the code the same as the flag. I don't want to
have an affinity flag that needs setting to true in, say, the evlist,
as that would need updating in all users of the evlist or become a
behavioral change. We can have the evlist keep no_affinity and invert
the flag; it just looks awkward imo.

Thanks,
Ian

> Thanks,
> Namhyung
>
>
> >               OPT_CALLBACK(0, "cputype", &evsel_list, "hybrid cpu type",
> >                       "Only enable events on applying cpu with this type "
> >                       "for hybrid platform (e.g. core or atom)",
> > @@ -2600,6 +2603,9 @@ int cmd_stat(int argc, const char **argv)
> >       } else
> >               stat_config.csv_sep = DEFAULT_SEPARATOR;
> >
> > +     if (no_affinity)
> > +             evsel_list->no_affinity = true;
> > +
> >       if (argc && strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
> >               argc = __cmd_record(stat_options, &opt_mode, argc, argv);
> >               if (argc < 0)
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index fc3dae7cdfca..53c8e974de8b 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -368,7 +368,7 @@ static bool evlist__use_affinity(struct evlist *evlist)
> >       struct perf_cpu_map *used_cpus = NULL;
> >       bool ret = false;
> >
> > -     if (!evlist->core.user_requested_cpus ||
> > +     if (evlist->no_affinity || !evlist->core.user_requested_cpus ||
> >           cpu_map__is_dummy(evlist->core.user_requested_cpus))
> >               return false;
> >
> > diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> > index b4604c3f03d6..c7ba0e0b2219 100644
> > --- a/tools/perf/util/evlist.h
> > +++ b/tools/perf/util/evlist.h
> > @@ -59,6 +59,7 @@ struct event_enable_timer;
> >  struct evlist {
> >       struct perf_evlist core;
> >       bool             enabled;
> > +     bool             no_affinity;
> >       int              id_pos;
> >       int              is_pos;
> >       int              nr_br_cntr;
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 04/10] perf stat-shadow: Read tool events directly
  2025-11-18  2:30   ` Namhyung Kim
@ 2025-11-18  4:36     ` Ian Rogers
  2025-11-18  6:45       ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-18  4:36 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Mon, Nov 17, 2025 at 6:30 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Thu, Nov 13, 2025 at 10:05:10AM -0800, Ian Rogers wrote:
> > When reading time values for metrics don't use the globals updated in
> > builtin-stat, just read the events as regular events. The only
> > exception is for time events where nanoseconds need converting to
> > seconds as metrics assume time metrics are in seconds.
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/util/stat-shadow.c | 149 +++++++++++++++-------------------
> >  1 file changed, 66 insertions(+), 83 deletions(-)
> >
> > diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> > index b3b482e1808f..6c1ad78604e1 100644
> > --- a/tools/perf/util/stat-shadow.c
> > +++ b/tools/perf/util/stat-shadow.c
> > @@ -26,7 +26,32 @@ void perf_stat__reset_shadow_stats(void)
> >       memset(&ru_stats, 0, sizeof(ru_stats));
> >  }
> >
> > -static int prepare_metric(const struct metric_expr *mexp,
> > +static bool tool_pmu__is_time_event(const struct perf_stat_config *config,
> > +                                const struct evsel *evsel, int *tool_aggr_idx)
> > +{
> > +     enum tool_pmu_event event = evsel__tool_event(evsel);
> > +     int aggr_idx;
> > +
> > +     if (event != TOOL_PMU__EVENT_DURATION_TIME &&
> > +         event != TOOL_PMU__EVENT_USER_TIME &&
> > +         event != TOOL_PMU__EVENT_SYSTEM_TIME)
> > +             return false;
> > +
> > +     if (config) {
> > +             cpu_aggr_map__for_each_idx(aggr_idx, config->aggr_map) {
> > +                     if (config->aggr_map->map[aggr_idx].cpu.cpu == 0) {
> > +                             *tool_aggr_idx = aggr_idx;
> > +                             return true;
> > +                     }
> > +             }
> > +             pr_debug("Unexpected CPU0 missing in aggregation for tool event.\n");
> > +     }
> > +     *tool_aggr_idx = 0; /* Assume the first aggregation index works. */
> > +     return true;
> > +}
> > +
> > +static int prepare_metric(struct perf_stat_config *config,
> > +                       const struct metric_expr *mexp,
> >                         const struct evsel *evsel,
> >                         struct expr_parse_ctx *pctx,
> >                         int aggr_idx)
> > @@ -36,93 +61,51 @@ static int prepare_metric(const struct metric_expr *mexp,
> >       int i;
> >
> >       for (i = 0; metric_events[i]; i++) {
> > +             int source_count = 0, tool_aggr_idx;
> > +             bool is_tool_time =
> > +                     tool_pmu__is_time_event(config, metric_events[i], &tool_aggr_idx);
> > +             struct perf_stat_evsel *ps = metric_events[i]->stats;
> > +             struct perf_stat_aggr *aggr;
> >               char *n;
> >               double val;
> > -             int source_count = 0;
> >
> > -             if (evsel__is_tool(metric_events[i])) {
> > -                     struct stats *stats;
> > -                     double scale;
> > -
> > -                     switch (evsel__tool_event(metric_events[i])) {
> > -                     case TOOL_PMU__EVENT_DURATION_TIME:
> > -                             stats = &walltime_nsecs_stats;
> > -                             scale = 1e-9;
> > -                             break;
> > -                     case TOOL_PMU__EVENT_USER_TIME:
> > -                             stats = &ru_stats.ru_utime_usec_stat;
> > -                             scale = 1e-6;
> > +             /*
> > +              * If there are multiple uncore PMUs and we're not reading the
> > +              * leader's stats, determine the stats for the appropriate
> > +              * uncore PMU.
> > +              */
> > +             if (evsel && evsel->metric_leader &&
> > +                 evsel->pmu != evsel->metric_leader->pmu &&
> > +                 mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
> > +                     struct evsel *pos;
> > +
> > +                     evlist__for_each_entry(evsel->evlist, pos) {
> > +                             if (pos->pmu != evsel->pmu)
> > +                                     continue;
> > +                             if (pos->metric_leader != mexp->metric_events[i])
> > +                                     continue;
> > +                             ps = pos->stats;
> > +                             source_count = 1;
> >                               break;
> > -                     case TOOL_PMU__EVENT_SYSTEM_TIME:
> > -                             stats = &ru_stats.ru_stime_usec_stat;
> > -                             scale = 1e-6;
> > -                             break;
>
> I think this was broken as it seems to be converted to nsec in the
> update_rusage_stats().

The global utime/stime is in usecs as in the name. In the tool_pmu it
is converted to nanoseconds, for consistency with duration_time. In
the new code the counters are all in nanoseconds and so the same
scaling factor can be applied.

> > -                     case TOOL_PMU__EVENT_NONE:
> > -                             pr_err("Invalid tool event 'none'");
> > -                             abort();
> > -                     case TOOL_PMU__EVENT_MAX:
> > -                             pr_err("Invalid tool event 'max'");
> > -                             abort();
> > -                     case TOOL_PMU__EVENT_HAS_PMEM:
> > -                     case TOOL_PMU__EVENT_NUM_CORES:
> > -                     case TOOL_PMU__EVENT_NUM_CPUS:
> > -                     case TOOL_PMU__EVENT_NUM_CPUS_ONLINE:
> > -                     case TOOL_PMU__EVENT_NUM_DIES:
> > -                     case TOOL_PMU__EVENT_NUM_PACKAGES:
> > -                     case TOOL_PMU__EVENT_SLOTS:
> > -                     case TOOL_PMU__EVENT_SMT_ON:
> > -                     case TOOL_PMU__EVENT_SYSTEM_TSC_FREQ:
> > -                     case TOOL_PMU__EVENT_CORE_WIDE:
> > -                     case TOOL_PMU__EVENT_TARGET_CPU:
> > -                     default:
> > -                             pr_err("Unexpected tool event '%s'", evsel__name(metric_events[i]));
> > -                             abort();
> >                       }
> > -                     val = avg_stats(stats) * scale;
> > -                     source_count = 1;
> > -             } else {
> > -                     struct perf_stat_evsel *ps = metric_events[i]->stats;
> > -                     struct perf_stat_aggr *aggr;
> > -
> > +             }
> > +             /* Time events are always on CPU0, the first aggregation index. */
> > +             aggr = &ps->aggr[is_tool_time ? tool_aggr_idx : aggr_idx];
> > +             if (!aggr || !metric_events[i]->supported) {
> >                       /*
> > -                      * If there are multiple uncore PMUs and we're not
> > -                      * reading the leader's stats, determine the stats for
> > -                      * the appropriate uncore PMU.
> > +                      * Not supported events will have a count of 0, which
> > +                      * can be confusing in a metric. Explicitly set the
> > +                      * value to NAN. Not counted events (enable time of 0)
> > +                      * are read as 0.
> >                        */
> > -                     if (evsel && evsel->metric_leader &&
> > -                         evsel->pmu != evsel->metric_leader->pmu &&
> > -                         mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
> > -                             struct evsel *pos;
> > -
> > -                             evlist__for_each_entry(evsel->evlist, pos) {
> > -                                     if (pos->pmu != evsel->pmu)
> > -                                             continue;
> > -                                     if (pos->metric_leader != mexp->metric_events[i])
> > -                                             continue;
> > -                                     ps = pos->stats;
> > -                                     source_count = 1;
> > -                                     break;
> > -                             }
> > -                     }
> > -                     aggr = &ps->aggr[aggr_idx];
> > -                     if (!aggr)
> > -                             break;
> > -
> > -                     if (!metric_events[i]->supported) {
> > -                             /*
> > -                              * Not supported events will have a count of 0,
> > -                              * which can be confusing in a
> > -                              * metric. Explicitly set the value to NAN. Not
> > -                              * counted events (enable time of 0) are read as
> > -                              * 0.
> > -                              */
> > -                             val = NAN;
> > -                             source_count = 0;
> > -                     } else {
> > -                             val = aggr->counts.val;
> > -                             if (!source_count)
> > -                                     source_count = evsel__source_count(metric_events[i]);
> > -                     }
> > +                     val = NAN;
> > +                     source_count = 0;
> > +             } else {
> > +                     val = aggr->counts.val;
> > +                     if (is_tool_time)
> > +                             val *= 1e-9; /* Convert time event nanoseconds to seconds. */
>
> And this code treats all time events are in nsec now.

That's correct: user_time, system_time and duration_time are all
counters that give a count in nanoseconds. However, in the metrics, I
guess because the counts are doubles, the time has always been
expressed in seconds.
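
As an illustrative conversion with made-up numbers: a duration_time
count of 2,500,000,000 ns multiplied by the 1e-9 scale gives 2.5 s for
use in the metric.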

Thanks,
Ian

> Thanks,
> Namhyung
>
>
> > +                     if (!source_count)
> > +                             source_count = evsel__source_count(metric_events[i]);
> >               }
> >               n = strdup(evsel__metric_id(metric_events[i]));
> >               if (!n)
> > @@ -168,7 +151,7 @@ static void generic_metric(struct perf_stat_config *config,
> >               pctx->sctx.user_requested_cpu_list = strdup(config->user_requested_cpu_list);
> >       pctx->sctx.runtime = runtime;
> >       pctx->sctx.system_wide = config->system_wide;
> > -     i = prepare_metric(mexp, evsel, pctx, aggr_idx);
> > +     i = prepare_metric(config, mexp, evsel, pctx, aggr_idx);
> >       if (i < 0) {
> >               expr__ctx_free(pctx);
> >               return;
> > @@ -229,7 +212,7 @@ double test_generic_metric(struct metric_expr *mexp, int aggr_idx)
> >       if (!pctx)
> >               return NAN;
> >
> > -     if (prepare_metric(mexp, /*evsel=*/NULL, pctx, aggr_idx) < 0)
> > +     if (prepare_metric(/*config=*/NULL, mexp, /*evsel=*/NULL, pctx, aggr_idx) < 0)
> >               goto out;
> >
> >       if (expr__parse(&ratio, pctx, mexp->metric_expr))
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 09/10] perf stat: Read tool events last
  2025-11-18  2:35   ` Namhyung Kim
@ 2025-11-18  4:38     ` Ian Rogers
  2025-11-18  6:46       ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-11-18  4:38 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Mon, Nov 17, 2025 at 6:35 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Thu, Nov 13, 2025 at 10:05:15AM -0800, Ian Rogers wrote:
> > When reading a metric like memory bandwidth on multiple sockets, the
> > additional sockets will be on CPUS > 0. Because of the affinity
> > reading, the counters are read on CPU 0 along with the time, then the
> > later sockets are read. This can lead to the later sockets having a
> > bandwidth larger than is possible for the period of time. To avoid
> > this moving the reading of tool events to occur after all other events
> > are read.
>
> Can you move this change before the affinity updates?  I think it's
> straight-forward and can be applied independently.

It is straightforward but will require changes to
read_counters_with_affinity. I can do a v5 once I know what to do with
the other patches.

Thanks,
Ian

> Thanks,
> Namhyung
>
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/builtin-stat.c | 29 ++++++++++++++++++++++++++++-
> >  tools/perf/util/evlist.c  |  4 ----
> >  2 files changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > index 947f11b8b106..aec93b91fd11 100644
> > --- a/tools/perf/builtin-stat.c
> > +++ b/tools/perf/builtin-stat.c
> > @@ -379,6 +379,9 @@ static int read_counters_with_affinity(void)
> >               if (evsel__is_bpf(counter))
> >                       continue;
> >
> > +             if (evsel__is_tool(counter))
> > +                     continue;
> > +
> >               if (!counter->err)
> >                       counter->err = read_counter_cpu(counter, evlist_cpu_itr.cpu_map_idx);
> >       }
> > @@ -402,6 +405,24 @@ static int read_bpf_map_counters(void)
> >       return 0;
> >  }
> >
> > +static int read_tool_counters(void)
> > +{
> > +     struct evsel *counter;
> > +
> > +     evlist__for_each_entry(evsel_list, counter) {
> > +             int idx;
> > +
> > +             if (!evsel__is_tool(counter))
> > +                     continue;
> > +
> > +             perf_cpu_map__for_each_idx(idx, counter->core.cpus) {
> > +                     if (!counter->err)
> > +                             counter->err = read_counter_cpu(counter, idx);
> > +             }
> > +     }
> > +     return 0;
> > +}
> > +
> >  static int read_counters(void)
> >  {
> >       int ret;
> > @@ -415,7 +436,13 @@ static int read_counters(void)
> >               return ret;
> >
> >       // Read non-BPF and non-tool counters next.
> > -     return read_counters_with_affinity();
> > +     ret = read_counters_with_affinity();
> > +     if (ret)
> > +             return ret;
> > +
> > +     // Read the tool counters last. This way the duration_time counter
> > +     // should always be greater than any other counter's enabled time.
> > +     return read_tool_counters();
> >  }
> >
> >  static void process_counters(void)
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index b6df81b8a236..fc3dae7cdfca 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -368,10 +368,6 @@ static bool evlist__use_affinity(struct evlist *evlist)
> >       struct perf_cpu_map *used_cpus = NULL;
> >       bool ret = false;
> >
> > -     /*
> > -      * With perf record core.user_requested_cpus is usually NULL.
> > -      * Use the old method to handle this for now.
> > -      */
> >       if (!evlist->core.user_requested_cpus ||
> >           cpu_map__is_dummy(evlist->core.user_requested_cpus))
> >               return false;
> > --
> > 2.51.2.1041.gc1ab5b90ca-goog
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 04/10] perf stat-shadow: Read tool events directly
  2025-11-18  4:36     ` Ian Rogers
@ 2025-11-18  6:45       ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  6:45 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Mon, Nov 17, 2025 at 08:36:34PM -0800, Ian Rogers wrote:
> On Mon, Nov 17, 2025 at 6:30 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Thu, Nov 13, 2025 at 10:05:10AM -0800, Ian Rogers wrote:
> > > When reading time values for metrics don't use the globals updated in
> > > builtin-stat, just read the events as regular events. The only
> > > exception is for time events where nanoseconds need converting to
> > > seconds as metrics assume time metrics are in seconds.
> > >
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> > > ---
[SNIP]
> > > -                     case TOOL_PMU__EVENT_SYSTEM_TIME:
> > > -                             stats = &ru_stats.ru_stime_usec_stat;
> > > -                             scale = 1e-6;
> > > -                             break;
> >
> > I think this was broken as it seems to be converted to nsec in the
> > update_rusage_stats().
> 
> The global utime/stime is in usecs as in the name. In the tool_pmu it
> is converted to nanoseconds, for consistency with duration_time. In
> the new code the counters are all in nanoseconds and so the same
> scaling factor can be applied.

What do you mean by global utime/stime?

> 
> > > -                     case TOOL_PMU__EVENT_NONE:
> > > -                             pr_err("Invalid tool event 'none'");
> > > -                             abort();
> > > -                     case TOOL_PMU__EVENT_MAX:
> > > -                             pr_err("Invalid tool event 'max'");
> > > -                             abort();
> > > -                     case TOOL_PMU__EVENT_HAS_PMEM:
> > > -                     case TOOL_PMU__EVENT_NUM_CORES:
> > > -                     case TOOL_PMU__EVENT_NUM_CPUS:
> > > -                     case TOOL_PMU__EVENT_NUM_CPUS_ONLINE:
> > > -                     case TOOL_PMU__EVENT_NUM_DIES:
> > > -                     case TOOL_PMU__EVENT_NUM_PACKAGES:
> > > -                     case TOOL_PMU__EVENT_SLOTS:
> > > -                     case TOOL_PMU__EVENT_SMT_ON:
> > > -                     case TOOL_PMU__EVENT_SYSTEM_TSC_FREQ:
> > > -                     case TOOL_PMU__EVENT_CORE_WIDE:
> > > -                     case TOOL_PMU__EVENT_TARGET_CPU:
> > > -                     default:
> > > -                             pr_err("Unexpected tool event '%s'", evsel__name(metric_events[i]));
> > > -                             abort();
> > >                       }
> > > -                     val = avg_stats(stats) * scale;
> > > -                     source_count = 1;
> > > -             } else {
> > > -                     struct perf_stat_evsel *ps = metric_events[i]->stats;
> > > -                     struct perf_stat_aggr *aggr;
> > > -
> > > +             }
> > > +             /* Time events are always on CPU0, the first aggregation index. */
> > > +             aggr = &ps->aggr[is_tool_time ? tool_aggr_idx : aggr_idx];
> > > +             if (!aggr || !metric_events[i]->supported) {
> > >                       /*
> > > -                      * If there are multiple uncore PMUs and we're not
> > > -                      * reading the leader's stats, determine the stats for
> > > -                      * the appropriate uncore PMU.
> > > +                      * Not supported events will have a count of 0, which
> > > +                      * can be confusing in a metric. Explicitly set the
> > > +                      * value to NAN. Not counted events (enable time of 0)
> > > +                      * are read as 0.
> > >                        */
> > > -                     if (evsel && evsel->metric_leader &&
> > > -                         evsel->pmu != evsel->metric_leader->pmu &&
> > > -                         mexp->metric_events[i]->pmu == evsel->metric_leader->pmu) {
> > > -                             struct evsel *pos;
> > > -
> > > -                             evlist__for_each_entry(evsel->evlist, pos) {
> > > -                                     if (pos->pmu != evsel->pmu)
> > > -                                             continue;
> > > -                                     if (pos->metric_leader != mexp->metric_events[i])
> > > -                                             continue;
> > > -                                     ps = pos->stats;
> > > -                                     source_count = 1;
> > > -                                     break;
> > > -                             }
> > > -                     }
> > > -                     aggr = &ps->aggr[aggr_idx];
> > > -                     if (!aggr)
> > > -                             break;
> > > -
> > > -                     if (!metric_events[i]->supported) {
> > > -                             /*
> > > -                              * Not supported events will have a count of 0,
> > > -                              * which can be confusing in a
> > > -                              * metric. Explicitly set the value to NAN. Not
> > > -                              * counted events (enable time of 0) are read as
> > > -                              * 0.
> > > -                              */
> > > -                             val = NAN;
> > > -                             source_count = 0;
> > > -                     } else {
> > > -                             val = aggr->counts.val;
> > > -                             if (!source_count)
> > > -                                     source_count = evsel__source_count(metric_events[i]);
> > > -                     }
> > > +                     val = NAN;
> > > +                     source_count = 0;
> > > +             } else {
> > > +                     val = aggr->counts.val;
> > > +                     if (is_tool_time)
> > > +                             val *= 1e-9; /* Convert time event nanoseconds to seconds. */
> >
> > And this code treats all time events as being in nsec now.
> 
> That's correct: user_time, system_time and duration_time are all
> counters that give a count in nanoseconds. However, in the metrics, I
> guess because the counts are doubles, the time has always been
> expressed in seconds.

Yep, I think it's ok.
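
As a minimal illustration of the unit handling (the helper name is
hypothetical, not the perf code):

	#include <stdint.h>

	/* Tool time events (user_time, system_time, duration_time)
	 * count in nanoseconds; JSON metrics expect seconds, hence the
	 * single 1e-9 scale in the hunk above. */
	static double tool_time_to_metric_val(uint64_t count_ns)
	{
		return (double)count_ns * 1e-9;	/* 2500000000 -> 2.5 */
	}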

Thanks,
Namhyung


* Re: [PATCH v4 09/10] perf stat: Read tool events last
  2025-11-18  4:38     ` Ian Rogers
@ 2025-11-18  6:46       ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  6:46 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Mon, Nov 17, 2025 at 08:38:15PM -0800, Ian Rogers wrote:
> On Mon, Nov 17, 2025 at 6:35 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Thu, Nov 13, 2025 at 10:05:15AM -0800, Ian Rogers wrote:
> > > When reading a metric like memory bandwidth on multiple sockets, the
> > > additional sockets will be on CPUs > 0. Because the reads iterate
> > > CPUs with affinity, the counters on CPU 0 are read along with the
> > > time, and the later sockets are read afterwards. This can lead to the
> > > later sockets showing a bandwidth larger than is possible for the
> > > period of time. To avoid this, move the reading of tool events to
> > > after all other events are read.
> >
> > Can you move this change before the affinity updates?  I think it's
> > straightforward and can be applied independently.
> 
> It is straightforward but will require changes to
> read_counters_with_affinity. I can do a v5 once I know what to do with
> the other patches.

I plan to merge patches 1 to 7 first.  And I think it'd be good to
have this one as well.
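
As a toy illustration of the skew this patch removes (the numbers are
made up, this is not perf code):

	#include <stdio.h>

	int main(void)
	{
		/* duration_time is read on CPU 0 at the start of the
		 * pass; a second socket's counters are read ~10ms later. */
		double reported_interval_s = 1.0;
		double actual_window_s = 1.01;
		double link_peak_gbs = 20.0;	/* saturated the whole time */
		double bytes = link_peak_gbs * 1e9 * actual_window_s;

		/* Prints 20.20 GB/s, above the 20 GB/s theoretical peak. */
		printf("reported: %.2f GB/s\n",
		       bytes / reported_interval_s / 1e9);
		return 0;
	}

Reading the tool events last makes the time divisor at least as large
as any counter's enabled time, so the ratio stays physical.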

Thanks,
Namhyung

> >
> > >
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> > > ---
> > >  tools/perf/builtin-stat.c | 29 ++++++++++++++++++++++++++++-
> > >  tools/perf/util/evlist.c  |  4 ----
> > >  2 files changed, 28 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > > index 947f11b8b106..aec93b91fd11 100644
> > > --- a/tools/perf/builtin-stat.c
> > > +++ b/tools/perf/builtin-stat.c
> > > @@ -379,6 +379,9 @@ static int read_counters_with_affinity(void)
> > >               if (evsel__is_bpf(counter))
> > >                       continue;
> > >
> > > +             if (evsel__is_tool(counter))
> > > +                     continue;
> > > +
> > >               if (!counter->err)
> > >                       counter->err = read_counter_cpu(counter, evlist_cpu_itr.cpu_map_idx);
> > >       }
> > > @@ -402,6 +405,24 @@ static int read_bpf_map_counters(void)
> > >       return 0;
> > >  }
> > >
> > > +static int read_tool_counters(void)
> > > +{
> > > +     struct evsel *counter;
> > > +
> > > +     evlist__for_each_entry(evsel_list, counter) {
> > > +             int idx;
> > > +
> > > +             if (!evsel__is_tool(counter))
> > > +                     continue;
> > > +
> > > +             perf_cpu_map__for_each_idx(idx, counter->core.cpus) {
> > > +                     if (!counter->err)
> > > +                             counter->err = read_counter_cpu(counter, idx);
> > > +             }
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > >  static int read_counters(void)
> > >  {
> > >       int ret;
> > > @@ -415,7 +436,13 @@ static int read_counters(void)
> > >               return ret;
> > >
> > >       // Read non-BPF and non-tool counters next.
> > > -     return read_counters_with_affinity();
> > > +     ret = read_counters_with_affinity();
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     // Read the tool counters last. This way the duration_time counter
> > > +     // should always be greater than any other counter's enabled time.
> > > +     return read_tool_counters();
> > >  }
> > >
> > >  static void process_counters(void)
> > > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > > index b6df81b8a236..fc3dae7cdfca 100644
> > > --- a/tools/perf/util/evlist.c
> > > +++ b/tools/perf/util/evlist.c
> > > @@ -368,10 +368,6 @@ static bool evlist__use_affinity(struct evlist *evlist)
> > >       struct perf_cpu_map *used_cpus = NULL;
> > >       bool ret = false;
> > >
> > > -     /*
> > > -      * With perf record core.user_requested_cpus is usually NULL.
> > > -      * Use the old method to handle this for now.
> > > -      */
> > >       if (!evlist->core.user_requested_cpus ||
> > >           cpu_map__is_dummy(evlist->core.user_requested_cpus))
> > >               return false;
> > > --
> > > 2.51.2.1041.gc1ab5b90ca-goog
> > >


* Re: [PATCH v4 10/10] perf stat: Add no-affinity flag
  2025-11-18  4:32     ` Ian Rogers
@ 2025-11-18  6:50       ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18  6:50 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Mon, Nov 17, 2025 at 08:32:20PM -0800, Ian Rogers wrote:
> On Mon, Nov 17, 2025 at 6:41 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Thu, Nov 13, 2025 at 10:05:16AM -0800, Ian Rogers wrote:
> > > Add a flag that disables the affinity behavior. Using
> > > sched_setaffinity to place a perf thread on a CPU can avoid certain
> > > interprocessor interrupts, but may introduce a delay due to the
> > > scheduling, particularly on loaded machines, so add a command line
> > > option to disable the behavior. The issue is less pronounced in
> > > other tools like `perf record`, which uses a ring buffer and doesn't
> > > make repeated system calls.
> > >
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> > > ---
> > >  tools/perf/Documentation/perf-stat.txt | 4 ++++
> > >  tools/perf/builtin-stat.c              | 6 ++++++
> > >  tools/perf/util/evlist.c               | 2 +-
> > >  tools/perf/util/evlist.h               | 1 +
> > >  4 files changed, 12 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
> > > index 1a766d4a2233..1ffb510606af 100644
> > > --- a/tools/perf/Documentation/perf-stat.txt
> > > +++ b/tools/perf/Documentation/perf-stat.txt
> > > @@ -382,6 +382,10 @@ color the metric's computed value.
> > >  Don't print output, warnings or messages. This is useful with perf stat
> > >  record below to only write data to the perf.data file.
> > >
> > > +--no-affinity::
> > > +Don't change scheduler affinities when iterating over CPUs. Disables
> > > +an optimization aimed at minimizing interprocessor interrupts.
> > > +
> > >  STAT RECORD
> > >  -----------
> > >  Stores stat data into perf data file.
> > > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > > index aec93b91fd11..fa42b08bd1df 100644
> > > --- a/tools/perf/builtin-stat.c
> > > +++ b/tools/perf/builtin-stat.c
> > > @@ -2415,6 +2415,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
> > >  int cmd_stat(int argc, const char **argv)
> > >  {
> > >       struct opt_aggr_mode opt_mode = {};
> > > +     bool no_affinity = false;
> > >       struct option stat_options[] = {
> > >               OPT_BOOLEAN('T', "transaction", &transaction_run,
> > >                       "hardware transaction statistics"),
> > > @@ -2543,6 +2544,8 @@ int cmd_stat(int argc, const char **argv)
> > >                       "don't print 'summary' for CSV summary output"),
> > >               OPT_BOOLEAN(0, "quiet", &quiet,
> > >                       "don't print any output, messages or warnings (useful with record)"),
> > > +             OPT_BOOLEAN(0, "no-affinity", &no_affinity,
> > > +                     "don't allow affinity optimizations aimed at reducing IPIs"),
> >
> > I know you want to add an option to disable the behavior, but I think
> > it'd be better to have a positive option like just '--affinity'.  Then
> > we will have '--no-affinity' for free. :)  The current form will allow
> > '--no-no-affinity'.
> >
> > Then the variable can also be 'enable_affinity' or so.
> >
> > You can mention --no-affinity in the help message and the man page
> > so that users can discover the intention.
> 
> I was trying to keep the code the same as the flag. I don't want to
> have an affinity flag you need to set to true in, say, evlist, as that
> would need updating in all users of the evlist or become a behavioral
> change. We could keep no_affinity in the evlist and invert the flag;
> it just looks awkward imo.

You can make the default true and just update the users that want
different behavior.
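
Roughly (untested, variable name just for illustration):

	bool affinity = true;	/* default on */
	...
	OPT_BOOLEAN(0, "affinity", &affinity,
		    "use affinity optimizations to reduce IPIs when "
		    "reading counters (--no-affinity to disable)"),

parse-options handles the "no-" prefix for boolean long options, so
only the positive form needs declaring.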

Thanks,
Namhyung


* Re: [PATCH v4 00/10] perf stat fixes and improvements
  2025-11-13 18:05 [PATCH v4 00/10] perf stat fixes and improvements Ian Rogers
                   ` (9 preceding siblings ...)
  2025-11-13 18:05 ` [PATCH v4 10/10] perf stat: Add no-affinity flag Ian Rogers
@ 2025-11-18 18:00 ` Namhyung Kim
  10 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-11-18 18:00 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Dr. David Alan Gilbert, Yang Li, James Clark, Thomas Falcon,
	Thomas Richter, linux-perf-users, linux-kernel, Andi Kleen,
	Dapeng Mi

On Thu, Nov 13, 2025 at 10:05:06AM -0800, Ian Rogers wrote:
> A collection of fixes aiming to stabilize and make more reasonable
> measurements/metrics such as memory bandwidth.
> 
> Tool events are changed from getting a PMU cpu map of all online CPUs
> to either CPU 0 or all online CPUs. This avoids iterating over useless
> CPUs for events in particular `duration_time`. Fix a bug where
> duration_time didn't correctly use the previous raw counts and would
> skip values in interval mode.
> 
> Change how json metrics handle tool events. Use the counter value
> rather than using shared state with perf stat. A later patch changes
> it so that tool events are read last, so that if reading say memory
> bandwidth counters you don't divide by an earlier read time and exceed
> the theoretical maximum memory bandwidth.

[...]

Applied the first 7 patches to perf-tools-next, thanks!

Best regards,
Namhyung

