linux-perf-users.vger.kernel.org archive mirror
* [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology
@ 2023-05-17 17:27 K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 1/5] perf: Extract building cache level for a CPU into separate function K Prateek Nayak
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-17 17:27 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy,
	eranian, irogers, puwen

The motivation behind this feature is to aggregate data at the LLC
level for chiplet-based processors, which currently do not expose the
chiplet details in the sysfs CPU topology information.

For completeness, the series adds the ability to aggregate data at any
cache level. Following is an example of the output on a dual-socket
Zen3 processor (2 x 64C/128T) with 8 chiplets per socket.

  $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
    taskset -c 0-15,64-79,128-143,192-207\
    perf bench sched messaging -p -t -l 100000 -g 8

    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver threads per group
    # 8 groups == 320 threads run
    
    Total time: 7.648 [sec]
    
    Performance counter stats for 'system wide':
    
    S0-D0-L3-ID0             16         17,145,912      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID8             16         14,977,628      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID16            16            262,539      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID24            16              3,140      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID32            16             27,403      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID40            16             17,026      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID48            16              7,292      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID56            16              2,464      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID64            16         22,489,306      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID72            16         21,455,257      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID80            16             11,619      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID88            16             30,978      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID96            16             37,628      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID104           16             13,594      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID112           16             10,164      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID120           16             11,259      ls_dmnd_fills_from_sys.ext_cache_remote
    
          7.779171484 seconds time elapsed

The series also adds support for perf stat record and perf stat report
to aggregate data at various cache levels. Following is an example of
recording with aggregation at the L2 level and reporting the same data
with aggregation at the L3 level.

  $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
    taskset -c 0-15,64-79,128-143,192-207\
    perf bench sched messaging -p -t -l 100000 -g 8
  
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver threads per group
    # 8 groups == 320 threads run
    
    Total time: 7.318 [sec]
    
    Performance counter stats for 'system wide':
    
    S0-D0-L2-ID0              2          2,171,980      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID1              2          2,048,494      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID2              2          2,120,293      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID3              2          2,224,725      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID4              2          2,021,618      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID5              2          1,995,331      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID6              2          2,163,029      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID7              2          2,104,623      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L2-ID8              2          1,948,776      ls_dmnd_fills_from_sys.ext_cache_remote
    ...
    S0-D0-L2-ID63             2              2,648      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID64             2          2,963,323      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID65             2          2,856,629      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID66             2          2,901,725      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID67             2          3,046,120      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID68             2          2,637,971      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID69             2          2,680,029      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID70             2          2,672,259      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID71             2          2,638,768      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID72             2          3,308,642      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID73             2          3,064,473      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID74             2          3,023,379      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID75             2          2,975,119      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID76             2          2,952,677      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID77             2          2,981,695      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID78             2          3,455,916      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID79             2          2,959,540      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L2-ID80             2              4,977      ls_dmnd_fills_from_sys.ext_cache_remote
    ...
    S1-D1-L2-ID127            2              3,359      ls_dmnd_fills_from_sys.ext_cache_remote
    
          7.451725897 seconds time elapsed

  $ sudo perf stat report --per-cache=L3

    Performance counter stats for '...':

    S0-D0-L3-ID0             16         16,850,093      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID8             16         16,001,493      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID16            16            301,011      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID24            16             26,276      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID32            16             48,958      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID40            16             43,799      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID48            16             16,771      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID56            16             12,544      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID64            16         22,396,824      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID72            16         24,721,441      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID80            16             29,426      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID88            16             54,348      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID96            16             41,557      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID104           16             10,084      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID112           16             14,361      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID120           16             24,446      ls_dmnd_fills_from_sys.ext_cache_remote
    
           7.451725897 seconds time elapsed

The aggregate at S0-D0-L3-ID0 is the sum of S0-D0-L2-ID0 to S0-D0-L2-ID7,
as the L3 containing CPU0 spans the L2 instances of CPU0 to CPU7. For
example, the eight S0-D0-L2 counts in the record output above
(2,171,980 + 2,048,494 + 2,120,293 + 2,224,725 + 2,021,618 + 1,995,331 +
2,163,029 + 2,104,623) sum to exactly 16,850,093, the S0-D0-L3-ID0 value
shown by perf stat report.

Cache IDs are derived from the shared_cpu_list file in the cache
topology; a minimal sketch of this derivation follows the example
below. This allows --per-cache aggregation of data on a kernel which
does not expose the cache instance ID in sysfs. Running perf stat gives
the following output on the same system with the cache instance ID
hidden:

  $ ls /sys/devices/system/cpu/cpu0/cache/index0/

    coherency_line_size  level  number_of_sets  physical_line_partition
    shared_cpu_list  shared_cpu_map  size  type  uevent
    ways_of_associativity

  $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
    taskset -c 0-15,64-79,128-143,192-207\
    perf bench sched messaging -p -t -l 100000 -g 8

    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver threads per group
    # 8 groups == 320 threads run

         Total time: 6.949 [sec]

     Performance counter stats for 'system wide':

    S0-D0-L3-ID0             16          5,297,615      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID8             16          4,347,868      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID16            16            416,593      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID24            16              4,346      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID32            16              5,506      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID40            16             15,845      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID48            16             24,164      ls_dmnd_fills_from_sys.ext_cache_remote
    S0-D0-L3-ID56            16              4,543      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID64            16         41,610,374      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID72            16         38,393,688      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID80            16             22,188      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID88            16             22,918      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID96            16             39,230      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID104           16              6,236      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID112           16             66,846      ls_dmnd_fills_from_sys.ext_cache_remote
    S1-D1-L3-ID120           16             72,713      ls_dmnd_fills_from_sys.ext_cache_remote

           7.098471410 seconds time elapsed
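
The derivation itself is simple: the instance ID is the first CPU
listed in the cache domain's shared_cpu_list. Below is a minimal
standalone sketch of the idea (not the patch's implementation; the
helper name and the hard-coded index3 path are illustrative only):

  #include <stdio.h>
  #include <stdlib.h>

  /*
   * The shared_cpu_list string begins with the lowest-numbered CPU of
   * the domain, e.g. "0-7,128-135", so parsing the leading integer is
   * enough to derive the instance ID.
   */
  static int cache_id_from_shared_cpu_list(const char *list)
  {
          return atoi(list);
  }

  int main(void)
  {
          char list[256];
          FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list", "r");

          if (!f)
                  return 1;
          if (!fgets(list, sizeof(list), f)) {
                  fclose(f);
                  return 1;
          }
          fclose(f);
          printf("cache instance ID: %d\n", cache_id_from_shared_cpu_list(list));
          return 0;
  }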

A few notes:

- This series makes a breaking change when saving the aggregation
  details, as the cache level needs to be saved along with the
  aggregation method.

- This series assumes that caches at the same level are shared by the
  same set of threads. The implementation will run into an issue if,
  say, L1i is thread-local but L1d is shared by the SMT siblings on
  the core.

This series cleanly applies on top of the perf-tools branch from Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at commit 760ebc45746b ("perf lock contention: Add empty 'struct rq' to
satisfy libbpf 'runqueue' type verification")
---
Changelog:
o v3->v4:
  - Dropped the RFC tag.
  - Broke down Patch 2 from v3 into smaller patches (kind of!)
  - Fixed a couple of errors in docs and comments.

o v2->v3:
  - Dropped patches 1 and 2 that saved and retrieved the cache instance
    ID when saving the cache data.
  - The above is unnecessary as the IDs are being derived from the first
    online CPU in the cache domain for a given cache instance.
  - Improvements to handling cases where a cache level is not present
    but the level is allowed by MAX_CACHE_LVL.
  - Updated details in cover letter.

o v1->v2:
  - Set cache instance ID to 0 if the file cannot be read.
  - Fix cache level parsing function.
  - Updated details in cover letter.
---
K Prateek Nayak (5):
  perf: Extract building cache level for a CPU into separate function
  perf stat: Setup the foundation to allow aggregation based on cache
    topology
  perf stat: Save cache level information when running perf stat record
  perf stat: Add "--per-cache" aggregation option and document the same
  perf stat: Add tests for the "--per-cache" option

 tools/lib/perf/include/perf/cpumap.h          |   5 +
 tools/lib/perf/include/perf/event.h           |   3 +-
 tools/perf/Documentation/perf-stat.txt        |  16 ++
 tools/perf/builtin-stat.c                     | 144 +++++++++++++++++-
 .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
 tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
 tools/perf/tests/shell/stat+json_output.sh    |  13 ++
 tools/perf/util/cpumap.c                      | 119 +++++++++++++++
 tools/perf/util/cpumap.h                      |  28 ++++
 tools/perf/util/event.c                       |   7 +-
 tools/perf/util/header.c                      |  62 +++++---
 tools/perf/util/header.h                      |   4 +
 tools/perf/util/stat-display.c                |  17 +++
 tools/perf/util/stat.h                        |   2 +
 tools/perf/util/synthetic-events.c            |   1 +
 15 files changed, 409 insertions(+), 30 deletions(-)

-- 
2.34.1



* [PATCH v4 1/5] perf: Extract building cache level for a CPU into separate function
  2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
@ 2023-05-17 17:27 ` K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology K Prateek Nayak
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-17 17:27 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy,
	eranian, irogers, puwen

build_caches() builds the complete cache topology of the system by
iterating over all CPUs, building and comparing the cache levels of
each CPU, and keeping only the unique ones at the end.

Extract the unit that builds the cache levels for a single CPU into a
separate function. Expose this function, and the MAX_CACHE_LVL value,
for use elsewhere in perf.
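
As a usage sketch (a hypothetical caller using perf's internal types,
not part of this patch), collecting the cache levels seen by CPU0 with
the newly exposed helper looks like:

  struct cpu_cache_level caches[MAX_CACHE_LVL];
  u32 i, cnt = 0;

  if (!build_caches_for_cpu(0, caches, &cnt)) {
          /* caches[0..cnt-1] now describe CPU0's cache levels */
          for (i = 0; i < cnt; i++)
                  cpu_cache_level__free(&caches[i]);
  }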

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog:
o v3->v4:
  - No changes
---
 tools/perf/util/header.c | 62 +++++++++++++++++++++++++---------------
 tools/perf/util/header.h |  4 +++
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 276870221ce0..560871736764 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1213,38 +1213,54 @@ static void cpu_cache_level__fprintf(FILE *out, struct cpu_cache_level *c)
 	fprintf(out, "L%d %-15s %8s [%s]\n", c->level, c->type, c->size, c->map);
 }
 
-#define MAX_CACHE_LVL 4
-
-static int build_caches(struct cpu_cache_level caches[], u32 *cntp)
+/*
+ * Build cache levels for a particular CPU from the data in
+ * /sys/devices/system/cpu/cpu<cpu>/cache/
+ * The cache level data is stored in caches[] starting at index
+ * *cntp.
+ */
+int build_caches_for_cpu(u32 cpu, struct cpu_cache_level caches[], u32 *cntp)
 {
-	u32 i, cnt = 0;
-	u32 nr, cpu;
 	u16 level;
 
-	nr = cpu__max_cpu().cpu;
+	for (level = 0; level < MAX_CACHE_LVL; level++) {
+		struct cpu_cache_level c;
+		int err;
+		u32 i;
 
-	for (cpu = 0; cpu < nr; cpu++) {
-		for (level = 0; level < MAX_CACHE_LVL; level++) {
-			struct cpu_cache_level c;
-			int err;
+		err = cpu_cache_level__read(&c, cpu, level);
+		if (err < 0)
+			return err;
 
-			err = cpu_cache_level__read(&c, cpu, level);
-			if (err < 0)
-				return err;
+		if (err == 1)
+			break;
 
-			if (err == 1)
+		for (i = 0; i < *cntp; i++) {
+			if (cpu_cache_level__cmp(&c, &caches[i]))
 				break;
+		}
 
-			for (i = 0; i < cnt; i++) {
-				if (cpu_cache_level__cmp(&c, &caches[i]))
-					break;
-			}
+		if (i == *cntp) {
+			caches[*cntp] = c;
+			*cntp = *cntp + 1;
+		} else
+			cpu_cache_level__free(&c);
+	}
 
-			if (i == cnt)
-				caches[cnt++] = c;
-			else
-				cpu_cache_level__free(&c);
-		}
+	return 0;
+}
+
+static int build_caches(struct cpu_cache_level caches[], u32 *cntp)
+{
+	u32 nr, cpu, cnt = 0;
+
+	nr = cpu__max_cpu().cpu;
+
+	for (cpu = 0; cpu < nr; cpu++) {
+		int ret = build_caches_for_cpu(cpu, caches, &cnt);
+
+		if (ret)
+			return ret;
 	}
 	*cntp = cnt;
 	return 0;
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index 59eeb4a32ac5..7c16a250e738 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -179,7 +179,11 @@ int do_write(struct feat_fd *fd, const void *buf, size_t size);
 int write_padded(struct feat_fd *fd, const void *bf,
 		 size_t count, size_t count_aligned);
 
+#define MAX_CACHE_LVL 4
+
 int is_cpu_online(unsigned int cpu);
+int build_caches_for_cpu(u32 cpu, struct cpu_cache_level caches[], u32 *cntp);
+
 /*
  * arch specific callback
  */
-- 
2.25.1



* [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology
  2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 1/5] perf: Extract building cache level for a CPU into separate function K Prateek Nayak
@ 2023-05-17 17:27 ` K Prateek Nayak
  2023-05-23 19:12   ` Arnaldo Carvalho de Melo
  2023-05-17 17:27 ` [PATCH v4 3/5] perf stat: Save cache level information when running perf stat record K Prateek Nayak
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-17 17:27 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy,
	eranian, irogers, puwen

Processors based on chiplet architectures, such as AMD EPYC and Hygon,
do not expose the chiplet details in the sysfs CPU topology
information. However, this information can be derived from the per-CPU
cache level information in sysfs.

perf stat already supports aggregation based on topology information
using the core ID, socket ID, etc. It is useful to also aggregate based
on the cache topology to detect problems like imbalance and
cache-to-cache sharing at various cache levels.

This patch lays the foundation for aggregating data in perf stat based
on the processor's cache topology. The cmdline option to aggregate data
based on the cache topology is added in Patch 4 of the series while this
patch sets up all the necessary functions and variables required to
support the new aggregation option.

The patch also adds support to display per-cache aggregation, or to
save it as JSON or CSV, as splitting that into a separate patch would
break builds when compiling with "-Werror=switch-enum", where the
compiler complains about the lack of handling for the AGGR_CACHE case
in the output functions.
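
As an illustration (the values are illustrative, not taken from the
patch), a CPU in the first L3 domain of socket 0 would aggregate under
an ID built roughly like this, which the stat-display code renders as
"S0-D0-L3-ID0":

  struct aggr_cpu_id id = aggr_cpu_id__empty();

  id.socket    = 0;
  id.die       = 0;
  id.cache_lvl = 3;    /* aggregating at L3 */
  id.cache     = 0;    /* first CPU in this L3's shared_cpu_list */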

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog:
o v3->v4:
  - Some parts of the previous Patch 2 have been put into subsequent
    smaller patches (while being careful not to introduce any build
    errors in case someone were to bisect through the series)
  - Fixed comments.
---
 tools/lib/perf/include/perf/cpumap.h |   5 ++
 tools/perf/builtin-stat.c            |  88 +++++++++++++++++++-
 tools/perf/util/cpumap.c             | 119 +++++++++++++++++++++++++++
 tools/perf/util/cpumap.h             |  28 +++++++
 tools/perf/util/stat-display.c       |  17 ++++
 tools/perf/util/stat.h               |   2 +
 6 files changed, 257 insertions(+), 2 deletions(-)

diff --git a/tools/lib/perf/include/perf/cpumap.h b/tools/lib/perf/include/perf/cpumap.h
index 3f43f770cdac..8724dde79342 100644
--- a/tools/lib/perf/include/perf/cpumap.h
+++ b/tools/lib/perf/include/perf/cpumap.h
@@ -11,6 +11,11 @@ struct perf_cpu {
 	int cpu;
 };
 
+struct perf_cache {
+	int cache_lvl;
+	int cache;
+};
+
 struct perf_cpu_map;
 
 LIBPERF_API struct perf_cpu_map *perf_cpu_map__dummy_new(void);
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index b9ad32f21e57..7923940edef7 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -138,6 +138,7 @@ struct perf_stat {
 	struct perf_cpu_map	*cpus;
 	struct perf_thread_map *threads;
 	enum aggr_mode		 aggr_mode;
+	u32			 aggr_level;
 };
 
 static struct perf_stat		perf_stat;
@@ -145,8 +146,9 @@ static struct perf_stat		perf_stat;
 
 static volatile sig_atomic_t done = 0;
 
-static struct perf_stat_config stat_config = {
+struct perf_stat_config stat_config = {
 	.aggr_mode		= AGGR_GLOBAL,
+	.aggr_level		= MAX_CACHE_LVL + 1,
 	.scale			= true,
 	.unit_width		= 4, /* strlen("unit") */
 	.run_count		= 1,
@@ -1245,6 +1247,7 @@ static struct option stat_options[] = {
 
 static const char *const aggr_mode__string[] = {
 	[AGGR_CORE] = "core",
+	[AGGR_CACHE] = "cache",
 	[AGGR_DIE] = "die",
 	[AGGR_GLOBAL] = "global",
 	[AGGR_NODE] = "node",
@@ -1266,6 +1269,12 @@ static struct aggr_cpu_id perf_stat__get_die(struct perf_stat_config *config __m
 	return aggr_cpu_id__die(cpu, /*data=*/NULL);
 }
 
+static struct aggr_cpu_id perf_stat__get_cache_id(struct perf_stat_config *config __maybe_unused,
+						  struct perf_cpu cpu)
+{
+	return aggr_cpu_id__cache(cpu, /*data=*/NULL);
+}
+
 static struct aggr_cpu_id perf_stat__get_core(struct perf_stat_config *config __maybe_unused,
 					      struct perf_cpu cpu)
 {
@@ -1318,6 +1327,12 @@ static struct aggr_cpu_id perf_stat__get_die_cached(struct perf_stat_config *con
 	return perf_stat__get_aggr(config, perf_stat__get_die, cpu);
 }
 
+static struct aggr_cpu_id perf_stat__get_cache_id_cached(struct perf_stat_config *config,
+							 struct perf_cpu cpu)
+{
+	return perf_stat__get_aggr(config, perf_stat__get_cache_id, cpu);
+}
+
 static struct aggr_cpu_id perf_stat__get_core_cached(struct perf_stat_config *config,
 						     struct perf_cpu cpu)
 {
@@ -1349,6 +1364,8 @@ static aggr_cpu_id_get_t aggr_mode__get_aggr(enum aggr_mode aggr_mode)
 		return aggr_cpu_id__socket;
 	case AGGR_DIE:
 		return aggr_cpu_id__die;
+	case AGGR_CACHE:
+		return aggr_cpu_id__cache;
 	case AGGR_CORE:
 		return aggr_cpu_id__core;
 	case AGGR_NODE:
@@ -1372,6 +1389,8 @@ static aggr_get_id_t aggr_mode__get_id(enum aggr_mode aggr_mode)
 		return perf_stat__get_socket_cached;
 	case AGGR_DIE:
 		return perf_stat__get_die_cached;
+	case AGGR_CACHE:
+		return perf_stat__get_cache_id_cached;
 	case AGGR_CORE:
 		return perf_stat__get_core_cached;
 	case AGGR_NODE:
@@ -1484,6 +1503,60 @@ static struct aggr_cpu_id perf_env__get_die_aggr_by_cpu(struct perf_cpu cpu, voi
 	return id;
 }
 
+static void perf_env__get_cache_id_for_cpu(struct perf_cpu cpu, struct perf_env *env,
+					   u32 cache_level, struct aggr_cpu_id *id)
+{
+	int i;
+	int caches_cnt = env->caches_cnt;
+	struct cpu_cache_level *caches = env->caches;
+
+	id->cache_lvl = (cache_level > MAX_CACHE_LVL) ? 0 : cache_level;
+	id->cache = -1;
+
+	if (!caches_cnt)
+		return;
+
+	for (i = caches_cnt - 1; i > -1; --i) {
+		struct perf_cpu_map *cpu_map;
+		int map_contains_cpu;
+
+		/*
+		 * If the user has not specified a level, find the first level with
+		 * the cpu in the map. Since building the map is expensive, do
+		 * this only if levels match.
+		 */
+		if (cache_level <= MAX_CACHE_LVL && caches[i].level != cache_level)
+			continue;
+
+		cpu_map = perf_cpu_map__new(caches[i].map);
+		map_contains_cpu = perf_cpu_map__idx(cpu_map, cpu);
+		perf_cpu_map__put(cpu_map);
+
+		if (map_contains_cpu != -1) {
+			id->cache_lvl = caches[i].level;
+			id->cache = cpu__get_cache_id_from_map(cpu, caches[i].map);
+			return;
+		}
+	}
+}
+
+static struct aggr_cpu_id perf_env__get_cache_aggr_by_cpu(struct perf_cpu cpu,
+							  void *data)
+{
+	struct perf_env *env = data;
+	struct aggr_cpu_id id = aggr_cpu_id__empty();
+
+	if (cpu.cpu != -1) {
+		u32 cache_level = (perf_stat.aggr_level) ?: stat_config.aggr_level;
+
+		id.socket = env->cpu[cpu.cpu].socket_id;
+		id.die = env->cpu[cpu.cpu].die_id;
+		perf_env__get_cache_id_for_cpu(cpu, env, cache_level, &id);
+	}
+
+	return id;
+}
+
 static struct aggr_cpu_id perf_env__get_core_aggr_by_cpu(struct perf_cpu cpu, void *data)
 {
 	struct perf_env *env = data;
@@ -1552,6 +1625,12 @@ static struct aggr_cpu_id perf_stat__get_die_file(struct perf_stat_config *confi
 	return perf_env__get_die_aggr_by_cpu(cpu, &perf_stat.session->header.env);
 }
 
+static struct aggr_cpu_id perf_stat__get_cache_file(struct perf_stat_config *config __maybe_unused,
+						    struct perf_cpu cpu)
+{
+	return perf_env__get_cache_aggr_by_cpu(cpu, &perf_stat.session->header.env);
+}
+
 static struct aggr_cpu_id perf_stat__get_core_file(struct perf_stat_config *config __maybe_unused,
 						   struct perf_cpu cpu)
 {
@@ -1583,6 +1662,8 @@ static aggr_cpu_id_get_t aggr_mode__get_aggr_file(enum aggr_mode aggr_mode)
 		return perf_env__get_socket_aggr_by_cpu;
 	case AGGR_DIE:
 		return perf_env__get_die_aggr_by_cpu;
+	case AGGR_CACHE:
+		return perf_env__get_cache_aggr_by_cpu;
 	case AGGR_CORE:
 		return perf_env__get_core_aggr_by_cpu;
 	case AGGR_NODE:
@@ -1606,6 +1687,8 @@ static aggr_get_id_t aggr_mode__get_id_file(enum aggr_mode aggr_mode)
 		return perf_stat__get_socket_file;
 	case AGGR_DIE:
 		return perf_stat__get_die_file;
+	case AGGR_CACHE:
+		return perf_stat__get_cache_file;
 	case AGGR_CORE:
 		return perf_stat__get_core_file;
 	case AGGR_NODE:
@@ -2124,7 +2207,8 @@ static struct perf_stat perf_stat = {
 		.stat		= perf_event__process_stat_event,
 		.stat_round	= process_stat_round_event,
 	},
-	.aggr_mode = AGGR_UNSET,
+	.aggr_mode	= AGGR_UNSET,
+	.aggr_level	= 0,
 };
 
 static int __cmd_report(int argc, const char **argv)
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 75d9c73e0184..88d387200745 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -3,6 +3,8 @@
 #include "cpumap.h"
 #include "debug.h"
 #include "event.h"
+#include "header.h"
+#include "stat.h"
 #include <assert.h>
 #include <dirent.h>
 #include <stdio.h>
@@ -222,6 +224,10 @@ static int aggr_cpu_id__cmp(const void *a_pointer, const void *b_pointer)
 		return a->socket - b->socket;
 	else if (a->die != b->die)
 		return a->die - b->die;
+	else if (a->cache_lvl != b->cache_lvl)
+		return a->cache_lvl - b->cache_lvl;
+	else if (a->cache != b->cache)
+		return a->cache - b->cache;
 	else if (a->core != b->core)
 		return a->core - b->core;
 	else
@@ -305,6 +311,113 @@ struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
 	return id;
 }
 
+extern struct perf_stat_config stat_config;
+
+int cpu__get_cache_id_from_map(struct perf_cpu cpu, char *map)
+{
+	int id;
+	struct perf_cpu_map *cpu_map = perf_cpu_map__new(map);
+
+	/*
+	 * If the map contains no CPU, use the current CPU as the
+	 * ID. Otherwise, use the first online CPU of the cache
+	 * domain as the ID.
+	 */
+	if (perf_cpu_map__empty(cpu_map))
+		id = cpu.cpu;
+	else
+		id = perf_cpu_map__cpu(cpu_map, 0).cpu;
+
+	/* Free the perf_cpu_map used to find the cache ID */
+	perf_cpu_map__put(cpu_map);
+
+	return id;
+}
+
+int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache)
+{
+	int ret = 0;
+	struct cpu_cache_level caches[MAX_CACHE_LVL];
+	u32 cache_level = stat_config.aggr_level;
+	u32 i = 0, caches_cnt = 0;
+
+	cache->cache_lvl = (cache_level > MAX_CACHE_LVL) ? 0 : cache_level;
+	cache->cache = -1;
+
+	ret = build_caches_for_cpu(cpu.cpu, caches, &caches_cnt);
+	if (ret) {
+		/*
+		 * If caches_cnt is not 0, cpu_cache_level data
+		 * was allocated when building the topology.
+		 * Free the allocated data before returning.
+		 */
+		if (caches_cnt)
+			goto free_caches;
+
+		return ret;
+	}
+
+	if (!caches_cnt)
+		return -1;
+
+	/*
+	 * Save the data for the highest level if no
+	 * level was specified by the user.
+	 */
+	if (cache_level > MAX_CACHE_LVL) {
+		int max_level_index = 0;
+
+		for (i = 1; i < caches_cnt; ++i) {
+			if (caches[i].level > caches[max_level_index].level)
+				max_level_index = i;
+		}
+
+		cache->cache_lvl = caches[max_level_index].level;
+		cache->cache = cpu__get_cache_id_from_map(cpu, caches[max_level_index].map);
+
+		/* Reset i to 0 to free entire caches[] */
+		i = 0;
+		goto free_caches;
+	}
+
+	for (i = 0; i < caches_cnt; ++i) {
+		if (caches[i].level == cache_level) {
+			cache->cache_lvl = cache_level;
+			cache->cache = cpu__get_cache_id_from_map(cpu, caches[i].map);
+		}
+
+		cpu_cache_level__free(&caches[i]);
+	}
+
+free_caches:
+	/*
+	 * Free all the allocated cpu_cache_level data.
+	 */
+	while (i < caches_cnt)
+		cpu_cache_level__free(&caches[i++]);
+
+	return ret;
+}
+
+struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data)
+{
+	int ret;
+	struct aggr_cpu_id id;
+	struct perf_cache cache;
+
+	id = aggr_cpu_id__die(cpu, data);
+	if (aggr_cpu_id__is_empty(&id))
+		return id;
+
+	ret = cpu__get_cache_details(cpu, &cache);
+	if (ret)
+		return id;
+
+	id.cache_lvl = cache.cache_lvl;
+	id.cache = cache.cache;
+	return id;
+}
+
 int cpu__get_core_id(struct perf_cpu cpu)
 {
 	int value, ret = cpu__get_topology_int(cpu.cpu, "core_id", &value);
@@ -679,6 +792,8 @@ bool aggr_cpu_id__equal(const struct aggr_cpu_id *a, const struct aggr_cpu_id *b
 		a->node == b->node &&
 		a->socket == b->socket &&
 		a->die == b->die &&
+		a->cache_lvl == b->cache_lvl &&
+		a->cache == b->cache &&
 		a->core == b->core &&
 		a->cpu.cpu == b->cpu.cpu;
 }
@@ -689,6 +804,8 @@ bool aggr_cpu_id__is_empty(const struct aggr_cpu_id *a)
 		a->node == -1 &&
 		a->socket == -1 &&
 		a->die == -1 &&
+		a->cache_lvl == -1 &&
+		a->cache == -1 &&
 		a->core == -1 &&
 		a->cpu.cpu == -1;
 }
@@ -700,6 +817,8 @@ struct aggr_cpu_id aggr_cpu_id__empty(void)
 		.node = -1,
 		.socket = -1,
 		.die = -1,
+		.cache_lvl = -1,
+		.cache = -1,
 		.core = -1,
 		.cpu = (struct perf_cpu){ .cpu = -1 },
 	};
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index e3426541e0aa..1212b4ab1938 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -20,6 +20,13 @@ struct aggr_cpu_id {
 	int socket;
 	/** The die id as read from /sys/devices/system/cpu/cpuX/topology/die_id. */
 	int die;
+	/** The cache level as read from /sys/devices/system/cpu/cpuX/cache/indexY/level */
+	int cache_lvl;
+	/**
+	 * The cache instance ID, which is the first CPU in the
+	 * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
+	 */
+	int cache;
 	/** The core id as read from /sys/devices/system/cpu/cpuX/topology/core_id. */
 	int core;
 	/** CPU aggregation, note there is one CPU for each SMT thread. */
@@ -79,6 +86,20 @@ int cpu__get_socket_id(struct perf_cpu cpu);
  * /sys/devices/system/cpu/cpuX/topology/die_id for the given CPU.
  */
 int cpu__get_die_id(struct perf_cpu cpu);
+/**
+ * Calculate the cache instance ID from the map in
+ * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
+ * Cache instance ID is the first CPU reported in the shared_cpu_list file.
+ */
+int cpu__get_cache_id_from_map(struct perf_cpu cpu, char *map);
+/**
+ * cpu__get_cache_details - Returns 0 if successful in populating the
+ * cache level and cache id. The cache level is read from
+ * /sys/devices/system/cpu/cpuX/cache/indexY/level, whereas the cache
+ * instance ID is the first CPU reported by
+ * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
+ */
+int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache);
 /**
  * cpu__get_core_id - Returns the core id as read from
  * /sys/devices/system/cpu/cpuX/topology/core_id for the given CPU.
@@ -119,6 +140,13 @@ struct aggr_cpu_id aggr_cpu_id__socket(struct perf_cpu cpu, void *data);
  * aggr_cpu_id_get_t.
  */
 struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data);
+/**
+ * aggr_cpu_id__cache - Create an aggr_cpu_id with cache instache ID, cache
+ * level, die and socket populated with the cache instache ID, cache level,
+ * die and socket for cpu. The function signature is compatible with
+ * aggr_cpu_id_get_t.
+ */
+struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data);
 /**
  * aggr_cpu_id__core - Create an aggr_cpu_id with the core, die and socket
  * populated with the core, die and socket for cpu. The function signature is
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index bf5a6c14dfcd..319f456f0673 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -36,6 +36,7 @@
 
 static int aggr_header_lens[] = {
 	[AGGR_CORE] 	= 18,
+	[AGGR_CACHE]	= 22,
 	[AGGR_DIE] 	= 12,
 	[AGGR_SOCKET] 	= 6,
 	[AGGR_NODE] 	= 6,
@@ -46,6 +47,7 @@ static int aggr_header_lens[] = {
 
 static const char *aggr_header_csv[] = {
 	[AGGR_CORE] 	= 	"core,cpus,",
+	[AGGR_CACHE]	= 	"cache,cpus,",
 	[AGGR_DIE] 	= 	"die,cpus,",
 	[AGGR_SOCKET] 	= 	"socket,cpus,",
 	[AGGR_NONE] 	= 	"cpu,",
@@ -56,6 +58,7 @@ static const char *aggr_header_csv[] = {
 
 static const char *aggr_header_std[] = {
 	[AGGR_CORE] 	= 	"core",
+	[AGGR_CACHE] 	= 	"cache",
 	[AGGR_DIE] 	= 	"die",
 	[AGGR_SOCKET] 	= 	"socket",
 	[AGGR_NONE] 	= 	"cpu",
@@ -193,6 +196,10 @@ static void print_aggr_id_std(struct perf_stat_config *config,
 	case AGGR_CORE:
 		snprintf(buf, sizeof(buf), "S%d-D%d-C%d", id.socket, id.die, id.core);
 		break;
+	case AGGR_CACHE:
+		snprintf(buf, sizeof(buf), "S%d-D%d-L%d-ID%d",
+			 id.socket, id.die, id.cache_lvl, id.cache);
+		break;
 	case AGGR_DIE:
 		snprintf(buf, sizeof(buf), "S%d-D%d", id.socket, id.die);
 		break;
@@ -239,6 +246,10 @@ static void print_aggr_id_csv(struct perf_stat_config *config,
 		fprintf(output, "S%d-D%d-C%d%s%d%s",
 			id.socket, id.die, id.core, sep, aggr_nr, sep);
 		break;
+	case AGGR_CACHE:
+		fprintf(config->output, "S%d-D%d-L%d-ID%d%s%d%s",
+			id.socket, id.die, id.cache_lvl, id.cache, sep, aggr_nr, sep);
+		break;
 	case AGGR_DIE:
 		fprintf(output, "S%d-D%d%s%d%s",
 			id.socket, id.die, sep, aggr_nr, sep);
@@ -284,6 +295,10 @@ static void print_aggr_id_json(struct perf_stat_config *config,
 		fprintf(output, "\"core\" : \"S%d-D%d-C%d\", \"aggregate-number\" : %d, ",
 			id.socket, id.die, id.core, aggr_nr);
 		break;
+	case AGGR_CACHE:
+		fprintf(output, "\"cache\" : \"S%d-D%d-L%d-ID%d\", \"aggregate-number\" : %d, ",
+			id.socket, id.die, id.cache_lvl, id.cache, aggr_nr);
+		break;
 	case AGGR_DIE:
 		fprintf(output, "\"die\" : \"S%d-D%d\", \"aggregate-number\" : %d, ",
 			id.socket, id.die, aggr_nr);
@@ -1125,6 +1140,7 @@ static void print_header_interval_std(struct perf_stat_config *config,
 	case AGGR_NODE:
 	case AGGR_SOCKET:
 	case AGGR_DIE:
+	case AGGR_CACHE:
 	case AGGR_CORE:
 		fprintf(output, "#%*s %-*s cpus",
 			INTERVAL_LEN - 1, "time",
@@ -1425,6 +1441,7 @@ void evlist__print_counters(struct evlist *evlist, struct perf_stat_config *conf
 
 	switch (config->aggr_mode) {
 	case AGGR_CORE:
+	case AGGR_CACHE:
 	case AGGR_DIE:
 	case AGGR_SOCKET:
 	case AGGR_NODE:
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index e35e188237c8..7abff7cbb5a1 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -48,6 +48,7 @@ enum aggr_mode {
 	AGGR_GLOBAL,
 	AGGR_SOCKET,
 	AGGR_DIE,
+	AGGR_CACHE,
 	AGGR_CORE,
 	AGGR_THREAD,
 	AGGR_UNSET,
@@ -64,6 +65,7 @@ typedef struct aggr_cpu_id (*aggr_get_id_t)(struct perf_stat_config *config, str
 
 struct perf_stat_config {
 	enum aggr_mode		 aggr_mode;
+	u32			 aggr_level;
 	bool			 scale;
 	bool			 no_inherit;
 	bool			 identifier;
-- 
2.34.1



* [PATCH v4 3/5] perf stat: Save cache level information when running perf stat record
  2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 1/5] perf: Extract building cache level for a CPU into separate function K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology K Prateek Nayak
@ 2023-05-17 17:27 ` K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 4/5] perf stat: Add "--per-cache" aggregation option and document the same K Prateek Nayak
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-17 17:27 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy,
	eranian, irogers, puwen

When aggregating based on the cache topology, in addition to the
aggregation mode, knowing the cache level at which data is aggregated
is necessary to ensure consistency between perf stat record and a later
perf stat report. Save the cache level for aggregation as part of the
env data so that it can be retrieved later when running perf stat
report.
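
As a sketch of what gets saved (assuming the tag/val layout of struct
perf_record_stat_config_entry, whose terms this patch extends below),
running with --per-cache=L2 would synthesize a stat config entry like:

  struct perf_record_stat_config_entry entry = {
          .tag = PERF_STAT_CONFIG_TERM__AGGR_LEVEL,
          .val = 2,    /* aggregation at the L2 level */
  };

which perf stat report later reads back into config->aggr_level.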

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog:
o v3->v4:
  - Previously part of Patch 2.
---
 tools/lib/perf/include/perf/event.h | 3 ++-
 tools/perf/util/event.c             | 7 ++++---
 tools/perf/util/synthetic-events.c  | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 51b9338f4c11..ba2dcf64f4e6 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -380,7 +380,8 @@ enum {
 	PERF_STAT_CONFIG_TERM__AGGR_MODE	= 0,
 	PERF_STAT_CONFIG_TERM__INTERVAL		= 1,
 	PERF_STAT_CONFIG_TERM__SCALE		= 2,
-	PERF_STAT_CONFIG_TERM__MAX		= 3,
+	PERF_STAT_CONFIG_TERM__AGGR_LEVEL	= 3,
+	PERF_STAT_CONFIG_TERM__MAX		= 4,
 };
 
 struct perf_record_stat_config_entry {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 8ae742e32e3c..e8b0666d913c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -135,9 +135,10 @@ void perf_event__read_stat_config(struct perf_stat_config *config,
 			config->__val = event->data[i].val;	\
 			break;
 
-		CASE(AGGR_MODE, aggr_mode)
-		CASE(SCALE,     scale)
-		CASE(INTERVAL,  interval)
+		CASE(AGGR_MODE,  aggr_mode)
+		CASE(SCALE,      scale)
+		CASE(INTERVAL,   interval)
+		CASE(AGGR_LEVEL, aggr_level)
 #undef CASE
 		default:
 			pr_warning("unknown stat config term %" PRI_lu64 "\n",
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index b2e4afa5efa1..45714a2785fd 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1375,6 +1375,7 @@ int perf_event__synthesize_stat_config(struct perf_tool *tool,
 	ADD(AGGR_MODE,	config->aggr_mode)
 	ADD(INTERVAL,	config->interval)
 	ADD(SCALE,	config->scale)
+	ADD(AGGR_LEVEL,	config->aggr_level)
 
 	WARN_ONCE(i != PERF_STAT_CONFIG_TERM__MAX,
 		  "stat config terms unbalanced\n");
-- 
2.25.1



* [PATCH v4 4/5] perf stat: Add "--per-cache" aggregation option and document the same
  2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
                   ` (2 preceding siblings ...)
  2023-05-17 17:27 ` [PATCH v4 3/5] perf stat: Save cache level information when running perf stat record K Prateek Nayak
@ 2023-05-17 17:27 ` K Prateek Nayak
  2023-05-17 17:27 ` [PATCH v4 5/5] perf stat: Add tests for the "--per-cache" option K Prateek Nayak
  2023-05-17 17:58 ` [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology Ian Rogers
  5 siblings, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-17 17:27 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy,
	eranian, irogers, puwen

This patch adds support for the "--per-cache" option for aggregation
at a particular cache level and documents the same. Following is the
output of perf stat with aggregation at L3 for the event
"ls_dmnd_fills_from_sys.ext_cache_remote" on a dual-socket
3rd Generation EPYC Processor (2 x 64C/128T - 16 LLCs) when running
hackbench pinned to 4 LLCs:

  $ sudo perf stat --per-cache=L3 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- \
    taskset -c 0-15,64-79,128-143,192-207 \
    perf bench sched messaging -p -t -l 100000 -g 8

  ...

   Performance counter stats for 'system wide':
  
  S0-D0-L3-ID0             16          9,500,803      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID8             16          6,338,099      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID16            16            355,005      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID24            16             22,067      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID32            16             16,321      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID40            16             11,619      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID48            16              4,238      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID56            16             31,158      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID64            16         28,242,452      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID72            16         22,906,973      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID80            16             72,898      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID88            16             56,907      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID96            16             20,456      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID104           16             40,913      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID112           16             78,113      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID120           16             37,897      ls_dmnd_fills_from_sys.ext_cache_remote

perf stat record and perf stat report are also supported, with the
ability to specify a different cache level to aggregate at when
running perf stat report.

  $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- \
    taskset -c 0-15,64-79,128-143,192-207 \
    perf bench sched messaging -p -t -l 100000 -g 8

  ...

   Performance counter stats for 'system wide':
  
  S0-D0-L2-ID0              2          1,442,061      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID1              2          1,548,994      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID2              2          1,553,557      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID3              2          1,420,122      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID4              2          1,465,461      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID5              2          1,455,153      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID6              2          1,595,237      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID7              2          1,499,321      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID8              2          1,919,025      ls_dmnd_fills_from_sys.ext_cache_remote
  ...
  S1-D1-L2-ID127            2             21,295      ls_dmnd_fills_from_sys.ext_cache_remote

  $ sudo perf stat report --per-cache=L3

   Performance counter stats for 'perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
                                  taskset -c 0-15,64-79,128-143,192-207 \
                                  perf bench sched messaging -p -t -l 100000 -g 8':
  
  S0-D0-L3-ID0             16         11,979,906      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID8             16         14,257,202      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID16            16            377,484      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID24            16             27,224      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID32            16             26,816      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID40            16             14,461      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID48            16             10,499      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID56            16             53,817      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID64            16         27,361,987      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID72            16         37,299,024      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID80            16             84,125      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID88            16             64,561      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID96            16             13,403      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID104           16             20,138      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID112           16             93,220      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID120           16             35,465      ls_dmnd_fills_from_sys.ext_cache_remote

On the above system, the domain covered by S0-D0-L3-ID0 contains
S0-D0-L2-ID0 to S0-D0-L2-ID7, so the count for L3-ID0 equals the sum of
the counts for L2-ID0 to L2-ID7 (1,442,061 + 1,548,994 + ... +
1,499,321 = 11,979,906).

Add documentation for the newly introduced "--per-cache" option.

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog:
o v3->v4:
  - Previously part of Patch 2.
  - Fixed errors in documentation.
---
 tools/perf/Documentation/perf-stat.txt | 16 ++++++++
 tools/perf/builtin-stat.c              | 56 ++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 29bdcfa93f04..785f0e2bcfac 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -308,6 +308,14 @@ use --per-die in addition to -a. (system-wide).  The output includes the
 die number and the number of online processors on that die. This is
 useful to gauge the amount of aggregation.
 
+--per-cache::
+Aggregate counts per cache instance for system-wide mode measurements.  By
+default, the aggregation happens for the cache level at the highest index
+in the system. To specify a particular level, mention the cache level
+alongside the option in the format [Ll][1-9][0-9]*. For example:
+Using option "--per-cache=l3" or "--per-cache=L3" will aggregate the
+information at the boundary of the level 3 cache in the system.
+
 --per-core::
 Aggregate counts per physical processor for system-wide mode measurements.  This
 is a useful mode to detect imbalance between physical cores.  To enable this mode,
@@ -379,6 +387,14 @@ Aggregate counts per processor socket for system-wide mode measurements.
 --per-die::
 Aggregate counts per processor die for system-wide mode measurements.
 
+--per-cache::
+Aggregate counts per cache instance for system-wide mode measurements.  By
+default, the aggregation happens for the cache level at the highest index
+in the system. To specify a particular level, mention the cache level
+alongside the option in the format [Ll][1-9][0-9]*. For example: Using
+option "--per-cache=l3" or "--per-cache=L3" will aggregate the
+information at the boundary of the level 3 cache in the system.
+
 --per-core::
 Aggregate counts per physical processor for system-wide mode measurements.
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index b072c4160fe1..7aafea5c7e6c 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1100,6 +1100,55 @@ static int parse_hybrid_type(const struct option *opt,
 	return 0;
 }
 
+static int parse_cache_level(const struct option *opt,
+			     const char *str,
+			     int unset __maybe_unused)
+{
+	int level;
+	u32 *aggr_mode = (u32 *)opt->value;
+	u32 *aggr_level = (u32 *)opt->data;
+
+	/*
+	 * If no string is specified, aggregate based on the topology of
+	 * the Last Level Cache (LLC). Since the LLC level can change from
+	 * architecture to architecture, set the level to one greater than
+	 * MAX_CACHE_LVL, which will be interpreted as the LLC.
+	 */
+	if (str == NULL) {
+		level = MAX_CACHE_LVL + 1;
+		goto out;
+	}
+
+	/*
+	 * The format to specify cache level is LX or lX where X is the
+	 * cache level.
+	 */
+	if (strlen(str) != 2 || (str[0] != 'l' && str[0] != 'L')) {
+		pr_err("Cache level must be of form L[1-%d], or l[1-%d]\n",
+		       MAX_CACHE_LVL,
+		       MAX_CACHE_LVL);
+		return -EINVAL;
+	}
+
+	level = atoi(&str[1]);
+	if (level < 1) {
+		pr_err("Cache level must be of form L[1-%d], or l[1-%d]\n",
+		       MAX_CACHE_LVL,
+		       MAX_CACHE_LVL);
+		return -EINVAL;
+	}
+
+	if (level > MAX_CACHE_LVL) {
+		pr_err("perf only supports max cache level of %d.\n"
+		       "Consider increasing MAX_CACHE_LVL\n", MAX_CACHE_LVL);
+		return -EINVAL;
+	}
+out:
+	*aggr_mode = AGGR_CACHE;
+	*aggr_level = level;
+	return 0;
+}
+
 static struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1177,6 +1226,9 @@ static struct option stat_options[] = {
 		     "aggregate counts per processor socket", AGGR_SOCKET),
 	OPT_SET_UINT(0, "per-die", &stat_config.aggr_mode,
 		     "aggregate counts per processor die", AGGR_DIE),
+	OPT_CALLBACK_OPTARG(0, "per-cache", &stat_config.aggr_mode, &stat_config.aggr_level,
+			    "cache level", "aggregate count at this cache level (Default: LLC)",
+			    parse_cache_level),
 	OPT_SET_UINT(0, "per-core", &stat_config.aggr_mode,
 		     "aggregate counts per physical processor core", AGGR_CORE),
 	OPT_SET_UINT(0, "per-thread", &stat_config.aggr_mode,
@@ -2200,6 +2252,10 @@ static int __cmd_report(int argc, const char **argv)
 		     "aggregate counts per processor socket", AGGR_SOCKET),
 	OPT_SET_UINT(0, "per-die", &perf_stat.aggr_mode,
 		     "aggregate counts per processor die", AGGR_DIE),
+	OPT_CALLBACK_OPTARG(0, "per-cache", &perf_stat.aggr_mode, &perf_stat.aggr_level,
+			    "cache level",
+			    "aggregate count at this cache level (Default: LLC)",
+			    parse_cache_level),
 	OPT_SET_UINT(0, "per-core", &perf_stat.aggr_mode,
 		     "aggregate counts per physical processor core", AGGR_CORE),
 	OPT_SET_UINT(0, "per-node", &perf_stat.aggr_mode,
-- 
2.25.1



* [PATCH v4 5/5] perf stat: Add tests for the "--per-cache" option
  2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
                   ` (3 preceding siblings ...)
  2023-05-17 17:27 ` [PATCH v4 4/5] perf stat: Add "--per-cache" aggregation option and document the same K Prateek Nayak
@ 2023-05-17 17:27 ` K Prateek Nayak
  2023-05-17 17:58 ` [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology Ian Rogers
  5 siblings, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-17 17:27 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy,
	eranian, irogers, puwen

Add tests for the new "--per-cache" option in perf stat for CSV and JSON
generation as well as for the JSON linting.

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog:
o v3->v4:
  - Previously part of Patch 2.
---
 .../perf/tests/shell/lib/perf_json_output_lint.py  |  4 +++-
 tools/perf/tests/shell/stat+csv_output.sh          | 14 ++++++++++++++
 tools/perf/tests/shell/stat+json_output.sh         | 13 +++++++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/shell/lib/perf_json_output_lint.py b/tools/perf/tests/shell/lib/perf_json_output_lint.py
index 61f3059ca54b..4acaaed5560d 100644
--- a/tools/perf/tests/shell/lib/perf_json_output_lint.py
+++ b/tools/perf/tests/shell/lib/perf_json_output_lint.py
@@ -14,6 +14,7 @@ ap.add_argument('--system-wide', action='store_true')
 ap.add_argument('--event', action='store_true')
 ap.add_argument('--per-core', action='store_true')
 ap.add_argument('--per-thread', action='store_true')
+ap.add_argument('--per-cache', action='store_true')
 ap.add_argument('--per-die', action='store_true')
 ap.add_argument('--per-node', action='store_true')
 ap.add_argument('--per-socket', action='store_true')
@@ -47,6 +48,7 @@ def check_json_output(expected_items):
       'counter-value': lambda x: is_counter_value(x),
       'cgroup': lambda x: True,
       'cpu': lambda x: isint(x),
+      'cache': lambda x: True,
       'die': lambda x: True,
       'event': lambda x: True,
       'event-runtime': lambda x: isfloat(x),
@@ -83,7 +85,7 @@ try:
     expected_items = 7
   elif args.interval or args.per_thread or args.system_wide_no_aggr:
     expected_items = 8
-  elif args.per_core or args.per_socket or args.per_node or args.per_die:
+  elif args.per_core or args.per_socket or args.per_node or args.per_die or args.per_cache:
     expected_items = 9
   else:
     # If no option is specified, don't check the number of items.
diff --git a/tools/perf/tests/shell/stat+csv_output.sh b/tools/perf/tests/shell/stat+csv_output.sh
index fb78b6251a4e..a1969f236a0a 100755
--- a/tools/perf/tests/shell/stat+csv_output.sh
+++ b/tools/perf/tests/shell/stat+csv_output.sh
@@ -40,6 +40,7 @@ function commachecker()
 	;; "--per-socket")	exp=8
 	;; "--per-node")	exp=8
 	;; "--per-die")		exp=8
+	;; "--per-cache")	exp=8
 	esac
 
 	while read line
@@ -145,6 +146,18 @@ check_per_thread()
 	echo "[Success]"
 }
 
+check_per_cache_instance()
+{
+	echo -n "Checking CSV output: per cache instance "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat -x$csv_sep --per-cache -a true 2>&1 | commachecker --per-cache
+	echo "[Success]"
+}
+
 check_per_die()
 {
 	echo -n "Checking CSV output: per die "
@@ -222,6 +235,7 @@ if [ $skip_test -ne 1 ]
 then
 	check_system_wide_no_aggr
 	check_per_core
+	check_per_cache_instance
 	check_per_die
 	check_per_socket
 else
diff --git a/tools/perf/tests/shell/stat+json_output.sh b/tools/perf/tests/shell/stat+json_output.sh
index f3e4967cc72e..c282afa6217c 100755
--- a/tools/perf/tests/shell/stat+json_output.sh
+++ b/tools/perf/tests/shell/stat+json_output.sh
@@ -120,6 +120,18 @@ check_per_thread()
 	echo "[Success]"
 }
 
+check_per_cache_instance()
+{
+	echo -n "Checking json output: per cache instance "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat -j --per-cache -a true 2>&1 | $PYTHON $pythonchecker --per-cache
+	echo "[Success]"
+}
+
 check_per_die()
 {
 	echo -n "Checking json output: per die "
@@ -197,6 +209,7 @@ if [ $skip_test -ne 1 ]
 then
 	check_system_wide_no_aggr
 	check_per_core
+	check_per_cache_instance
 	check_per_die
 	check_per_socket
 else
-- 
2.25.1



* Re: [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology
  2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
                   ` (4 preceding siblings ...)
  2023-05-17 17:27 ` [PATCH v4 5/5] perf stat: Add tests for the "--per-cache" option K Prateek Nayak
@ 2023-05-17 17:58 ` Ian Rogers
  2023-05-18  2:13   ` K Prateek Nayak
  2023-05-23 15:31   ` Arnaldo Carvalho de Melo
  5 siblings, 2 replies; 11+ messages in thread
From: Ian Rogers @ 2023-05-17 17:58 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung, ravi.bangoria, sandipan.das,
	ananth.narayan, gautham.shenoy, eranian, puwen

On Wed, May 17, 2023 at 10:22 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> The motivation behind this feature is to aggregate data at the LLC
> level for chiplet-based processors, which currently do not expose the
> chiplet details in the sysfs CPU topology information.
>
> For completeness, the series adds the ability to aggregate data at any
> cache level. Following is an example of the output on a dual-socket
> Zen3 processor (2 x 64C/128T) with 8 chiplets per socket.
>
>   $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
>     taskset -c 0-15,64-79,128-143,192-207\
>     perf bench sched messaging -p -t -l 100000 -g 8
>
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver threads per group
>     # 8 groups == 320 threads run
>
>     Total time: 7.648 [sec]
>
>     Performance counter stats for 'system wide':
>
>     S0-D0-L3-ID0             16         17,145,912      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID8             16         14,977,628      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID16            16            262,539      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID24            16              3,140      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID32            16             27,403      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID40            16             17,026      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID48            16              7,292      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID56            16              2,464      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID64            16         22,489,306      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID72            16         21,455,257      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID80            16             11,619      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID88            16             30,978      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID96            16             37,628      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID104           16             13,594      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID112           16             10,164      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID120           16             11,259      ls_dmnd_fills_from_sys.ext_cache_remote
>
>           7.779171484 seconds time elapsed
>
> The series also adds support for perf stat record and perf stat report
> to aggregate data at various cache levels. The following is an example
> of recording with aggregation at the L2 level and reporting the same
> data with aggregation at the L3 level.
>
>   $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
>     taskset -c 0-15,64-79,128-143,192-207\
>     perf bench sched messaging -p -t -l 100000 -g 8
>
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver threads per group
>     # 8 groups == 320 threads run
>
>     Total time: 7.318 [sec]
>
>     Performance counter stats for 'system wide':
>
>     S0-D0-L2-ID0              2          2,171,980      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID1              2          2,048,494      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID2              2          2,120,293      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID3              2          2,224,725      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID4              2          2,021,618      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID5              2          1,995,331      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID6              2          2,163,029      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID7              2          2,104,623      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID8              2          1,948,776      ls_dmnd_fills_from_sys.ext_cache_remote
>     ...
>     S0-D0-L2-ID63             2              2,648      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID64             2          2,963,323      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID65             2          2,856,629      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID66             2          2,901,725      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID67             2          3,046,120      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID68             2          2,637,971      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID69             2          2,680,029      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID70             2          2,672,259      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID71             2          2,638,768      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID72             2          3,308,642      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID73             2          3,064,473      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID74             2          3,023,379      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID75             2          2,975,119      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID76             2          2,952,677      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID77             2          2,981,695      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID78             2          3,455,916      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID79             2          2,959,540      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID80             2              4,977      ls_dmnd_fills_from_sys.ext_cache_remote
>     ...
>     S1-D1-L2-ID127            2              3,359      ls_dmnd_fills_from_sys.ext_cache_remote
>
>           7.451725897 seconds time elapsed
>
>   $ sudo perf stat report --per-cache=L3
>
>     Performance counter stats for '...':
>
>     S0-D0-L3-ID0             16         16,850,093      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID8             16         16,001,493      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID16            16            301,011      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID24            16             26,276      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID32            16             48,958      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID40            16             43,799      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID48            16             16,771      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID56            16             12,544      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID64            16         22,396,824      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID72            16         24,721,441      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID80            16             29,426      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID88            16             54,348      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID96            16             41,557      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID104           16             10,084      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID112           16             14,361      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID120           16             24,446      ls_dmnd_fills_from_sys.ext_cache_remote
>
>            7.451725897 seconds time elapsed
>
> The aggregate at S0-D0-L3-ID0 is the sum of S0-D0-L2-ID0 to S0-D0-L2-ID7,
> as the L3 containing CPU0 spans the L2 instances of CPU0 through CPU7.
>
> Cache IDs are derived from the shared_cpu_list file in the cache
> topology. This allows for --per-cache aggregation of data on a kernel
> which does not expose the cache instance ID in sysfs. Running perf
> stat on the same system with the cache instance ID hidden gives the
> following output:
>
>   $ ls /sys/devices/system/cpu/cpu0/cache/index0/
>
>     coherency_line_size  level  number_of_sets  physical_line_partition
>     shared_cpu_list  shared_cpu_map  size  type  uevent
>     ways_of_associativity
>
>   $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
>     taskset -c 0-15,64-79,128-143,192-207\
>     perf bench sched messaging -p -t -l 100000 -g 8
>
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver threads per group
>     # 8 groups == 320 threads run
>
>          Total time: 6.949 [sec]
>
>      Performance counter stats for 'system wide':
>
>     S0-D0-L3-ID0             16          5,297,615      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID8             16          4,347,868      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID16            16            416,593      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID24            16              4,346      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID32            16              5,506      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID40            16             15,845      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID48            16             24,164      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID56            16              4,543      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID64            16         41,610,374      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID72            16         38,393,688      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID80            16             22,188      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID88            16             22,918      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID96            16             39,230      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID104           16              6,236      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID112           16             66,846      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID120           16             72,713      ls_dmnd_fills_from_sys.ext_cache_remote
>
>            7.098471410 seconds time elapsed
>
> A few notes:
>
> - This series makes a breaking change when saving the aggregation
>   details, as the cache level needs to be saved along with the
>   aggregation method.
>
> - This series assumes that caches at the same level are shared by the
>   same set of threads. The implementation will run into an issue if,
>   say, L1i is thread-local but L1d is shared by the SMT siblings on a
>   core.
>
> This series applies cleanly on top of the perf-tools branch of Arnaldo's tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
> at commit 760ebc45746b ("perf lock contention: Add empty 'struct rq' to
> satisfy libbpf 'runqueue' type verification")
> ---
> Changelog:
> o v3->v4:
>   - Dropped the RFC tag.
>   - Broke down Patch 2 from v3 into smaller patches (kind of!)
>   - Fixed a couple of errors in docs and comments.
>
> o v2->v3:
>   - Dropped patches 1 and 2 that saved and retrieved the cache instance
>     ID when saving the cache data.
>   - The above is unnecessary as the IDs are being derived from the first
>     online CPU in the cache domain for a given cache instance.
>   - Improvements to handling cases where a cache level is not present
>     but the level is allowed by MAX_CACHE_LVL.
>   - Updated details in cover letter.
>
> o v1->v2
>   - Set cache instance ID to 0 if the file cannot be read.
>   - Fix cache level parsing function.
>   - Updated details in cover letter.
> ---
> K Prateek Nayak (5):
>   perf: Extract building cache level for a CPU into separate function
>   perf stat: Setup the foundation to allow aggregation based on cache
>     topology
>   perf stat: Save cache level information when running perf stat record
>   perf stat: Add "--per-cache" aggregation option and document the same
>   perf stat: Add tests for the "--per-cache" option

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

>  tools/lib/perf/include/perf/cpumap.h          |   5 +
>  tools/lib/perf/include/perf/event.h           |   3 +-
>  tools/perf/Documentation/perf-stat.txt        |  16 ++
>  tools/perf/builtin-stat.c                     | 144 +++++++++++++++++-
>  .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
>  tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
>  tools/perf/tests/shell/stat+json_output.sh    |  13 ++
>  tools/perf/util/cpumap.c                      | 119 +++++++++++++++
>  tools/perf/util/cpumap.h                      |  28 ++++
>  tools/perf/util/event.c                       |   7 +-
>  tools/perf/util/header.c                      |  62 +++++---
>  tools/perf/util/header.h                      |   4 +
>  tools/perf/util/stat-display.c                |  17 +++
>  tools/perf/util/stat.h                        |   2 +
>  tools/perf/util/synthetic-events.c            |   1 +
>  15 files changed, 409 insertions(+), 30 deletions(-)
>
> --
> 2.34.1
>
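
For reference, the cache-ID derivation the cover letter describes can be
sketched in a few lines of Python. The helper below is illustrative, not
from the series; it assumes the standard sysfs cache layout and that
indexY maps to the cache level of interest (index3 is typically L3):

  # Rough sketch (not perf source): derive a cache instance ID by
  # taking the first CPU listed in
  # /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list.
  def cache_instance_id(cpu: int, index: int) -> int:
      path = (f"/sys/devices/system/cpu/cpu{cpu}"
              f"/cache/index{index}/shared_cpu_list")
      with open(path) as f:
          # shared_cpu_list looks like "8-15,136-143"; the first CPU
          # of the first range identifies the cache instance
          first_range = f.read().strip().split(",")[0]
      return int(first_range.split("-")[0])

  # On the Zen3 system above, cache_instance_id(9, 3) would return 8,
  # which is why the L3 spanning CPUs 8-15 is reported as S0-D0-L3-ID8.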

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology
  2023-05-17 17:58 ` [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology Ian Rogers
@ 2023-05-18  2:13   ` K Prateek Nayak
  2023-05-23 15:31   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-18  2:13 UTC (permalink / raw)
  To: Ian Rogers
  Cc: linux-perf-users, linux-kernel, acme, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung, ravi.bangoria, sandipan.das,
	ananth.narayan, gautham.shenoy, eranian, puwen

Hello Ian,

On 5/17/2023 11:28 PM, Ian Rogers wrote:
> On Wed, May 17, 2023 at 10:22 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> [..snip..]
> 
> Acked-by: Ian Rogers <irogers@google.com>

Thank you for taking a look at the series and for the ack :)

> 
> Thanks,
> Ian
> 
>>  [..snip..]
>>
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology
  2023-05-17 17:58 ` [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology Ian Rogers
  2023-05-18  2:13   ` K Prateek Nayak
@ 2023-05-23 15:31   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 11+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-05-23 15:31 UTC (permalink / raw)
  To: Ian Rogers
  Cc: K Prateek Nayak, linux-perf-users, linux-kernel, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung, ravi.bangoria,
	sandipan.das, ananth.narayan, gautham.shenoy, eranian, puwen

Em Wed, May 17, 2023 at 10:58:01AM -0700, Ian Rogers escreveu:
> On Wed, May 17, 2023 at 10:22 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > K Prateek Nayak (5):
> >   perf: Extract building cache level for a CPU into separate function
> >   perf stat: Setup the foundation to allow aggregation based on cache
> >     topology
> >   perf stat: Save cache level information when running perf stat record
> >   perf stat: Add "--per-cache" aggregation option and document the same
> >   perf stat: Add tests for the "--per-cache" option
 
> Acked-by: Ian Rogers <irogers@google.com>

Thanks, great, documented, with accompanying 'perf test' entries, applied.

- Arnaldo


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology
  2023-05-17 17:27 ` [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology K Prateek Nayak
@ 2023-05-23 19:12   ` Arnaldo Carvalho de Melo
  2023-05-24  3:00     ` K Prateek Nayak
  0 siblings, 1 reply; 11+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-05-23 19:12 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-perf-users, linux-kernel, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung, ravi.bangoria, sandipan.das,
	ananth.narayan, gautham.shenoy, eranian, irogers, puwen

Em Wed, May 17, 2023 at 10:57:42PM +0530, K Prateek Nayak escreveu:
> Processors based on a chiplet architecture, such as AMD EPYC and Hygon,
> do not expose chiplet details in the sysfs CPU topology information.
> However, this information can be derived from the per-CPU cache-level
> information in sysfs.
> 
> perf stat already supports aggregation based on topology information
> using core ID, socket ID, etc. It is useful to aggregate based on the
> cache topology to detect problems like imbalance and cache-to-cache
> sharing at various cache levels.
> 
> This patch lays the foundation for aggregating data in perf stat based
> on the processor's cache topology. The cmdline option to aggregate data
> based on the cache topology is added in Patch 4 of the series while this
> patch sets up all the necessary functions and variables required to
> support the new aggregation option.
> 
> The patch also adds support to display per-cache aggregation, or to save
> it as JSON or CSV, as splitting it into a separate patch would break
> builds when compiling with "-Werror=switch-enum" where the compiler will
> complain about the lack of handling for the AGGR_CACHE case in the
> output functions.
> 
> Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog:
> o v3->v4:
>   - Some parts of the previous Patch 2 have been put into subsequent
>     smaller patches (while being careful not to introduce any build
>     errors in case someone were to bisect through the series)
>   - Fixed comments.

So I had to make the following changes, added this explanation to the
resulting cset:

    Committer notes:

    Don't use perf_stat_config in tools/perf/util/cpumap.c: that would make
    code in util/, which is not specific to a single builtin, depend on a
    specific builtin's config structure.

    Move the functions introduced in this patch out of
    tools/perf/util/cpumap.c, since they need access to builtin-specific
    state and are not strictly needed in the util/ directory.

    With this 'perf test python' is back building.

- Arnaldo

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 68294ea499ae51d9..0528d1bc15d27705 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -150,7 +150,7 @@ static struct perf_stat		perf_stat;
 
 static volatile sig_atomic_t done = 0;
 
-struct perf_stat_config stat_config = {
+static struct perf_stat_config stat_config = {
 	.aggr_mode		= AGGR_GLOBAL,
 	.aggr_level		= MAX_CACHE_LVL + 1,
 	.scale			= true,
@@ -1251,6 +1251,129 @@ static struct option stat_options[] = {
 	OPT_END()
 };
 
+/**
+ * Calculate the cache instance ID from the map in
+ * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
+ * Cache instance ID is the first CPU reported in the shared_cpu_list file.
+ */
+static int cpu__get_cache_id_from_map(struct perf_cpu cpu, char *map)
+{
+	int id;
+	struct perf_cpu_map *cpu_map = perf_cpu_map__new(map);
+
+	/*
+	 * If the map contains no CPU, consider the current CPU to
+	 * be the first online CPU in the cache domain else use the
+	 * first online CPU of the cache domain as the ID.
+	 */
+	if (perf_cpu_map__empty(cpu_map))
+		id = cpu.cpu;
+	else
+		id = perf_cpu_map__cpu(cpu_map, 0).cpu;
+
+	/* Free the perf_cpu_map used to find the cache ID */
+	perf_cpu_map__put(cpu_map);
+
+	return id;
+}
+
+/**
+ * cpu__get_cache_details - Returns 0 if successful in populating the
+ * cache level and cache id. Cache level is read from
+ * /sys/devices/system/cpu/cpuX/cache/indexY/level, whereas the cache instance ID
+ * is the first CPU reported by
+ * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
+ */
+static int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache)
+{
+	int ret = 0;
+	u32 cache_level = stat_config.aggr_level;
+	struct cpu_cache_level caches[MAX_CACHE_LVL];
+	u32 i = 0, caches_cnt = 0;
+
+	cache->cache_lvl = (cache_level > MAX_CACHE_LVL) ? 0 : cache_level;
+	cache->cache = -1;
+
+	ret = build_caches_for_cpu(cpu.cpu, caches, &caches_cnt);
+	if (ret) {
+		/*
+		 * If caches_cnt is not 0, cpu_cache_level data
+		 * was allocated when building the topology.
+		 * Free the allocated data before returning.
+		 */
+		if (caches_cnt)
+			goto free_caches;
+
+		return ret;
+	}
+
+	if (!caches_cnt)
+		return -1;
+
+	/*
+	 * Save the data for the highest level if no
+	 * level was specified by the user.
+	 */
+	if (cache_level > MAX_CACHE_LVL) {
+		int max_level_index = 0;
+
+		for (i = 1; i < caches_cnt; ++i) {
+			if (caches[i].level > caches[max_level_index].level)
+				max_level_index = i;
+		}
+
+		cache->cache_lvl = caches[max_level_index].level;
+		cache->cache = cpu__get_cache_id_from_map(cpu, caches[max_level_index].map);
+
+		/* Reset i to 0 to free entire caches[] */
+		i = 0;
+		goto free_caches;
+	}
+
+	for (i = 0; i < caches_cnt; ++i) {
+		if (caches[i].level == cache_level) {
+			cache->cache_lvl = cache_level;
+			cache->cache = cpu__get_cache_id_from_map(cpu, caches[i].map);
+		}
+
+		cpu_cache_level__free(&caches[i]);
+	}
+
+free_caches:
+	/*
+	 * Free all the allocated cpu_cache_level data.
+	 */
+	while (i < caches_cnt)
+		cpu_cache_level__free(&caches[i++]);
+
+	return ret;
+}
+
+/**
+ * aggr_cpu_id__cache - Create an aggr_cpu_id with cache instance ID, cache
+ * level, die and socket populated with the cache instance ID, cache level,
+ * die and socket for cpu. The function signature is compatible with
+ * aggr_cpu_id_get_t.
+ */
+static struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data)
+{
+	int ret;
+	struct aggr_cpu_id id;
+	struct perf_cache cache;
+
+	id = aggr_cpu_id__die(cpu, data);
+	if (aggr_cpu_id__is_empty(&id))
+		return id;
+
+	ret = cpu__get_cache_details(cpu, &cache);
+	if (ret)
+		return id;
+
+	id.cache_lvl = cache.cache_lvl;
+	id.cache = cache.cache;
+	return id;
+}
+
 static const char *const aggr_mode__string[] = {
 	[AGGR_CORE] = "core",
 	[AGGR_CACHE] = "cache",
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 88d387200745de2f..a0719816a218d441 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -3,8 +3,6 @@
 #include "cpumap.h"
 #include "debug.h"
 #include "event.h"
-#include "header.h"
-#include "stat.h"
 #include <assert.h>
 #include <dirent.h>
 #include <stdio.h>
@@ -311,113 +309,6 @@ struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
 	return id;
 }
 
-extern struct perf_stat_config stat_config;
-
-int cpu__get_cache_id_from_map(struct perf_cpu cpu, char *map)
-{
-	int id;
-	struct perf_cpu_map *cpu_map = perf_cpu_map__new(map);
-
-	/*
-	 * If the map contains no CPU, consider the current CPU to
-	 * be the first online CPU in the cache domain else use the
-	 * first online CPU of the cache domain as the ID.
-	 */
-	if (perf_cpu_map__empty(cpu_map))
-		id = cpu.cpu;
-	else
-		id = perf_cpu_map__cpu(cpu_map, 0).cpu;
-
-	/* Free the perf_cpu_map used to find the cache ID */
-	perf_cpu_map__put(cpu_map);
-
-	return id;
-}
-
-int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache)
-{
-	int ret = 0;
-	struct cpu_cache_level caches[MAX_CACHE_LVL];
-	u32 cache_level = stat_config.aggr_level;
-	u32 i = 0, caches_cnt = 0;
-
-	cache->cache_lvl = (cache_level > MAX_CACHE_LVL) ? 0 : cache_level;
-	cache->cache = -1;
-
-	ret = build_caches_for_cpu(cpu.cpu, caches, &caches_cnt);
-	if (ret) {
-		/*
-		 * If caches_cnt is not 0, cpu_cache_level data
-		 * was allocated when building the topology.
-		 * Free the allocated data before returning.
-		 */
-		if (caches_cnt)
-			goto free_caches;
-
-		return ret;
-	}
-
-	if (!caches_cnt)
-		return -1;
-
-	/*
-	 * Save the data for the highest level if no
-	 * level was specified by the user.
-	 */
-	if (cache_level > MAX_CACHE_LVL) {
-		int max_level_index = 0;
-
-		for (i = 1; i < caches_cnt; ++i) {
-			if (caches[i].level > caches[max_level_index].level)
-				max_level_index = i;
-		}
-
-		cache->cache_lvl = caches[max_level_index].level;
-		cache->cache = cpu__get_cache_id_from_map(cpu, caches[max_level_index].map);
-
-		/* Reset i to 0 to free entire caches[] */
-		i = 0;
-		goto free_caches;
-	}
-
-	for (i = 0; i < caches_cnt; ++i) {
-		if (caches[i].level == cache_level) {
-			cache->cache_lvl = cache_level;
-			cache->cache = cpu__get_cache_id_from_map(cpu, caches[i].map);
-		}
-
-		cpu_cache_level__free(&caches[i]);
-	}
-
-free_caches:
-	/*
-	 * Free all the allocated cpu_cache_level data.
-	 */
-	while (i < caches_cnt)
-		cpu_cache_level__free(&caches[i++]);
-
-	return ret;
-}
-
-struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data)
-{
-	int ret;
-	struct aggr_cpu_id id;
-	struct perf_cache cache;
-
-	id = aggr_cpu_id__die(cpu, data);
-	if (aggr_cpu_id__is_empty(&id))
-		return id;
-
-	ret = cpu__get_cache_details(cpu, &cache);
-	if (ret)
-		return id;
-
-	id.cache_lvl = cache.cache_lvl;
-	id.cache = cache.cache;
-	return id;
-}
-
 int cpu__get_core_id(struct perf_cpu cpu)
 {
 	int value, ret = cpu__get_topology_int(cpu.cpu, "core_id", &value);
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index 1212b4ab19386293..f394ccc0ccfbca4c 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -86,20 +86,6 @@ int cpu__get_socket_id(struct perf_cpu cpu);
  * /sys/devices/system/cpu/cpuX/topology/die_id for the given CPU.
  */
 int cpu__get_die_id(struct perf_cpu cpu);
-/**
- * Calculate the cache instance ID from the map in
- * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
- * Cache instance ID is the first CPU reported in the shared_cpu_list file.
- */
-int cpu__get_cache_id_from_map(struct perf_cpu cpu, char *map);
-/**
- * cpu__get_cache_id - Returns 0 if successful in populating the
- * cache level and cache id. Cache level is read from
- * /sys/devices/system/cpu/cpuX/cache/indexY/level where as cache instance ID
- * is the first CPU reported by
- * /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_list
- */
-int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache);
 /**
  * cpu__get_core_id - Returns the core id as read from
  * /sys/devices/system/cpu/cpuX/topology/core_id for the given CPU.
@@ -140,13 +126,6 @@ struct aggr_cpu_id aggr_cpu_id__socket(struct perf_cpu cpu, void *data);
  * aggr_cpu_id_get_t.
  */
 struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data);
-/**
- * aggr_cpu_id__cache - Create an aggr_cpu_id with cache instache ID, cache
- * level, die and socket populated with the cache instache ID, cache level,
- * die and socket for cpu. The function signature is compatible with
- * aggr_cpu_id_get_t.
- */
-struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data);
 /**
  * aggr_cpu_id__core - Create an aggr_cpu_id with the core, die and socket
  * populated with the core, die and socket for cpu. The function signature is
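
The level-selection policy in cpu__get_cache_details() above can be
paraphrased in a few lines of Python (a sketch, not the tool's code;
MAX_CACHE_LVL mirrors the perf-internal constant):

  MAX_CACHE_LVL = 4  # mirrors the constant in tools/perf/util/header.h

  # Paraphrase of cpu__get_cache_details(): if the requested level is
  # out of range (i.e. no explicit --per-cache=Ln), fall back to the
  # highest level built for the CPU; otherwise take an exact match.
  def pick_cache(caches, requested_level):
      # caches: list of (level, cache_id) pairs built for one CPU
      if not caches:
          return None
      if requested_level > MAX_CACHE_LVL:
          return max(caches, key=lambda c: c[0])
      matches = [c for c in caches if c[0] == requested_level]
      return matches[0] if matches else None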

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology
  2023-05-23 19:12   ` Arnaldo Carvalho de Melo
@ 2023-05-24  3:00     ` K Prateek Nayak
  0 siblings, 0 replies; 11+ messages in thread
From: K Prateek Nayak @ 2023-05-24  3:00 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: linux-perf-users, linux-kernel, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung, ravi.bangoria, sandipan.das,
	ananth.narayan, gautham.shenoy, eranian, irogers, puwen

Hello Arnaldo,

On 5/24/2023 12:42 AM, Arnaldo Carvalho de Melo wrote:
> Em Wed, May 17, 2023 at 10:57:42PM +0530, K Prateek Nayak escreveu:
>> [..snip..]
> 
> So I had to make the following changes, added this explanation to the
> resulting cset:
> 
>     Committer notes:
> 
>     Don't use perf_stat_config in tools/perf/util/cpumap.c: that would make
>     code in util/, which is not specific to a single builtin, depend on a
>     specific builtin's config structure.
> 
>     Move the functions introduced in this patch out of
>     tools/perf/util/cpumap.c, since they need access to builtin-specific
>     state and are not strictly needed in the util/ directory.
> 
>     With this 'perf test python' is back building.
> 
> - Arnaldo

An oversight on my part. Sorry about that. Thank you for fixing this and
picking up the changes :)

> 
> [..snip..]

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-05-24  3:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-17 17:27 [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
2023-05-17 17:27 ` [PATCH v4 1/5] perf: Extract building cache level for a CPU into separate function K Prateek Nayak
2023-05-17 17:27 ` [PATCH v4 2/5] perf stat: Setup the foundation to allow aggregation based on cache topology K Prateek Nayak
2023-05-23 19:12   ` Arnaldo Carvalho de Melo
2023-05-24  3:00     ` K Prateek Nayak
2023-05-17 17:27 ` [PATCH v4 3/5] perf stat: Save cache level information when running perf stat record K Prateek Nayak
2023-05-17 17:27 ` [PATCH v4 4/5] perf stat: Add "--per-cache" aggregation option and document the same K Prateek Nayak
2023-05-17 17:27 ` [PATCH v4 5/5] perf stat: Add tests for the "--per-cache" option K Prateek Nayak
2023-05-17 17:58 ` [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology Ian Rogers
2023-05-18  2:13   ` K Prateek Nayak
2023-05-23 15:31   ` Arnaldo Carvalho de Melo
