linux-perf-users.vger.kernel.org archive mirror
* [PATCH 0/5] perf sched: Introduce stats tool
@ 2024-09-16 16:47 Ravi Bangoria
  2024-09-16 16:47 ` [PATCH 1/5] sched/stats: Print domain name in /proc/schedstat Ravi Bangoria
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Ravi Bangoria @ 2024-09-16 16:47 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

MOTIVATION
----------

The existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long-running or scheduler-intensive workloads. For example, `perf sched
record` has ~7.77% overhead on hackbench (with 25 groups each running 700K
loops on a 2-socket 128-core 256-thread 3rd Generation EPYC server), and it
generates a huge 56G perf.data for which perf takes ~137 mins to prepare
and write it to disk [1].

Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
and generates samples on each tracepoint hit, `perf sched stats record` takes
a snapshot of the /proc/schedstat file before and after the workload, i.e.
there is almost zero interference with the workload run. Also, it takes very
little time to parse /proc/schedstat, convert it into perf samples and
save those samples into the perf.data file. The resulting perf.data file is
much smaller. So, overall, `perf sched stats record` is much more lightweight
compared to `perf sched record`.

We, internally at AMD, have been using this (a variant of this, known as
"sched-scoreboard"[2]) and found it to be very useful for analysing the
impact of scheduler code changes[3][4].

Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

  # perf sched stats record
  # perf sched stats report

Note: Although the `perf sched stats` tool supports the workload profiling
syntax (i.e. `-- <workload>`), the recorded profile is still systemwide
since /proc/schedstat is a systemwide file.

HOW TO INTERPRET THE REPORT
---------------------------

The `perf sched stats report` output starts with the total time profiling
was active, in jiffies:

  ----------------------------------------------------------------------------------------------------
  Time elapsed (in jiffies)                                   :       24537
  ----------------------------------------------------------------------------------------------------

Next come the CPU scheduling statistics. These are simple diffs of the
/proc/schedstat CPU lines, along with a description. The report also
prints each stat as a percentage relative to its base stat.

In the example below, schedule() left CPU0 idle 98.19% of the time,
16.54% of all try_to_wake_up() calls woke up the local CPU, and the total
wait time of tasks on CPU0 is 0.49% of their total runtime on the same
CPU.

  ----------------------------------------------------------------------------------------------------
  CPU 0
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                         :           0
  Legacy counter can be ignored                               :           0
  schedule() called                                           :       17138
  schedule() left the processor idle                          :       16827 ( 98.19% )
  try_to_wake_up() was called                                 :         508
  try_to_wake_up() was called to wake up the local cpu        :          84 ( 16.54% )
  total runtime by tasks on this processor (in jiffies)       :  2408959243
  total waittime by tasks on this processor (in jiffies)      :    11731825 ( 0.49% )
  total timeslices run on this cpu                            :         311
  ----------------------------------------------------------------------------------------------------

Next come the load balancing statistics. For each sched domain
(e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics in
the following three categories:

  1) Idle Load Balance: Load balancing performed on behalf of a long
                        idling CPU by some other CPU.
  2) Busy Load Balance: Load balancing performed when the CPU was busy.
  3) New Idle Balance : Load balancing performed when a CPU just became
                        idle.

Within each of these three categories, the sched stats report provides
several load balancing statistics. Along with the direct stats, the
report also contains derived metrics prefixed with *. Example:

  ----------------------------------------------------------------------------------------------------
  CPU 0 DOMAIN SMT CPUS <0, 64>
  ----------------------------------------- <Category idle> ------------------------------------------
  load_balance() count on cpu idle                                 :          50   $      490.74 $
  load_balance() found balanced on cpu idle                        :          42   $      584.21 $
  load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
  imbalance sum on cpu idle                                        :           8
  pull_task() count on cpu idle                                    :           0
  pull_task() when target task was cache-hot on cpu idle           :           0
  load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
  *load_balance() success count on cpu idle                        :           0
  *avg task pulled per successful lb attempt (cpu idle)            :        0.00
  ----------------------------------------- <Category busy> ------------------------------------------
  load_balance() count on cpu busy                                 :           2   $    12268.50 $
  load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
  load_balance() move task failed on cpu busy                      :           0   $        0.00 $
  imbalance sum on cpu busy                                        :           0
  pull_task() count on cpu busy                                    :           0
  pull_task() when target task was cache-hot on cpu busy           :           0
  load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
  *load_balance() success count on cpu busy                        :           0
  *avg task pulled per successful lb attempt (cpu busy)            :        0.00
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         427   $       57.46 $
  load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
  load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
  imbalance sum on cpu newly idle                                  :          48
  pull_task() count on cpu newly idle                              :           0
  pull_task() when target task was cache-hot on cpu newly idle     :           0
  load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
  *load_balance() success count on cpu newly idle                  :           0
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
  ----------------------------------------------------------------------------------------------------

Consider the following line:

  load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $

While profiling was active, the load balancer found 382 times that the load
needed to be balanced on the newly idle CPU 0. The value enclosed in $ signs
is the average number of jiffies between two such events (24537 / 382 = 64.23).

Next are the active_load_balance() stats. alb did not trigger while
profiling was active, hence all the counts are 0.

  --------------------------------- <Category active_load_balance()> ---------------------------------
  active_load_balance() count                                      :           0
  active_load_balance() move task failed                           :           0
  active_load_balance() successfully moved a task                  :           0
  ----------------------------------------------------------------------------------------------------

Next are the sched_balance_exec() and sched_balance_fork() stats. They are
unused, but we kept them in the RFC for legacy purposes. Unless opposed,
we plan to remove them in the next revision.

Next are wakeup statistics. For every domain, the report also shows
task-wakeup statistics. Example:

  ------------------------------------------- <Wakeup Info> ------------------------------------------
  try_to_wake_up() awoke a task that last ran on a diff cpu       :       12070
  try_to_wake_up() moved task because cache-cold on own cpu       :        3324
  try_to_wake_up() started passive balancing                      :           0
  ----------------------------------------------------------------------------------------------------

The same set of stats is reported for each CPU and each domain level.

RFC: https://lore.kernel.org/r/20240508060427.417-1-ravi.bangoria@amd.com
RFC->v1:
 - [Kernel] Print domain name along with domain number in /proc/schedstat
   file.
 - s/schedstat/stats/ for the subcommand.
 - Record domain name and cpumask details, also show them in report.
 - Add CPU filtering capability at record and report time.
 - Add /proc/schedstat v16 support.
 - Live mode support. Similar to perf stat command, live mode prints the
   sched stats on the stdout.
 - Add pager support in `perf sched stats report` for better scrolling.
 - Some minor cosmetic changes in report output to improve readability.
 - Rebase to latest perf-tools-next/perf-tools-next (1de5b5dcb835).

TODO:
 - Add perf unit tests to test basic sched stats functionality.
 - Describe the new tool, its usage and the interpretation of report data in
   the perf-sched man page.
 - Currently the sched stats tool provides statistics for only one run, but
   we are planning to add `perf sched stats diff`, which can compare the data
   of two different runs (possibly good and bad) and highlight where
   scheduler decisions are impacting workload performance.
 - perf sched stats records /proc/schedstat, which holds CPU- and
   domain-level scheduler statistics. We are planning to add a taskstat tool
   which reads task stats from procfs and generates a scheduler statistics
   report at task granularity. This will probably be a standalone tool,
   something like `perf sched taskstat record/report`.
 - Except for pre-processor related checkpatch warnings, we have addressed
   most of the other possible warnings.

Patches are prepared on perf-tools-next/perf-tools-next (1de5b5dcb835).

Apologies for the long delay in respinning. sched-ext was proposed while we
were working on the next revision, so we held off for a moment to let the
dust settle and get a clear idea of whether the new tool would be useful or not.

[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amd.com/
[4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/


K Prateek Nayak (1):
  sched/stats: Print domain name in /proc/schedstat

Swapnil Sapkal (4):
  perf sched stats: Add record and rawdump support
  perf sched stats: Add schedstat v16 support
  perf sched stats: Add support for report subcommand
  perf sched stats: Add support for live mode

 Documentation/scheduler/sched-stats.rst       |   8 +-
 kernel/sched/stats.c                          |   6 +-
 tools/lib/perf/Documentation/libperf.txt      |   2 +
 tools/lib/perf/Makefile                       |   2 +-
 tools/lib/perf/include/perf/event.h           |  56 ++
 .../lib/perf/include/perf/schedstat-cpu-v15.h |  22 +
 .../lib/perf/include/perf/schedstat-cpu-v16.h |  22 +
 .../perf/include/perf/schedstat-domain-v15.h  | 121 +++
 .../perf/include/perf/schedstat-domain-v16.h  | 121 +++
 tools/perf/builtin-inject.c                   |   2 +
 tools/perf/builtin-sched.c                    | 778 +++++++++++++++++-
 tools/perf/util/event.c                       | 104 +++
 tools/perf/util/event.h                       |   2 +
 tools/perf/util/session.c                     |  20 +
 tools/perf/util/synthetic-events.c            | 255 ++++++
 tools/perf/util/synthetic-events.h            |   3 +
 tools/perf/util/tool.c                        |  20 +
 tools/perf/util/tool.h                        |   4 +-
 18 files changed, 1542 insertions(+), 6 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v16.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v16.h

-- 
2.46.0



* [PATCH 1/5] sched/stats: Print domain name in /proc/schedstat
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
@ 2024-09-16 16:47 ` Ravi Bangoria
  2024-09-16 16:47 ` [PATCH 2/5] perf sched stats: Add record and rawdump support Ravi Bangoria
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Ravi Bangoria @ 2024-09-16 16:47 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

From: K Prateek Nayak <kprateek.nayak@amd.com>

Currently, there is no straightforward way to extract the names of the
sched domains and match them to the per-CPU domain entries in
/proc/schedstat, other than looking at the debugfs files, which are only
visible after enabling "verbose" debug following commit 34320745dfc9
("sched/debug: Put sched/domains files under the verbose flag").

Since tools like `perf sched stats` require displaying per-domain
information in a user friendly manner, display the name of each sched
domain, alongside its level, in /proc/schedstat if CONFIG_SCHED_DEBUG is
enabled.

Domain names also make the /proc/schedstat data unambiguous when some
of the CPUs are offline. For example, on a 128-CPU AMD Zen3 machine
where CPU0 and CPU64 are SMT siblings and CPU64 is offline:

Before:
    cpu0 ...
    domain0 ...
    domain1 ...
    cpu1 ...
    domain0 ...
    domain1 ...
    domain2 ...

After:
    cpu0 ...
    domain0:MC ...
    domain1:PKG ...
    cpu1 ...
    domain0:SMT ...
    domain1:MC ...
    domain2:PKG ...

The schedstat version has not been updated since this change merely adds
additional information to the existing domain field and does not add a new
field altogether.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 Documentation/scheduler/sched-stats.rst | 8 ++++++--
 kernel/sched/stats.c                    | 6 +++++-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/Documentation/scheduler/sched-stats.rst b/Documentation/scheduler/sched-stats.rst
index 7c2b16c4729d..b60a3e7bc108 100644
--- a/Documentation/scheduler/sched-stats.rst
+++ b/Documentation/scheduler/sched-stats.rst
@@ -6,6 +6,8 @@ Version 16 of schedstats changed the order of definitions within
 'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
 columns in show_schedstat(). In particular the position of CPU_IDLE
 and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
+With CONFIG_SCHED_DEBUG enabled, the domain field can also print the
+name of the corresponding sched domain.
 
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
@@ -71,9 +73,11 @@ Domain statistics
 -----------------
 One of these is produced per domain for each cpu described. (Note that if
 CONFIG_SMP is not defined, *no* domains are utilized and these lines
-will not appear in the output.)
+will not appear in the output. [:<name>] is an optional extension to the domain
+field that prints the name of the corresponding sched domain. It can appear in
+schedstat version 16 and above, and requires CONFIG_SCHED_DEBUG.)
 
-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+domain<N>[:<name>] <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
 
 The first field is a bit mask indicating what cpus this domain operates over.
 
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index eb0cdcd4d921..bd4ed737e894 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -138,7 +138,11 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		for_each_domain(cpu, sd) {
 			enum cpu_idle_type itype;
 
-			seq_printf(seq, "domain%d %*pb", dcount++,
+			seq_printf(seq, "domain%d", dcount++);
+#ifdef CONFIG_SCHED_DEBUG
+			seq_printf(seq, ":%s", sd->name);
+#endif
+			seq_printf(seq, " %*pb",
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
 				seq_printf(seq, " %u %u %u %u %u %u %u %u",
-- 
2.46.0



* [PATCH 2/5] perf sched stats: Add record and rawdump support
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
  2024-09-16 16:47 ` [PATCH 1/5] sched/stats: Print domain name in /proc/schedstat Ravi Bangoria
@ 2024-09-16 16:47 ` Ravi Bangoria
  2024-09-17 10:35   ` James Clark
  2024-09-26  6:12   ` Namhyung Kim
  2024-09-16 16:47 ` [PATCH 3/5] perf sched stats: Add schedstat v16 support Ravi Bangoria
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Ravi Bangoria @ 2024-09-16 16:47 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

From: Swapnil Sapkal <swapnil.sapkal@amd.com>

Define new, perf tool only, sample types and their layouts. Add logic
to parse /proc/schedstat, convert it to the perf sample format and save
the samples to a perf.data file with the `perf sched stats record` command.
Also add logic to read the perf.data file, interpret schedstat samples and
print a rawdump of the samples with `perf script -D`.

Note that the /proc/schedstat file output is standardized with a version
number. This patch supports v15, but older or newer versions can be added
easily.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 tools/lib/perf/Documentation/libperf.txt      |   2 +
 tools/lib/perf/Makefile                       |   2 +-
 tools/lib/perf/include/perf/event.h           |  42 +++
 .../lib/perf/include/perf/schedstat-cpu-v15.h |  13 +
 .../perf/include/perf/schedstat-domain-v15.h  |  40 +++
 tools/perf/builtin-inject.c                   |   2 +
 tools/perf/builtin-sched.c                    | 222 +++++++++++++++-
 tools/perf/util/event.c                       |  98 +++++++
 tools/perf/util/event.h                       |   2 +
 tools/perf/util/session.c                     |  20 ++
 tools/perf/util/synthetic-events.c            | 249 ++++++++++++++++++
 tools/perf/util/synthetic-events.h            |   3 +
 tools/perf/util/tool.c                        |  20 ++
 tools/perf/util/tool.h                        |   4 +-
 14 files changed, 716 insertions(+), 3 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h

diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
index fcfb9499ef9c..39c78682ad2e 100644
--- a/tools/lib/perf/Documentation/libperf.txt
+++ b/tools/lib/perf/Documentation/libperf.txt
@@ -211,6 +211,8 @@ SYNOPSIS
   struct perf_record_time_conv;
   struct perf_record_header_feature;
   struct perf_record_compressed;
+  struct perf_record_schedstat_cpu;
+  struct perf_record_schedstat_domain;
 --
 
 DESCRIPTION
diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
index 3a9b2140aa04..ebbfea891a6a 100644
--- a/tools/lib/perf/Makefile
+++ b/tools/lib/perf/Makefile
@@ -187,7 +187,7 @@ install_lib: libs
 		$(call do_install_mkdir,$(libdir_SQ)); \
 		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
 
-HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
+HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
 INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
 
 INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 37bb7771d914..35be296d68d5 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -457,6 +457,44 @@ struct perf_record_compressed {
 	char			 data[];
 };
 
+struct perf_record_schedstat_cpu_v15 {
+#define CPU_FIELD(_type, _name, _ver)		_type _name;
+#include "schedstat-cpu-v15.h"
+#undef CPU_FIELD
+};
+
+struct perf_record_schedstat_cpu {
+	struct perf_event_header header;
+	__u16			 version;
+	__u64			 timestamp;
+	__u32			 cpu;
+	union {
+		struct perf_record_schedstat_cpu_v15 v15;
+	};
+};
+
+struct perf_record_schedstat_domain_v15 {
+#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
+#include "schedstat-domain-v15.h"
+#undef DOMAIN_FIELD
+};
+
+#define DOMAIN_NAME_LEN		16
+
+struct perf_record_schedstat_domain {
+	struct perf_event_header header;
+	__u16			 version;
+	__u64			 timestamp;
+	__u32			 cpu;
+	__u16			 domain;
+	char			 name[DOMAIN_NAME_LEN];
+	union {
+		struct perf_record_schedstat_domain_v15 v15;
+	};
+	__u16			 nr_cpus;
+	__u8			 cpu_mask[];
+};
+
 enum perf_user_event_type { /* above any possible kernel type */
 	PERF_RECORD_USER_TYPE_START		= 64,
 	PERF_RECORD_HEADER_ATTR			= 64,
@@ -478,6 +516,8 @@ enum perf_user_event_type { /* above any possible kernel type */
 	PERF_RECORD_HEADER_FEATURE		= 80,
 	PERF_RECORD_COMPRESSED			= 81,
 	PERF_RECORD_FINISHED_INIT		= 82,
+	PERF_RECORD_SCHEDSTAT_CPU		= 83,
+	PERF_RECORD_SCHEDSTAT_DOMAIN		= 84,
 	PERF_RECORD_HEADER_MAX
 };
 
@@ -518,6 +558,8 @@ union perf_event {
 	struct perf_record_time_conv		time_conv;
 	struct perf_record_header_feature	feat;
 	struct perf_record_compressed		pack;
+	struct perf_record_schedstat_cpu	schedstat_cpu;
+	struct perf_record_schedstat_domain	schedstat_domain;
 };
 
 #endif /* __LIBPERF_EVENT_H */
diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v15.h b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
new file mode 100644
index 000000000000..8e4355ee3705
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CPU_FIELD
+CPU_FIELD(__u32, yld_count, v15)
+CPU_FIELD(__u32, array_exp, v15)
+CPU_FIELD(__u32, sched_count, v15)
+CPU_FIELD(__u32, sched_goidle, v15)
+CPU_FIELD(__u32, ttwu_count, v15)
+CPU_FIELD(__u32, ttwu_local, v15)
+CPU_FIELD(__u64, rq_cpu_time, v15)
+CPU_FIELD(__u64, run_delay, v15)
+CPU_FIELD(__u64, pcount, v15)
+#endif
diff --git a/tools/lib/perf/include/perf/schedstat-domain-v15.h b/tools/lib/perf/include/perf/schedstat-domain-v15.h
new file mode 100644
index 000000000000..422e713d617a
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-domain-v15.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef DOMAIN_FIELD
+DOMAIN_FIELD(__u32, idle_lb_count, v15)
+DOMAIN_FIELD(__u32, idle_lb_balanced, v15)
+DOMAIN_FIELD(__u32, idle_lb_failed, v15)
+DOMAIN_FIELD(__u32, idle_lb_imbalance, v15)
+DOMAIN_FIELD(__u32, idle_lb_gained, v15)
+DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15)
+DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15)
+DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15)
+DOMAIN_FIELD(__u32, busy_lb_count, v15)
+DOMAIN_FIELD(__u32, busy_lb_balanced, v15)
+DOMAIN_FIELD(__u32, busy_lb_failed, v15)
+DOMAIN_FIELD(__u32, busy_lb_imbalance, v15)
+DOMAIN_FIELD(__u32, busy_lb_gained, v15)
+DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15)
+DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15)
+DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15)
+DOMAIN_FIELD(__u32, newidle_lb_count, v15)
+DOMAIN_FIELD(__u32, newidle_lb_balanced, v15)
+DOMAIN_FIELD(__u32, newidle_lb_failed, v15)
+DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15)
+DOMAIN_FIELD(__u32, newidle_lb_gained, v15)
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15)
+DOMAIN_FIELD(__u32, alb_count, v15)
+DOMAIN_FIELD(__u32, alb_failed, v15)
+DOMAIN_FIELD(__u32, alb_pushed, v15)
+DOMAIN_FIELD(__u32, sbe_count, v15)
+DOMAIN_FIELD(__u32, sbe_balanced, v15)
+DOMAIN_FIELD(__u32, sbe_pushed, v15)
+DOMAIN_FIELD(__u32, sbf_count, v15)
+DOMAIN_FIELD(__u32, sbf_balanced, v15)
+DOMAIN_FIELD(__u32, sbf_pushed, v15)
+DOMAIN_FIELD(__u32, ttwu_wake_remote, v15)
+DOMAIN_FIELD(__u32, ttwu_move_affine, v15)
+DOMAIN_FIELD(__u32, ttwu_move_balance, v15)
+#endif /* DOMAIN_FIELD */
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index d6989195a061..b6eca6e92390 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -2530,6 +2530,8 @@ int cmd_inject(int argc, const char **argv)
 	inject.tool.finished_init	= perf_event__repipe_op2_synth;
 	inject.tool.compressed		= perf_event__repipe_op4_synth;
 	inject.tool.auxtrace		= perf_event__repipe_auxtrace;
+	inject.tool.schedstat_cpu	= perf_event__repipe_op2_synth;
+	inject.tool.schedstat_domain	= perf_event__repipe_op2_synth;
 	inject.tool.dont_split_sample_group = true;
 	inject.session = __perf_session__new(&data, &inject.tool,
 					     /*trace_event_repipe=*/inject.output.is_pipe);
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 5981cc51abc8..6ea0db05aa41 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -27,6 +27,8 @@
 #include "util/debug.h"
 #include "util/event.h"
 #include "util/util.h"
+#include "util/synthetic-events.h"
+#include "util/target.h"
 
 #include <linux/kernel.h>
 #include <linux/log2.h>
@@ -54,6 +56,7 @@
 #define MAX_PRIO		140
 
 static const char *cpu_list;
+static struct perf_cpu_map *user_requested_cpus;
 static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 
 struct sched_atom;
@@ -237,6 +240,9 @@ struct perf_sched {
 	volatile bool   thread_funcs_exit;
 	const char	*prio_str;
 	DECLARE_BITMAP(prio_bitmap, MAX_PRIO);
+
+	struct perf_session *session;
+	struct perf_data *data;
 };
 
 /* per thread run time data */
@@ -3660,6 +3666,195 @@ static void setup_sorting(struct perf_sched *sched, const struct option *options
 	sort_dimension__add("pid", &sched->cmp_pid);
 }
 
+static int process_synthesized_schedstat_event(const struct perf_tool *tool,
+					       union perf_event *event,
+					       struct perf_sample *sample __maybe_unused,
+					       struct machine *machine __maybe_unused)
+{
+	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
+
+	if (perf_data__write(sched->data, event, event->header.size) <= 0) {
+		pr_err("failed to write perf data, error: %m\n");
+		return -1;
+	}
+
+	sched->session->header.data_size += event->header.size;
+	return 0;
+}
+
+static void sighandler(int sig __maybe_unused)
+{
+}
+
+static int enable_sched_schedstats(int *reset)
+{
+	FILE *fp;
+	char ch;
+
+	fp = fopen("/proc/sys/kernel/sched_schedstats", "w+");
+	if (!fp) {
+		pr_err("Failed to open /proc/sys/kernel/sched_schedstats\n");
+		return -1;
+	}
+
+	ch = getc(fp);
+	if (ch == '0') {
+		*reset = 1;
+		rewind(fp);
+		putc('1', fp);
+		fclose(fp);
+	}
+	return 0;
+}
+
+static int disable_sched_schedstat(void)
+{
+	FILE *fp;
+
+	fp = fopen("/proc/sys/kernel/sched_schedstats", "w");
+	if (!fp) {
+		pr_err("Failed to open /proc/sys/kernel/sched_schedstats\n");
+		return -1;
+	}
+
+	putc('0', fp);
+	fclose(fp);
+	return 0;
+}
+
+/* perf.data or any other output file name used by stats subcommand (only). */
+const char *output_name;
+
+static int perf_sched__schedstat_record(struct perf_sched *sched,
+					int argc, const char **argv)
+{
+	struct perf_session *session;
+	struct evlist *evlist;
+	struct target *target;
+	int reset = 0;
+	int err = 0;
+	int fd;
+	struct perf_data data = {
+		.path  = output_name,
+		.mode  = PERF_DATA_MODE_WRITE,
+	};
+
+	signal(SIGINT, sighandler);
+	signal(SIGCHLD, sighandler);
+	signal(SIGTERM, sighandler);
+
+	evlist = evlist__new();
+	if (!evlist)
+		return -ENOMEM;
+
+	session = perf_session__new(&data, &sched->tool);
+	if (IS_ERR(session)) {
+		pr_err("Perf session creation failed.\n");
+		return PTR_ERR(session);
+	}
+
+	session->evlist = evlist;
+
+	sched->session = session;
+	sched->data = &data;
+
+	fd = perf_data__fd(&data);
+
+	/*
+	 * Capture all important metadata about the system. Although they are
+	 * not used by `perf sched stats` tool directly, they provide useful
+	 * information about profiled environment.
+	 */
+	perf_header__set_feat(&session->header, HEADER_HOSTNAME);
+	perf_header__set_feat(&session->header, HEADER_OSRELEASE);
+	perf_header__set_feat(&session->header, HEADER_VERSION);
+	perf_header__set_feat(&session->header, HEADER_ARCH);
+	perf_header__set_feat(&session->header, HEADER_NRCPUS);
+	perf_header__set_feat(&session->header, HEADER_CPUDESC);
+	perf_header__set_feat(&session->header, HEADER_CPUID);
+	perf_header__set_feat(&session->header, HEADER_TOTAL_MEM);
+	perf_header__set_feat(&session->header, HEADER_CMDLINE);
+	perf_header__set_feat(&session->header, HEADER_CPU_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_NUMA_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_CACHE);
+	perf_header__set_feat(&session->header, HEADER_MEM_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_CPU_PMU_CAPS);
+	perf_header__set_feat(&session->header, HEADER_HYBRID_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_PMU_CAPS);
+
+	err = perf_session__write_header(session, evlist, fd, false);
+	if (err < 0)
+		goto out;
+
+	/*
+	 * `perf sched stats` does not support workload profiling (-p pid)
+	 * since /proc/schedstat file contains cpu specific data only. Hence, a
+	 * profile target is either set of cpus or systemwide, never a process.
+	 * Note that, although `-- <workload>` is supported, profile data are
+	 * still cpu/systemwide.
+	 */
+	target = zalloc(sizeof(struct target));
+	if (cpu_list)
+		target->cpu_list = cpu_list;
+	else
+		target->system_wide = true;
+
+	if (argc) {
+		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
+		if (err)
+			goto out_target;
+	}
+
+	if (cpu_list) {
+		user_requested_cpus = perf_cpu_map__new(cpu_list);
+		if (!user_requested_cpus) {
+			err = -ENOMEM;
+			goto out_target;
+		}
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_schedstat_event,
+					       user_requested_cpus);
+	if (err < 0)
+		goto out_target;
+
+	err = enable_sched_schedstats(&reset);
+	if (err < 0)
+		goto out_target;
+
+	if (argc)
+		evlist__start_workload(evlist);
+
+	/* wait for signal */
+	pause();
+
+	if (reset) {
+		err = disable_sched_schedstat();
+		if (err < 0)
+			goto out_target;
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_schedstat_event,
+					       user_requested_cpus);
+	if (err < 0)
+		goto out_target;
+
+	err = perf_session__write_header(session, evlist, fd, true);
+
+out_target:
+	free(target);
+out:
+	if (!err)
+		fprintf(stderr, "[ perf sched stats: Wrote samples to %s ]\n", data.path);
+	else
+		fprintf(stderr, "[ perf sched stats: Failed !! ]\n");
+
+	close(fd);
+	perf_session__delete(session);
+
+	return err;
+}
+
 static bool schedstat_events_exposed(void)
 {
 	/*
@@ -3835,6 +4030,12 @@ int cmd_sched(int argc, const char **argv)
 		   "analyze events only for given task priority(ies)"),
 	OPT_PARENT(sched_options)
 	};
+	const struct option stats_options[] = {
+	OPT_STRING('o', "output", &output_name, "file",
+		   "`stats record` with output filename"),
+	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
+	OPT_END()
+	};
 
 	const char * const latency_usage[] = {
 		"perf sched latency [<options>]",
@@ -3852,9 +4053,13 @@ int cmd_sched(int argc, const char **argv)
 		"perf sched timehist [<options>]",
 		NULL
 	};
+	const char * const stats_usage[] = {
+		"perf sched stats {record} [<options>]",
+		NULL
+	};
 	const char *const sched_subcommands[] = { "record", "latency", "map",
 						  "replay", "script",
-						  "timehist", NULL };
+						  "timehist", "stats", NULL };
 	const char *sched_usage[] = {
 		NULL,
 		NULL
@@ -3950,6 +4155,21 @@ int cmd_sched(int argc, const char **argv)
 			return ret;
 
 		return perf_sched__timehist(&sched);
+	} else if (!strcmp(argv[0], "stats")) {
+		const char *const stats_subcommands[] = {"record", NULL};
+
+		argc = parse_options_subcommand(argc, argv, stats_options,
+						stats_subcommands,
+						stats_usage,
+						PARSE_OPT_STOP_AT_NON_OPTION);
+
+		if (argv[0] && !strcmp(argv[0], "record")) {
+			if (argc)
+				argc = parse_options(argc, argv, stats_options,
+						     stats_usage, 0);
+			return perf_sched__schedstat_record(&sched, argc, argv);
+		}
+		usage_with_options(stats_usage, stats_options);
 	} else {
 		usage_with_options(sched_usage, sched_options);
 	}
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d1917..c9bc8237e3fa 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -77,6 +77,8 @@ static const char *perf_event__names[] = {
 	[PERF_RECORD_HEADER_FEATURE]		= "FEATURE",
 	[PERF_RECORD_COMPRESSED]		= "COMPRESSED",
 	[PERF_RECORD_FINISHED_INIT]		= "FINISHED_INIT",
+	[PERF_RECORD_SCHEDSTAT_CPU]		= "SCHEDSTAT_CPU",
+	[PERF_RECORD_SCHEDSTAT_DOMAIN]		= "SCHEDSTAT_DOMAIN",
 };
 
 const char *perf_event__name(unsigned int id)
@@ -550,6 +552,102 @@ size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *ma
 	return ret;
 }
 
+size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
+{
+	struct perf_record_schedstat_cpu *cs = &event->schedstat_cpu;
+	__u16 version = cs->version;
+	size_t size = 0;
+
+	size = fprintf(fp, "\ncpu%u ", cs->cpu);
+
+#define CPU_FIELD(_type, _name, _ver)						\
+	size += fprintf(fp, "%" PRIu64 " ", (uint64_t)cs->_ver._name);	\
+
+	if (version == 15) {
+#include <perf/schedstat-cpu-v15.h>
+		return size;
+	}
+#undef CPU_FIELD
+
+	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
+		       event->schedstat_cpu.version);
+}
+
+size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
+{
+	struct perf_record_schedstat_domain *ds = &event->schedstat_domain;
+	__u16 version = ds->version;
+	size_t cpu_mask_len_2;
+	size_t cpu_mask_len;
+	size_t size = 0;
+	char *cpu_mask;
+	int idx;
+	int i, j;
+	bool low;
+
+	if (ds->name[0])
+		size = fprintf(fp, "\ndomain%u:%s ", ds->domain, ds->name);
+	else
+		size = fprintf(fp, "\ndomain%u ", ds->domain);
+
+	cpu_mask_len = ((ds->nr_cpus + 3) >> 2);
+	cpu_mask_len_2 = cpu_mask_len + ((cpu_mask_len - 1) / 8);
+
+	cpu_mask = zalloc(cpu_mask_len_2 + 1);
+	if (!cpu_mask)
+		return fprintf(fp, "Cannot allocate memory for cpumask\n");
+
+	idx = ((ds->nr_cpus + 7) >> 3) - 1;
+
+	i = cpu_mask_len_2 - 1;
+
+	low = true;
+	j = 1;
+	while (i >= 0) {
+		__u8 m;
+
+		if (low)
+			m = ds->cpu_mask[idx] & 0xf;
+		else
+			m = (ds->cpu_mask[idx] & 0xf0) >> 4;
+
+		/* m is a masked nibble (0-15), so two cases suffice. */
+		if (m <= 9)
+			m += '0';
+		else
+			m = m + 'a' - 10;
+
+		cpu_mask[i] = m;
+
+		if (j == 8 && i != 0) {
+			cpu_mask[i - 1] = ',';
+			j = 0;
+			i--;
+		}
+
+		if (!low)
+			idx--;
+		low = !low;
+		i--;
+		j++;
+	}
+	size += fprintf(fp, "%s ", cpu_mask);
+	free(cpu_mask);
+
+#define DOMAIN_FIELD(_type, _name, _ver)					\
+	size += fprintf(fp, "%" PRIu64 " ", (uint64_t)ds->_ver._name);	\
+
+	if (version == 15) {
+#include <perf/schedstat-domain-v15.h>
+		return size;
+	}
+#undef DOMAIN_FIELD
+
+	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
+		       event->schedstat_domain.version);
+}
+
 size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp)
 {
 	size_t ret = fprintf(fp, "PERF_RECORD_%s",
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index f8742e6230a5..96e7cd1d282d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -360,6 +360,8 @@ size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *machine,FILE *fp);
+size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp);
 
 int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index dbaf07bf6c5f..a929a63e1827 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -691,6 +691,20 @@ static void perf_event__time_conv_swap(union perf_event *event,
 	}
 }
 
+static void
+perf_event__schedstat_cpu_swap(union perf_event *event __maybe_unused,
+			       bool sample_id_all __maybe_unused)
+{
+	/* FIXME */
+}
+
+static void
+perf_event__schedstat_domain_swap(union perf_event *event __maybe_unused,
+				  bool sample_id_all __maybe_unused)
+{
+	/* FIXME */
+}
+
 typedef void (*perf_event__swap_op)(union perf_event *event,
 				    bool sample_id_all);
 
@@ -729,6 +743,8 @@ static perf_event__swap_op perf_event__swap_ops[] = {
 	[PERF_RECORD_STAT_ROUND]	  = perf_event__stat_round_swap,
 	[PERF_RECORD_EVENT_UPDATE]	  = perf_event__event_update_swap,
 	[PERF_RECORD_TIME_CONV]		  = perf_event__time_conv_swap,
+	[PERF_RECORD_SCHEDSTAT_CPU]	  = perf_event__schedstat_cpu_swap,
+	[PERF_RECORD_SCHEDSTAT_DOMAIN]	  = perf_event__schedstat_domain_swap,
 	[PERF_RECORD_HEADER_MAX]	  = NULL,
 };
 
@@ -1444,6 +1460,10 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 		return err;
 	case PERF_RECORD_FINISHED_INIT:
 		return tool->finished_init(session, event);
+	case PERF_RECORD_SCHEDSTAT_CPU:
+		return tool->schedstat_cpu(session, event);
+	case PERF_RECORD_SCHEDSTAT_DOMAIN:
+		return tool->schedstat_domain(session, event);
 	default:
 		return -EINVAL;
 	}
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index a58444c4aed1..9d8450b6eda9 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2507,3 +2507,252 @@ int parse_synth_opt(char *synth)
 
 	return ret;
 }
+
+static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version,
+						    int *cpu, __u64 timestamp)
+{
+	struct perf_record_schedstat_cpu *cs;
+	union perf_event *event;
+	size_t size;
+	char ch;
+
+	size = sizeof(struct perf_record_schedstat_cpu);
+	size = PERF_ALIGN(size, sizeof(u64));
+	event = zalloc(size);
+
+	if (!event)
+		return NULL;
+
+	cs = &event->schedstat_cpu;
+	cs->header.type = PERF_RECORD_SCHEDSTAT_CPU;
+	cs->version = version;
+	cs->timestamp = timestamp;
+	cs->header.size = size;
+
+	if (io__get_char(io) != 'p' || io__get_char(io) != 'u')
+		goto out_cpu;
+
+	if (io__get_dec(io, (__u64 *)&cs->cpu) != ' ')
+		goto out_cpu;
+
+#define CPU_FIELD(_type, _name, _ver)					\
+	do {								\
+		__u64 _tmp;						\
+		ch = io__get_dec(io, &_tmp);				\
+		if (ch != ' ' && ch != '\n')				\
+			goto out_cpu;					\
+		cs->_ver._name = _tmp;					\
+	} while (0);
+
+	if (version == 15) {
+#include <perf/schedstat-cpu-v15.h>
+	}
+#undef CPU_FIELD
+
+	*cpu = cs->cpu;
+	return event;
+
+out_cpu:
+	free(event);
+	return NULL;
+}
+
+static size_t schedstat_sanitize_cpumask(char *cpu_mask, size_t cpu_mask_len)
+{
+	char *dst = cpu_mask;
+	char *src = cpu_mask;
+	int i = 0;
+
+	for ( ; src < cpu_mask + cpu_mask_len; dst++, src++) {
+		while (*src == ',')
+			src++;
+
+		*dst = *src;
+	}
+
+	for ( ; dst < src; dst++, i++)
+		*dst = '\0';
+
+	return cpu_mask_len - i;
+}
+
+static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 version,
+						       int cpu, __u64 timestamp)
+{
+	struct perf_env env = { .total_mem = 0, };
+	int nr_cpus_avail = perf_env__nr_cpus_avail(&env);
+	struct perf_record_schedstat_domain *ds;
+	union perf_event *event;
+	size_t d_name_len = 0;
+	char *d_name = NULL;
+	size_t cpu_mask_len = 0;
+	char *cpu_mask = NULL;
+	__u64 d_num;
+	size_t size;
+	int i = 0;
+	bool low;
+	char ch;
+	int idx;
+
+	if (io__get_char(io) != 'o' || io__get_char(io) != 'm' || io__get_char(io) != 'a' ||
+	    io__get_char(io) != 'i' || io__get_char(io) != 'n')
+		return NULL;
+
+	ch = io__get_dec(io, &d_num);
+	if (ch == ':') {
+		if (io__getdelim(io, &d_name, &d_name_len, ' ') < 0 || !d_name_len)
+			return NULL;
+		d_name[d_name_len - 1] = '\0';
+		d_name_len--;
+	} else if (ch != ' ') {
+		return NULL;
+	}
+
+	if (io__getdelim(io, &cpu_mask, &cpu_mask_len, ' ') < 0 || !cpu_mask_len)
+		goto out;
+
+	cpu_mask[cpu_mask_len - 1] = '\0';
+	cpu_mask_len--;
+	cpu_mask_len = schedstat_sanitize_cpumask(cpu_mask, cpu_mask_len);
+
+	size = sizeof(struct perf_record_schedstat_domain) + ((nr_cpus_avail + 7) >> 3);
+	size = PERF_ALIGN(size, sizeof(u64));
+	event = zalloc(size);
+
+	if (!event)
+		goto out_cpu_mask;
+
+	ds = &event->schedstat_domain;
+	ds->header.type = PERF_RECORD_SCHEDSTAT_DOMAIN;
+	ds->header.size = size;
+	ds->version = version;
+	ds->timestamp = timestamp;
+	if (d_name)
+		strncpy(ds->name, d_name, DOMAIN_NAME_LEN - 1);
+	ds->domain = d_num;
+	ds->nr_cpus = nr_cpus_avail;
+
+	idx = ((nr_cpus_avail + 7) >> 3) - 1;
+	low = true;
+	for (i = cpu_mask_len - 1; i >= 0 && idx >= 0; i--) {
+		char mask = cpu_mask[i];
+
+		if (mask >= '0' && mask <= '9')
+			mask -= '0';
+		else if (mask >= 'a' && mask <= 'f')
+			mask = mask - 'a' + 10;
+		else if (mask >= 'A' && mask <= 'F')
+			mask = mask - 'A' + 10;
+
+		if (low) {
+			ds->cpu_mask[idx] = mask;
+		} else {
+			ds->cpu_mask[idx] |= (mask << 4);
+			idx--;
+		}
+		low = !low;
+	}
+
+	free(d_name);
+	free(cpu_mask);
+
+#define DOMAIN_FIELD(_type, _name, _ver)				\
+	do {								\
+		__u64 _tmp;						\
+		ch = io__get_dec(io, &_tmp);				\
+		if (ch != ' ' && ch != '\n')				\
+			goto out_domain;				\
+		ds->_ver._name = _tmp;					\
+	} while (0);
+
+	if (version == 15) {
+#include <perf/schedstat-domain-v15.h>
+	}
+#undef DOMAIN_FIELD
+
+	ds->cpu = cpu;
+	return event;
+
+out_domain:
+	free(event);
+out_cpu_mask:
+	free(cpu_mask);
+out:
+	free(d_name);
+	return NULL;
+}
+
+int perf_event__synthesize_schedstat(const struct perf_tool *tool,
+				     perf_event__handler_t process,
+				     struct perf_cpu_map *user_requested_cpus)
+{
+	union perf_event *event = NULL;
+	size_t line_len = 0;
+	char *line = NULL;
+	char bf[BUFSIZ];
+	__u64 timestamp;
+	__u16 version;
+	struct io io;
+	int ret = -1;
+	int cpu = -1;
+	char ch;
+
+	io.fd = open("/proc/schedstat", O_RDONLY, 0);
+	if (io.fd < 0) {
+		pr_err("Failed to open /proc/schedstat\n");
+		return -1;
+	}
+	io__init(&io, io.fd, bf, sizeof(bf));
+
+	if (io__getline(&io, &line, &line_len) < 0 || !line_len)
+		goto out;
+
+	if (!strcmp(line, "version 15\n")) {
+		version = 15;
+	} else {
+		pr_err("Unsupported /proc/schedstat version: %s", line + 8);
+		goto out_free_line;
+	}
+
+	if (io__getline(&io, &line, &line_len) < 0 || !line_len)
+		goto out_free_line;
+	timestamp = atol(line + 10);
+
+	/*
+	 * FIXME: Can be optimized a bit by not synthesizing domain samples
+	 * for filtered out cpus.
+	 */
+	for (ch = io__get_char(&io); !io.eof; ch = io__get_char(&io)) {
+		struct perf_cpu this_cpu;
+
+		if (ch == 'c') {
+			event = __synthesize_schedstat_cpu(&io, version,
+							   &cpu, timestamp);
+		} else if (ch == 'd') {
+			event = __synthesize_schedstat_domain(&io, version,
+							      cpu, timestamp);
+		} else {
+			/* Unknown line prefix; treat it as a parse error. */
+			goto out_free_line;
+		}
+		if (!event)
+			goto out_free_line;
+
+		this_cpu.cpu = cpu;
+
+		if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu)) {
+			free(event);
+			continue;
+		}
+
+		if (process(tool, event, NULL, NULL) < 0) {
+			free(event);
+			goto out_free_line;
+		}
+
+		free(event);
+	}
+
+	ret = 0;
+
+out_free_line:
+	free(line);
+out:
+	close(io.fd);
+	return ret;
+}
diff --git a/tools/perf/util/synthetic-events.h b/tools/perf/util/synthetic-events.h
index b9c936b5cfeb..eab914c238df 100644
--- a/tools/perf/util/synthetic-events.h
+++ b/tools/perf/util/synthetic-events.h
@@ -141,4 +141,7 @@ int perf_event__synthesize_for_pipe(const struct perf_tool *tool,
 				    struct perf_data *data,
 				    perf_event__handler_t process);
 
+int perf_event__synthesize_schedstat(const struct perf_tool *tool,
+				     perf_event__handler_t process,
+				     struct perf_cpu_map *user_requested_cpu);
 #endif // __PERF_SYNTHETIC_EVENTS_H
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index 3b7f390f26eb..9f81d720735f 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -230,6 +230,24 @@ static int perf_session__process_compressed_event_stub(struct perf_session *sess
 	return 0;
 }
 
+static int process_schedstat_cpu_stub(struct perf_session *perf_session __maybe_unused,
+				      union perf_event *event)
+{
+	if (dump_trace)
+		perf_event__fprintf_schedstat_cpu(event, stdout);
+	dump_printf(": unhandled!\n");
+	return 0;
+}
+
+static int process_schedstat_domain_stub(struct perf_session *perf_session __maybe_unused,
+					 union perf_event *event)
+{
+	if (dump_trace)
+		perf_event__fprintf_schedstat_domain(event, stdout);
+	dump_printf(": unhandled!\n");
+	return 0;
+}
+
 void perf_tool__init(struct perf_tool *tool, bool ordered_events)
 {
 	tool->ordered_events = ordered_events;
@@ -286,6 +304,8 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
 	tool->compressed = perf_session__process_compressed_event_stub;
 #endif
 	tool->finished_init = process_event_op2_stub;
+	tool->schedstat_cpu = process_schedstat_cpu_stub;
+	tool->schedstat_domain = process_schedstat_domain_stub;
 }
 
 bool perf_tool__compressed_is_stub(const struct perf_tool *tool)
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index db1c7642b0d1..d289a5396b01 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -77,7 +77,9 @@ struct perf_tool {
 			stat,
 			stat_round,
 			feature,
-			finished_init;
+			finished_init,
+			schedstat_cpu,
+			schedstat_domain;
 	event_op4	compressed;
 	event_op3	auxtrace;
 	bool		ordered_events;
-- 
2.46.0



* [PATCH 3/5] perf sched stats: Add schedstat v16 support
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
  2024-09-16 16:47 ` [PATCH 1/5] sched/stats: Print domain name in /proc/schedstat Ravi Bangoria
  2024-09-16 16:47 ` [PATCH 2/5] perf sched stats: Add record and rawdump support Ravi Bangoria
@ 2024-09-16 16:47 ` Ravi Bangoria
  2024-09-26  6:14   ` Namhyung Kim
  2024-09-16 16:47 ` [PATCH 4/5] perf sched stats: Add support for report subcommand Ravi Bangoria
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Ravi Bangoria @ 2024-09-16 16:47 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

From: Swapnil Sapkal <swapnil.sapkal@amd.com>

/proc/schedstat file output is standardized with a version number. Add
support for recording and raw-dumping the v16 layout.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 tools/lib/perf/Makefile                       |  2 +-
 tools/lib/perf/include/perf/event.h           | 14 +++++++
 .../lib/perf/include/perf/schedstat-cpu-v16.h | 13 ++++++
 .../perf/include/perf/schedstat-domain-v16.h  | 40 +++++++++++++++++++
 tools/perf/util/event.c                       |  6 +++
 tools/perf/util/synthetic-events.c            |  6 +++
 6 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v16.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v16.h

diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
index ebbfea891a6a..de0f4ffd9e16 100644
--- a/tools/lib/perf/Makefile
+++ b/tools/lib/perf/Makefile
@@ -187,7 +187,7 @@ install_lib: libs
 		$(call do_install_mkdir,$(libdir_SQ)); \
 		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
 
-HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
+HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h schedstat-cpu-v16.h schedstat-domain-v16.h
 INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
 
 INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 35be296d68d5..c332d467c9c9 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -463,6 +463,12 @@ struct perf_record_schedstat_cpu_v15 {
 #undef CPU_FIELD
 };
 
+struct perf_record_schedstat_cpu_v16 {
+#define CPU_FIELD(_type, _name, _ver)		_type _name;
+#include "schedstat-cpu-v16.h"
+#undef CPU_FIELD
+};
+
 struct perf_record_schedstat_cpu {
 	struct perf_event_header header;
 	__u16			 version;
@@ -470,6 +476,7 @@ struct perf_record_schedstat_cpu {
 	__u32			 cpu;
 	union {
 		struct perf_record_schedstat_cpu_v15 v15;
+		struct perf_record_schedstat_cpu_v16 v16;
 	};
 };
 
@@ -479,6 +486,12 @@ struct perf_record_schedstat_domain_v15 {
 #undef DOMAIN_FIELD
 };
 
+struct perf_record_schedstat_domain_v16 {
+#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
+#include "schedstat-domain-v16.h"
+#undef DOMAIN_FIELD
+};
+
 #define DOMAIN_NAME_LEN		16
 
 struct perf_record_schedstat_domain {
@@ -490,6 +503,7 @@ struct perf_record_schedstat_domain {
 	char			 name[DOMAIN_NAME_LEN];
 	union {
 		struct perf_record_schedstat_domain_v15 v15;
+		struct perf_record_schedstat_domain_v16 v16;
 	};
 	__u16			 nr_cpus;
 	__u8			 cpu_mask[];
diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v16.h b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
new file mode 100644
index 000000000000..f3a55131a05a
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CPU_FIELD
+CPU_FIELD(__u32, yld_count, v16)
+CPU_FIELD(__u32, array_exp, v16)
+CPU_FIELD(__u32, sched_count, v16)
+CPU_FIELD(__u32, sched_goidle, v16)
+CPU_FIELD(__u32, ttwu_count, v16)
+CPU_FIELD(__u32, ttwu_local, v16)
+CPU_FIELD(__u64, rq_cpu_time, v16)
+CPU_FIELD(__u64, run_delay, v16)
+CPU_FIELD(__u64, pcount, v16)
+#endif /* CPU_FIELD */
diff --git a/tools/lib/perf/include/perf/schedstat-domain-v16.h b/tools/lib/perf/include/perf/schedstat-domain-v16.h
new file mode 100644
index 000000000000..d6ef895c9d32
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-domain-v16.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef DOMAIN_FIELD
+DOMAIN_FIELD(__u32, busy_lb_count, v16)
+DOMAIN_FIELD(__u32, busy_lb_balanced, v16)
+DOMAIN_FIELD(__u32, busy_lb_failed, v16)
+DOMAIN_FIELD(__u32, busy_lb_imbalance, v16)
+DOMAIN_FIELD(__u32, busy_lb_gained, v16)
+DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16)
+DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16)
+DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16)
+DOMAIN_FIELD(__u32, idle_lb_count, v16)
+DOMAIN_FIELD(__u32, idle_lb_balanced, v16)
+DOMAIN_FIELD(__u32, idle_lb_failed, v16)
+DOMAIN_FIELD(__u32, idle_lb_imbalance, v16)
+DOMAIN_FIELD(__u32, idle_lb_gained, v16)
+DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16)
+DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16)
+DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16)
+DOMAIN_FIELD(__u32, newidle_lb_count, v16)
+DOMAIN_FIELD(__u32, newidle_lb_balanced, v16)
+DOMAIN_FIELD(__u32, newidle_lb_failed, v16)
+DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16)
+DOMAIN_FIELD(__u32, newidle_lb_gained, v16)
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16)
+DOMAIN_FIELD(__u32, alb_count, v16)
+DOMAIN_FIELD(__u32, alb_failed, v16)
+DOMAIN_FIELD(__u32, alb_pushed, v16)
+DOMAIN_FIELD(__u32, sbe_count, v16)
+DOMAIN_FIELD(__u32, sbe_balanced, v16)
+DOMAIN_FIELD(__u32, sbe_pushed, v16)
+DOMAIN_FIELD(__u32, sbf_count, v16)
+DOMAIN_FIELD(__u32, sbf_balanced, v16)
+DOMAIN_FIELD(__u32, sbf_pushed, v16)
+DOMAIN_FIELD(__u32, ttwu_wake_remote, v16)
+DOMAIN_FIELD(__u32, ttwu_move_affine, v16)
+DOMAIN_FIELD(__u32, ttwu_move_balance, v16)
+#endif /* DOMAIN_FIELD */
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index c9bc8237e3fa..d138e4a5787c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -566,6 +566,9 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
 	if (version == 15) {
 #include <perf/schedstat-cpu-v15.h>
 		return size;
+	} else if (version == 16) {
+#include <perf/schedstat-cpu-v16.h>
+		return size;
 	}
 #undef CPU_FIELD
 
@@ -641,6 +644,9 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
 	if (version == 15) {
 #include <perf/schedstat-domain-v15.h>
 		return size;
+	} else if (version == 16) {
+#include <perf/schedstat-domain-v16.h>
+		return size;
 	}
 #undef DOMAIN_FIELD
 
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 9d8450b6eda9..73b2492a4cde 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2546,6 +2546,8 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
 
 	if (version == 15) {
 #include <perf/schedstat-cpu-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-cpu-v16.h>
 	}
 #undef CPU_FIELD
 
@@ -2667,6 +2669,8 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 
 	if (version == 15) {
 #include <perf/schedstat-domain-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-domain-v16.h>
 	}
 #undef DOMAIN_FIELD
 
@@ -2709,6 +2713,8 @@ int perf_event__synthesize_schedstat(const struct perf_tool *tool,
 
 	if (!strcmp(line, "version 15\n")) {
 		version = 15;
+	} else if (!strcmp(line, "version 16\n")) {
+		version = 16;
 	} else {
 		pr_err("Unsupported /proc/schedstat version: %s", line + 8);
 		goto out_free_line;
-- 
2.46.0



* [PATCH 4/5] perf sched stats: Add support for report subcommand
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
                   ` (2 preceding siblings ...)
  2024-09-16 16:47 ` [PATCH 3/5] perf sched stats: Add schedstat v16 support Ravi Bangoria
@ 2024-09-16 16:47 ` Ravi Bangoria
  2024-09-16 16:47 ` [PATCH 5/5] perf sched stats: Add support for live mode Ravi Bangoria
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Ravi Bangoria @ 2024-09-16 16:47 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

From: Swapnil Sapkal <swapnil.sapkal@amd.com>

`perf sched stats record` captures two sets of samples. For a workload
profile, the first set is taken right before the workload starts and
the second right after it finishes. For a systemwide profile, the first
set is taken at the beginning of the profile and the second on
receiving the SIGINT signal.

Add a `perf sched stats report` subcommand that reads both sets of
samples, computes their difference and renders a final report. The
final report prints scheduler stats at cpu granularity as well as at
sched domain granularity.

Example usage:

  # perf sched stats record
  # perf sched stats report

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 tools/lib/perf/include/perf/event.h           |   8 +-
 .../lib/perf/include/perf/schedstat-cpu-v15.h |  27 +-
 .../lib/perf/include/perf/schedstat-cpu-v16.h |  27 +-
 .../perf/include/perf/schedstat-domain-v15.h  | 153 ++++--
 .../perf/include/perf/schedstat-domain-v16.h  | 153 ++++--
 tools/perf/builtin-sched.c                    | 475 +++++++++++++++++-
 tools/perf/util/event.c                       |   4 +-
 tools/perf/util/synthetic-events.c            |   4 +-
 8 files changed, 751 insertions(+), 100 deletions(-)

diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index c332d467c9c9..dca03b585766 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -458,13 +458,13 @@ struct perf_record_compressed {
 };
 
 struct perf_record_schedstat_cpu_v15 {
-#define CPU_FIELD(_type, _name, _ver)		_type _name;
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	_type _name;
 #include "schedstat-cpu-v15.h"
 #undef CPU_FIELD
 };
 
 struct perf_record_schedstat_cpu_v16 {
-#define CPU_FIELD(_type, _name, _ver)		_type _name;
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	_type _name;
 #include "schedstat-cpu-v16.h"
 #undef CPU_FIELD
 };
@@ -481,13 +481,13 @@ struct perf_record_schedstat_cpu {
 };
 
 struct perf_record_schedstat_domain_v15 {
-#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	_type _name;
 #include "schedstat-domain-v15.h"
 #undef DOMAIN_FIELD
 };
 
 struct perf_record_schedstat_domain_v16 {
-#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	_type _name;
 #include "schedstat-domain-v16.h"
 #undef DOMAIN_FIELD
 };
diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v15.h b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
index 8e4355ee3705..f1d7f8363f39 100644
--- a/tools/lib/perf/include/perf/schedstat-cpu-v15.h
+++ b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
@@ -1,13 +1,22 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef CPU_FIELD
-CPU_FIELD(__u32, yld_count, v15)
-CPU_FIELD(__u32, array_exp, v15)
-CPU_FIELD(__u32, sched_count, v15)
-CPU_FIELD(__u32, sched_goidle, v15)
-CPU_FIELD(__u32, ttwu_count, v15)
-CPU_FIELD(__u32, ttwu_local, v15)
-CPU_FIELD(__u64, rq_cpu_time, v15)
-CPU_FIELD(__u64, run_delay, v15)
-CPU_FIELD(__u64, pcount, v15)
+CPU_FIELD(__u32, yld_count, "sched_yield() count",
+	  "%11u", false, yld_count, v15)
+CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
+	  "%11u", false, array_exp, v15)
+CPU_FIELD(__u32, sched_count, "schedule() called",
+	  "%11u", false, sched_count, v15)
+CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
+	  "%11u", true, sched_count, v15)
+CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
+	  "%11u", false, ttwu_count, v15)
+CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
+	  "%11u", true, ttwu_count, v15)
+CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
+	  "%11llu", false, rq_cpu_time, v15)
+CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
+	  "%11llu", true, rq_cpu_time, v15)
+CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
+	  "%11llu", false, pcount, v15)
 #endif
diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v16.h b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
index f3a55131a05a..ac58d68ec661 100644
--- a/tools/lib/perf/include/perf/schedstat-cpu-v16.h
+++ b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
@@ -1,13 +1,22 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef CPU_FIELD
-CPU_FIELD(__u32, yld_count, v16)
-CPU_FIELD(__u32, array_exp, v16)
-CPU_FIELD(__u32, sched_count, v16)
-CPU_FIELD(__u32, sched_goidle, v16)
-CPU_FIELD(__u32, ttwu_count, v16)
-CPU_FIELD(__u32, ttwu_local, v16)
-CPU_FIELD(__u64, rq_cpu_time, v16)
-CPU_FIELD(__u64, run_delay, v16)
-CPU_FIELD(__u64, pcount, v16)
+CPU_FIELD(__u32, yld_count, "sched_yield() count",
+	  "%11u", false, yld_count, v16)
+CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
+	  "%11u", false, array_exp, v16)
+CPU_FIELD(__u32, sched_count, "schedule() called",
+	  "%11u", false, sched_count, v16)
+CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
+	  "%11u", true, sched_count, v16)
+CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
+	  "%11u", false, ttwu_count, v16)
+CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
+	  "%11u", true, ttwu_count, v16)
+CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
+	  "%11llu", false, rq_cpu_time, v16)
+CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
+	  "%11llu", true, rq_cpu_time, v16)
+CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
+	  "%11llu", false, pcount, v16)
 #endif /* CPU_FIELD */
diff --git a/tools/lib/perf/include/perf/schedstat-domain-v15.h b/tools/lib/perf/include/perf/schedstat-domain-v15.h
index 422e713d617a..c1f782e08c1f 100644
--- a/tools/lib/perf/include/perf/schedstat-domain-v15.h
+++ b/tools/lib/perf/include/perf/schedstat-domain-v15.h
@@ -1,40 +1,121 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef DOMAIN_FIELD
-DOMAIN_FIELD(__u32, idle_lb_count, v15)
-DOMAIN_FIELD(__u32, idle_lb_balanced, v15)
-DOMAIN_FIELD(__u32, idle_lb_failed, v15)
-DOMAIN_FIELD(__u32, idle_lb_imbalance, v15)
-DOMAIN_FIELD(__u32, idle_lb_gained, v15)
-DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15)
-DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15)
-DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15)
-DOMAIN_FIELD(__u32, busy_lb_count, v15)
-DOMAIN_FIELD(__u32, busy_lb_balanced, v15)
-DOMAIN_FIELD(__u32, busy_lb_failed, v15)
-DOMAIN_FIELD(__u32, busy_lb_imbalance, v15)
-DOMAIN_FIELD(__u32, busy_lb_gained, v15)
-DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15)
-DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15)
-DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15)
-DOMAIN_FIELD(__u32, newidle_lb_count, v15)
-DOMAIN_FIELD(__u32, newidle_lb_balanced, v15)
-DOMAIN_FIELD(__u32, newidle_lb_failed, v15)
-DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15)
-DOMAIN_FIELD(__u32, newidle_lb_gained, v15)
-DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15)
-DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15)
-DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15)
-DOMAIN_FIELD(__u32, alb_count, v15)
-DOMAIN_FIELD(__u32, alb_failed, v15)
-DOMAIN_FIELD(__u32, alb_pushed, v15)
-DOMAIN_FIELD(__u32, sbe_count, v15)
-DOMAIN_FIELD(__u32, sbe_balanced, v15)
-DOMAIN_FIELD(__u32, sbe_pushed, v15)
-DOMAIN_FIELD(__u32, sbf_count, v15)
-DOMAIN_FIELD(__u32, sbf_balanced, v15)
-DOMAIN_FIELD(__u32, sbf_pushed, v15)
-DOMAIN_FIELD(__u32, ttwu_wake_remote, v15)
-DOMAIN_FIELD(__u32, ttwu_move_affine, v15)
-DOMAIN_FIELD(__u32, ttwu_move_balance, v15)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category idle> ")
+#endif
+DOMAIN_FIELD(__u32, idle_lb_count,
+	     "load_balance() count on cpu idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, idle_lb_balanced,
+	     "load_balance() found balanced on cpu idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, idle_lb_failed,
+	     "load_balance() move task failed on cpu idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, idle_lb_imbalance,
+	     "imbalance sum on cpu idle", "%11u", false, v15)
+DOMAIN_FIELD(__u32, idle_lb_gained,
+	     "pull_task() count on cpu idle", "%11u", false, v15)
+DOMAIN_FIELD(__u32, idle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v15)
+DOMAIN_FIELD(__u32, idle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, idle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v15)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v15)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v15)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category busy> ")
+#endif
+DOMAIN_FIELD(__u32, busy_lb_count,
+	     "load_balance() count on cpu busy", "%11u", true, v15)
+DOMAIN_FIELD(__u32, busy_lb_balanced,
+	     "load_balance() found balanced on cpu busy", "%11u", true, v15)
+DOMAIN_FIELD(__u32, busy_lb_failed,
+	     "load_balance() move task failed on cpu busy", "%11u", true, v15)
+DOMAIN_FIELD(__u32, busy_lb_imbalance,
+	     "imbalance sum on cpu busy", "%11u", false, v15)
+DOMAIN_FIELD(__u32, busy_lb_gained,
+	     "pull_task() count on cpu busy", "%11u", false, v15)
+DOMAIN_FIELD(__u32, busy_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v15)
+DOMAIN_FIELD(__u32, busy_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v15)
+DOMAIN_FIELD(__u32, busy_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v15)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v15)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v15)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category newidle> ")
+#endif
+DOMAIN_FIELD(__u32, newidle_lb_count,
+	     "load_balance() count on cpu newly idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, newidle_lb_balanced,
+	     "load_balance() found balanced on cpu newly idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, newidle_lb_failed,
+	     "load_balance() move task failed on cpu newly idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, newidle_lb_imbalance,
+	     "imbalance sum on cpu newly idle", "%11u", false, v15)
+DOMAIN_FIELD(__u32, newidle_lb_gained,
+	     "pull_task() count on cpu newly idle", "%11u", false, v15)
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v15)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v15)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v15)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v15)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v15)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category active_load_balance()> ")
+#endif
+DOMAIN_FIELD(__u32, alb_count,
+	     "active_load_balance() count", "%11u", false, v15)
+DOMAIN_FIELD(__u32, alb_failed,
+	     "active_load_balance() move task failed", "%11u", false, v15)
+DOMAIN_FIELD(__u32, alb_pushed,
+	     "active_load_balance() successfully moved a task", "%11u", false, v15)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_exec()> ")
+#endif
+DOMAIN_FIELD(__u32, sbe_count,
+	     "sbe_count is not used", "%11u", false, v15)
+DOMAIN_FIELD(__u32, sbe_balanced,
+	     "sbe_balanced is not used", "%11u", false, v15)
+DOMAIN_FIELD(__u32, sbe_pushed,
+	     "sbe_pushed is not used", "%11u", false, v15)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_fork()> ")
+#endif
+DOMAIN_FIELD(__u32, sbf_count,
+	     "sbf_count is not used", "%11u", false, v15)
+DOMAIN_FIELD(__u32, sbf_balanced,
+	     "sbf_balanced is not used", "%11u", false, v15)
+DOMAIN_FIELD(__u32, sbf_pushed,
+	     "sbf_pushed is not used", "%11u", false, v15)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Wakeup Info> ")
+#endif
+DOMAIN_FIELD(__u32, ttwu_wake_remote,
+	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v15)
+DOMAIN_FIELD(__u32, ttwu_move_affine,
+	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v15)
+DOMAIN_FIELD(__u32, ttwu_move_balance,
+	     "try_to_wake_up() started passive balancing", "%11u", false, v15)
 #endif /* DOMAIN_FIELD */
diff --git a/tools/lib/perf/include/perf/schedstat-domain-v16.h b/tools/lib/perf/include/perf/schedstat-domain-v16.h
index d6ef895c9d32..b8038048e662 100644
--- a/tools/lib/perf/include/perf/schedstat-domain-v16.h
+++ b/tools/lib/perf/include/perf/schedstat-domain-v16.h
@@ -1,40 +1,121 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef DOMAIN_FIELD
-DOMAIN_FIELD(__u32, busy_lb_count, v16)
-DOMAIN_FIELD(__u32, busy_lb_balanced, v16)
-DOMAIN_FIELD(__u32, busy_lb_failed, v16)
-DOMAIN_FIELD(__u32, busy_lb_imbalance, v16)
-DOMAIN_FIELD(__u32, busy_lb_gained, v16)
-DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16)
-DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16)
-DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16)
-DOMAIN_FIELD(__u32, idle_lb_count, v16)
-DOMAIN_FIELD(__u32, idle_lb_balanced, v16)
-DOMAIN_FIELD(__u32, idle_lb_failed, v16)
-DOMAIN_FIELD(__u32, idle_lb_imbalance, v16)
-DOMAIN_FIELD(__u32, idle_lb_gained, v16)
-DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16)
-DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16)
-DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16)
-DOMAIN_FIELD(__u32, newidle_lb_count, v16)
-DOMAIN_FIELD(__u32, newidle_lb_balanced, v16)
-DOMAIN_FIELD(__u32, newidle_lb_failed, v16)
-DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16)
-DOMAIN_FIELD(__u32, newidle_lb_gained, v16)
-DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16)
-DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16)
-DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16)
-DOMAIN_FIELD(__u32, alb_count, v16)
-DOMAIN_FIELD(__u32, alb_failed, v16)
-DOMAIN_FIELD(__u32, alb_pushed, v16)
-DOMAIN_FIELD(__u32, sbe_count, v16)
-DOMAIN_FIELD(__u32, sbe_balanced, v16)
-DOMAIN_FIELD(__u32, sbe_pushed, v16)
-DOMAIN_FIELD(__u32, sbf_count, v16)
-DOMAIN_FIELD(__u32, sbf_balanced, v16)
-DOMAIN_FIELD(__u32, sbf_pushed, v16)
-DOMAIN_FIELD(__u32, ttwu_wake_remote, v16)
-DOMAIN_FIELD(__u32, ttwu_move_affine, v16)
-DOMAIN_FIELD(__u32, ttwu_move_balance, v16)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category busy> ")
+#endif
+DOMAIN_FIELD(__u32, busy_lb_count,
+	     "load_balance() count on cpu busy", "%11u", true, v16)
+DOMAIN_FIELD(__u32, busy_lb_balanced,
+	     "load_balance() found balanced on cpu busy", "%11u", true, v16)
+DOMAIN_FIELD(__u32, busy_lb_failed,
+	     "load_balance() move task failed on cpu busy", "%11u", true, v16)
+DOMAIN_FIELD(__u32, busy_lb_imbalance,
+	     "imbalance sum on cpu busy", "%11u", false, v16)
+DOMAIN_FIELD(__u32, busy_lb_gained,
+	     "pull_task() count on cpu busy", "%11u", false, v16)
+DOMAIN_FIELD(__u32, busy_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v16)
+DOMAIN_FIELD(__u32, busy_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v16)
+DOMAIN_FIELD(__u32, busy_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v16)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v16)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v16)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category idle> ")
+#endif
+DOMAIN_FIELD(__u32, idle_lb_count,
+	     "load_balance() count on cpu idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, idle_lb_balanced,
+	     "load_balance() found balanced on cpu idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, idle_lb_failed,
+	     "load_balance() move task failed on cpu idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, idle_lb_imbalance,
+	     "imbalance sum on cpu idle", "%11u", false, v16)
+DOMAIN_FIELD(__u32, idle_lb_gained,
+	     "pull_task() count on cpu idle", "%11u", false, v16)
+DOMAIN_FIELD(__u32, idle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v16)
+DOMAIN_FIELD(__u32, idle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, idle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v16)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v16)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v16)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category newidle> ")
+#endif
+DOMAIN_FIELD(__u32, newidle_lb_count,
+	     "load_balance() count on cpu newly idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, newidle_lb_balanced,
+	     "load_balance() found balanced on cpu newly idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, newidle_lb_failed,
+	     "load_balance() move task failed on cpu newly idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, newidle_lb_imbalance,
+	     "imbalance sum on cpu newly idle", "%11u", false, v16)
+DOMAIN_FIELD(__u32, newidle_lb_gained,
+	     "pull_task() count on cpu newly idle", "%11u", false, v16)
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v16)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v16)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v16)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v16)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v16)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category active_load_balance()> ")
+#endif
+DOMAIN_FIELD(__u32, alb_count,
+	     "active_load_balance() count", "%11u", false, v16)
+DOMAIN_FIELD(__u32, alb_failed,
+	     "active_load_balance() move task failed", "%11u", false, v16)
+DOMAIN_FIELD(__u32, alb_pushed,
+	     "active_load_balance() successfully moved a task", "%11u", false, v16)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_exec()> ")
+#endif
+DOMAIN_FIELD(__u32, sbe_count,
+	     "sbe_count is not used", "%11u", false, v16)
+DOMAIN_FIELD(__u32, sbe_balanced,
+	     "sbe_balanced is not used", "%11u", false, v16)
+DOMAIN_FIELD(__u32, sbe_pushed,
+	     "sbe_pushed is not used", "%11u", false, v16)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_fork()> ")
+#endif
+DOMAIN_FIELD(__u32, sbf_count,
+	     "sbf_count is not used", "%11u", false, v16)
+DOMAIN_FIELD(__u32, sbf_balanced,
+	     "sbf_balanced is not used", "%11u", false, v16)
+DOMAIN_FIELD(__u32, sbf_pushed,
+	     "sbf_pushed is not used", "%11u", false, v16)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Wakeup Info> ")
+#endif
+DOMAIN_FIELD(__u32, ttwu_wake_remote,
+	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v16)
+DOMAIN_FIELD(__u32, ttwu_move_affine,
+	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v16)
+DOMAIN_FIELD(__u32, ttwu_move_balance,
+	     "try_to_wake_up() started passive balancing", "%11u", false, v16)
 #endif /* DOMAIN_FIELD */
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 6ea0db05aa41..cd465644fce5 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -3855,6 +3855,470 @@ static int perf_sched__schedstat_record(struct perf_sched *sched,
 	return err;
 }
 
+struct schedstat_domain {
+	struct perf_record_schedstat_domain *domain_data;
+	struct schedstat_domain *next;
+};
+
+struct schedstat_cpu {
+	struct perf_record_schedstat_cpu *cpu_data;
+	struct schedstat_domain *domain_head;
+	struct schedstat_cpu *next;
+};
+
+static struct schedstat_cpu *cpu_head, *cpu_tail, *cpu_second_pass;
+static struct schedstat_domain *domain_tail, *domain_second_pass;
+
+static void store_schedtstat_cpu_diff(struct schedstat_cpu *after_workload)
+{
+	struct perf_record_schedstat_cpu *before = cpu_second_pass->cpu_data;
+	struct perf_record_schedstat_cpu *after = after_workload->cpu_data;
+	__u16 version = after_workload->cpu_data->version;
+
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
+	(before->_ver._name = after->_ver._name - before->_ver._name);
+
+	if (version == 15) {
+#include <perf/schedstat-cpu-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-cpu-v16.h>
+	}
+
+#undef CPU_FIELD
+}
+
+static void store_schedstat_domain_diff(struct schedstat_domain *after_workload)
+{
+	struct perf_record_schedstat_domain *before = domain_second_pass->domain_data;
+	struct perf_record_schedstat_domain *after = after_workload->domain_data;
+	__u16 version = after_workload->domain_data->version;
+
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
+	(before->_ver._name = after->_ver._name - before->_ver._name);
+
+	if (version == 15) {
+#include <perf/schedstat-domain-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-domain-v16.h>
+	}
+#undef DOMAIN_FIELD
+}
+
+static void print_separator(size_t pre_dash_cnt, const char *s, size_t post_dash_cnt)
+{
+	size_t i;
+
+	for (i = 0; i < pre_dash_cnt; ++i)
+		printf("-");
+
+	printf("%s", s);
+
+	for (i = 0; i < post_dash_cnt; ++i)
+		printf("-");
+
+	printf("\n");
+}
+
+static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
+{
+#define CALC_PCT(_x, _y)	((_y) ? ((double)(_x) / (_y)) * 100 : 0.0)
+
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
+	do {									\
+		if (_is_pct) {							\
+			printf("%-60s: " _format " ( %3.2lf%% )\n", _desc,	\
+			       cs->_ver._name,					\
+			       CALC_PCT(cs->_ver._name, cs->_ver._pct_of));	\
+		} else {							\
+			printf("%-60s: " _format "\n", _desc, cs->_ver._name);	\
+		}								\
+	} while (0);
+
+	if (cs->version == 15) {
+#include <perf/schedstat-cpu-v15.h>
+	} else if (cs->version == 16) {
+#include <perf/schedstat-cpu-v16.h>
+	}
+
+#undef CPU_FIELD
+#undef CALC_PCT
+}
+
+static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
+				      __u64 jiffies)
+{
+#define DOMAIN_CATEGORY(_desc)							\
+	do {									\
+		size_t _len = strlen(_desc);					\
+		size_t _pre_dash_cnt = (100 - _len) / 2;			\
+		size_t _post_dash_cnt = 100 - _len - _pre_dash_cnt;		\
+		print_separator(_pre_dash_cnt, _desc, _post_dash_cnt);		\
+	} while (0);
+
+#define CALC_AVG(_x, _y)	((_y) ? (long double)(_x) / (_y) : 0.0)
+
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
+	do {									\
+		if (_is_jiffies) {						\
+			printf("%-65s: " _format "   $ %11.2Lf $\n", _desc,	\
+			       ds->_ver._name,					\
+			       CALC_AVG(jiffies, ds->_ver._name));		\
+		} else {							\
+			printf("%-65s: " _format "\n", _desc, ds->_ver._name);	\
+		}								\
+	} while (0);
+
+#define DERIVED_CNT_FIELD(_desc, _format, _x, _y, _z, _ver)			\
+	do {									\
+		printf("*%-64s: " _format "\n", _desc,				\
+		       (ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z));		\
+	} while (0);
+
+#define DERIVED_AVG_FIELD(_desc, _format, _x, _y, _z, _w, _ver)		\
+	do {									\
+		printf("*%-64s: " _format "\n", _desc, CALC_AVG(ds->_ver._w,	\
+		       ((ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))));	\
+	} while (0);
+
+	if (ds->version == 15) {
+#include <perf/schedstat-domain-v15.h>
+	} else if (ds->version == 16) {
+#include <perf/schedstat-domain-v16.h>
+	}
+
+#undef DERIVED_AVG_FIELD
+#undef DERIVED_CNT_FIELD
+#undef DOMAIN_FIELD
+#undef CALC_AVG
+#undef DOMAIN_CATEGORY
+}
+
+static void print_domain_cpu_list(struct perf_record_schedstat_domain *ds)
+{
+	char bin[16][5] = {"0000", "0001", "0010", "0011",
+			   "0100", "0101", "0110", "0111",
+			   "1000", "1001", "1010", "1011",
+			   "1100", "1101", "1110", "1111"};
+	bool print_flag = false, low = true;
+	int cpu = 0, start, end, idx;
+
+	idx = ((ds->nr_cpus + 7) >> 3) - 1;
+
+	printf("<");
+	while (idx >= 0) {
+		__u8 index;
+
+		if (low)
+			index = ds->cpu_mask[idx] & 0xf;
+		else
+			index = (ds->cpu_mask[idx--] & 0xf0) >> 4;
+
+		for (int i = 3; i >= 0; i--) {
+			if (!print_flag && bin[index][i] == '1') {
+				start = cpu;
+				print_flag = true;
+			} else if (print_flag && bin[index][i] == '0') {
+				end = cpu - 1;
+				print_flag = false;
+				if (start == end)
+					printf("%d, ", start);
+				else
+					printf("%d-%d, ", start, end);
+			}
+			cpu++;
+		}
+
+		low = !low;
+	}
+
+	if (print_flag) {
+		if (start == cpu - 1)
+			printf("%d, ", start);
+		else
+			printf("%d-%d, ", start, cpu - 1);
+	}
+	printf("\b\b>\n");
+}
+
+static void summarize_schedstat_cpu(struct schedstat_cpu *summary_cpu,
+				    struct schedstat_cpu *cptr,
+				    int cnt, bool is_last)
+{
+	struct perf_record_schedstat_cpu *summary_cs = summary_cpu->cpu_data,
+					 *temp_cs = cptr->cpu_data;
+
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
+	do {									\
+		summary_cs->_ver._name += temp_cs->_ver._name;			\
+		if (is_last)							\
+			summary_cs->_ver._name /= cnt;				\
+	} while (0);
+
+	if (cptr->cpu_data->version == 15) {
+#include <perf/schedstat-cpu-v15.h>
+	} else if (cptr->cpu_data->version == 16) {
+#include <perf/schedstat-cpu-v16.h>
+	}
+#undef CPU_FIELD
+}
+
+static void summarize_schedstat_domain(struct schedstat_domain *summary_domain,
+				       struct schedstat_domain *dptr,
+				       int cnt, bool is_last)
+{
+	struct perf_record_schedstat_domain *summary_ds = summary_domain->domain_data,
+					    *temp_ds = dptr->domain_data;
+
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
+	do {									\
+		summary_ds->_ver._name += temp_ds->_ver._name;			\
+		if (is_last)							\
+			summary_ds->_ver._name /= cnt;				\
+	} while (0);
+
+	if (dptr->domain_data->version == 15) {
+#include <perf/schedstat-domain-v15.h>
+	} else if (dptr->domain_data->version == 16) {
+#include <perf/schedstat-domain-v16.h>
+	}
+#undef DOMAIN_FIELD
+}
+
+/* FIXME: The code fails (segfaults) when one or more CPUs are offline. */
+static void show_schedstat_data(void)
+{
+	struct schedstat_domain *dptr = NULL, *tdptr = NULL, *dtail = NULL;
+	struct schedstat_cpu *cptr = cpu_head, *summary_head;
+	struct perf_record_schedstat_domain *ds = NULL;
+	struct perf_record_schedstat_cpu *cs = NULL;
+	__u64 jiffies = cptr->cpu_data->timestamp;
+	bool is_last = false, is_summary = true;
+	int cnt = 0;
+
+	print_separator(100, "", 0);
+	printf("Time elapsed (in jiffies)                                   : %11llu\n", jiffies);
+	print_separator(100, "", 0);
+
+	if (cptr) {
+		summary_head = zalloc(sizeof(*summary_head));
+		summary_head->cpu_data = zalloc(sizeof(*cs));
+		memcpy(summary_head->cpu_data, cptr->cpu_data, sizeof(*cs));
+		summary_head->next = NULL;
+		summary_head->domain_head = NULL;
+		dptr = cptr->domain_head;
+
+		while (dptr) {
+			size_t cpu_mask_size = (dptr->domain_data->nr_cpus + 7) >> 3;
+
+			tdptr = zalloc(sizeof(*tdptr));
+			tdptr->domain_data = zalloc(sizeof(*ds) + cpu_mask_size);
+			memcpy(tdptr->domain_data, dptr->domain_data, sizeof(*ds) + cpu_mask_size);
+
+			tdptr->next = NULL;
+			if (!dtail) {
+				summary_head->domain_head = tdptr;
+				dtail = tdptr;
+			} else {
+				dtail->next = tdptr;
+				dtail = dtail->next;
+			}
+			dptr = dptr->next;
+		}
+	}
+
+	cptr = cpu_head;
+	while (cptr) {
+		if (!cptr->next)
+			is_last = true;
+
+		cnt++;
+		summarize_schedstat_cpu(summary_head, cptr, cnt, is_last);
+		tdptr = summary_head->domain_head;
+		dptr = cptr->domain_head;
+
+		while (tdptr) {
+			summarize_schedstat_domain(tdptr, dptr, cnt, is_last);
+			tdptr = tdptr->next;
+			dptr = dptr->next;
+		}
+		cptr = cptr->next;
+	}
+
+	cptr = cpu_head;
+	summary_head->next = cptr;
+	cptr = summary_head;
+	while (cptr) {
+		cs = cptr->cpu_data;
+		printf("\n");
+		print_separator(100, "", 0);
+		if (is_summary)
+			printf("CPU <ALL CPUS SUMMARY>\n");
+		else
+			printf("CPU %d\n", cs->cpu);
+
+		print_separator(100, "", 0);
+		print_cpu_stats(cs);
+		print_separator(100, "", 0);
+
+		dptr = cptr->domain_head;
+
+		while (dptr) {
+			ds = dptr->domain_data;
+			if (is_summary) {
+				if (ds->name[0])
+					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %s\n", ds->name);
+				else
+					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %d\n", ds->domain);
+			} else {
+				if (ds->name[0])
+					printf("CPU %d, DOMAIN %s CPUS ", cs->cpu, ds->name);
+				else
+					printf("CPU %d, DOMAIN %d CPUS ", cs->cpu, ds->domain);
+
+				print_domain_cpu_list(ds);
+			}
+			print_domain_stats(ds, jiffies);
+			print_separator(100, "", 0);
+
+			dptr = dptr->next;
+		}
+		is_summary = false;
+		cptr = cptr->next;
+	}
+}
+
+static int perf_sched__process_schedstat(struct perf_session *session __maybe_unused,
+					 union perf_event *event)
+{
+	static bool after_workload_flag;
+	struct perf_cpu this_cpu;
+	static __u32 initial_cpu;
+
+	switch (event->header.type) {
+	case PERF_RECORD_SCHEDSTAT_CPU:
+		this_cpu.cpu = event->schedstat_cpu.cpu;
+		break;
+	case PERF_RECORD_SCHEDSTAT_DOMAIN:
+		this_cpu.cpu = event->schedstat_domain.cpu;
+		break;
+	default:
+		return 0;
+	}
+
+	if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu))
+		return 0;
+
+	if (event->header.type == PERF_RECORD_SCHEDSTAT_CPU) {
+		struct schedstat_cpu *temp = zalloc(sizeof(struct schedstat_cpu));
+
+		temp->cpu_data = zalloc(sizeof(struct perf_record_schedstat_cpu));
+		memcpy(temp->cpu_data, &event->schedstat_cpu,
+		       sizeof(struct perf_record_schedstat_cpu));
+		temp->next = NULL;
+		temp->domain_head = NULL;
+
+		if (cpu_head && temp->cpu_data->cpu == initial_cpu)
+			after_workload_flag = true;
+
+		if (!after_workload_flag) {
+			if (!cpu_head) {
+				initial_cpu = temp->cpu_data->cpu;
+				cpu_head = temp;
+			} else
+				cpu_tail->next = temp;
+
+			cpu_tail = temp;
+		} else {
+			if (temp->cpu_data->cpu == initial_cpu) {
+				cpu_second_pass = cpu_head;
+				cpu_head->cpu_data->timestamp =
+					temp->cpu_data->timestamp - cpu_second_pass->cpu_data->timestamp;
+			} else {
+				cpu_second_pass = cpu_second_pass->next;
+			}
+			domain_second_pass = cpu_second_pass->domain_head;
+			store_schedtstat_cpu_diff(temp);
+		}
+	} else if (event->header.type == PERF_RECORD_SCHEDSTAT_DOMAIN) {
+		size_t cpu_mask_size = (event->schedstat_domain.nr_cpus + 7) >> 3;
+		struct schedstat_domain *temp = zalloc(sizeof(struct schedstat_domain));
+
+		temp->domain_data = zalloc(sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
+		memcpy(temp->domain_data, &event->schedstat_domain,
+		       sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
+		temp->next = NULL;
+
+		if (!after_workload_flag) {
+			if (cpu_tail->domain_head == NULL) {
+				cpu_tail->domain_head = temp;
+				domain_tail = temp;
+			} else {
+				domain_tail->next = temp;
+				domain_tail = temp;
+			}
+		} else {
+			store_schedstat_domain_diff(temp);
+			domain_second_pass = domain_second_pass->next;
+		}
+	}
+
+	return 0;
+}
+
+static void free_schedstat(void)
+{
+	struct schedstat_cpu *cptr = cpu_head, *tmp_cptr;
+	struct schedstat_domain *dptr = NULL, *tmp_dptr;
+
+	while (cptr) {
+		tmp_cptr = cptr;
+		dptr = cptr->domain_head;
+
+		while (dptr) {
+			tmp_dptr = dptr;
+			dptr = dptr->next;
+			free(tmp_dptr);
+		}
+		cptr = cptr->next;
+		free(tmp_cptr);
+	}
+}
+
+static int perf_sched__schedstat_report(struct perf_sched *sched)
+{
+	struct perf_session *session;
+	struct perf_data data = {
+		.path  = input_name,
+		.mode  = PERF_DATA_MODE_READ,
+	};
+	int err;
+
+	if (cpu_list) {
+		user_requested_cpus = perf_cpu_map__new(cpu_list);
+		if (!user_requested_cpus)
+			return -EINVAL;
+	}
+
+	sched->tool.schedstat_cpu = perf_sched__process_schedstat;
+	sched->tool.schedstat_domain = perf_sched__process_schedstat;
+
+	session = perf_session__new(&data, &sched->tool);
+	if (IS_ERR(session)) {
+		pr_err("Perf session creation failed.\n");
+		return PTR_ERR(session);
+	}
+
+	err = perf_session__process_events(session);
+
+	perf_session__delete(session);
+	if (!err) {
+		setup_pager();
+		show_schedstat_data();
+		free_schedstat();
+	}
+	return err;
+}
+
 static bool schedstat_events_exposed(void)
 {
 	/*
@@ -4031,6 +4495,8 @@ int cmd_sched(int argc, const char **argv)
 	OPT_PARENT(sched_options)
 	};
 	const struct option stats_options[] = {
+	OPT_STRING('i', "input", &input_name, "file",
+		   "`stats report` with input filename"),
 	OPT_STRING('o', "output", &output_name, "file",
 		   "`stats record` with output filename"),
 	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
@@ -4054,7 +4520,7 @@ int cmd_sched(int argc, const char **argv)
 		NULL
 	};
 	const char * stats_usage[] = {
-		"perf sched stats {record} [<options>]",
+		"perf sched stats {record|report} [<options>]",
 		NULL
 	};
 	const char *const sched_subcommands[] = { "record", "latency", "map",
@@ -4156,7 +4622,7 @@ int cmd_sched(int argc, const char **argv)
 
 		return perf_sched__timehist(&sched);
 	} else if (!strcmp(argv[0], "stats")) {
-		const char *const stats_subcommands[] = {"record", NULL};
+		const char *const stats_subcommands[] = {"record", "report", NULL};
 
 		argc = parse_options_subcommand(argc, argv, stats_options,
 						stats_subcommands,
@@ -4168,6 +4634,11 @@ int cmd_sched(int argc, const char **argv)
 				argc = parse_options(argc, argv, stats_options,
 						     stats_usage, 0);
 			return perf_sched__schedstat_record(&sched, argc, argv);
+		} else if (argv[0] && !strcmp(argv[0], "report")) {
+			if (argc)
+				argc = parse_options(argc, argv, stats_options,
+						     stats_usage, 0);
+			return perf_sched__schedstat_report(&sched);
 		}
 		usage_with_options(stats_usage, stats_options);
 	} else {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index d138e4a5787c..0f14fe225be0 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -560,7 +560,7 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
 
 	size = fprintf(fp, "\ncpu%u ", cs->cpu);
 
-#define CPU_FIELD(_type, _name, _ver)						\
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
 	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)cs->_ver._name);	\
 
 	if (version == 15) {
@@ -638,7 +638,7 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
 	size += fprintf(fp, "%s ", cpu_mask);
 	free(cpu_mask);
 
-#define DOMAIN_FIELD(_type, _name, _ver)					\
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
 	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)ds->_ver._name);	\
 
 	if (version == 15) {
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 73b2492a4cde..24f9cb95d07d 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2535,7 +2535,7 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
 	if (io__get_dec(io, (__u64 *)&cs->cpu) != ' ')
 		goto out_cpu;
 
-#define CPU_FIELD(_type, _name, _ver)					\
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
 	do {								\
 		__u64 _tmp;						\
 		ch = io__get_dec(io, &_tmp);				\
@@ -2658,7 +2658,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 	free(d_name);
 	free(cpu_mask);
 
-#define DOMAIN_FIELD(_type, _name, _ver)				\
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
 	do {								\
 		__u64 _tmp;						\
 		ch = io__get_dec(io, &_tmp);				\
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/5] perf sched stats: Add support for live mode
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
                   ` (3 preceding siblings ...)
  2024-09-16 16:47 ` [PATCH 4/5] perf sched stats: Add support for report subcommand Ravi Bangoria
@ 2024-09-16 16:47 ` Ravi Bangoria
  2024-09-17 10:35 ` [PATCH 0/5] perf sched: Introduce stats tool James Clark
  2024-09-17 10:57 ` Madadi Vineeth Reddy
  6 siblings, 0 replies; 17+ messages in thread
From: Ravi Bangoria @ 2024-09-16 16:47 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

From: Swapnil Sapkal <swapnil.sapkal@amd.com>

The live mode works similarly to a plain `perf stat` command: it profiles
the target and prints the results on the terminal as soon as the target
finishes.

Example usage:

  # perf sched stats -- sleep 10

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 tools/perf/builtin-sched.c | 87 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index cd465644fce5..e888ae763eac 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -4319,6 +4319,91 @@ static int perf_sched__schedstat_report(struct perf_sched *sched)
 	return err;
 }
 
+static int process_synthesized_event_live(const struct perf_tool *tool __maybe_unused,
+					  union perf_event *event,
+					  struct perf_sample *sample __maybe_unused,
+					  struct machine *machine __maybe_unused)
+{
+	return perf_sched__process_schedstat(NULL, event);
+}
+
+static int perf_sched__schedstat_live(struct perf_sched *sched,
+				      int argc, const char **argv)
+{
+	struct evlist *evlist;
+	struct target *target;
+	int reset = 0;
+	int err = 0;
+
+	signal(SIGINT, sighandler);
+	signal(SIGCHLD, sighandler);
+	signal(SIGTERM, sighandler);
+
+	evlist = evlist__new();
+	if (!evlist)
+		return -ENOMEM;
+
+	/*
+	 * `perf sched schedstat` does not support workload profiling (-p pid)
+	 * since /proc/schedstat file contains cpu specific data only. Hence, a
+	 * profile target is either set of cpus or systemwide, never a process.
+	 * Note that, although `-- <workload>` is supported, profile data are
+	 * still cpu/systemwide.
+	 */
+	target = zalloc(sizeof(struct target));
+	if (cpu_list)
+		target->cpu_list = cpu_list;
+	else
+		target->system_wide = true;
+
+	if (argc) {
+		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
+		if (err)
+			goto out_target;
+	}
+
+	if (cpu_list) {
+		user_requested_cpus = perf_cpu_map__new(cpu_list);
+		if (!user_requested_cpus)
+			goto out_target;
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_event_live,
+					       user_requested_cpus);
+	if (err < 0)
+		goto out_target;
+
+	err = enable_sched_schedstats(&reset);
+	if (err < 0)
+		goto out_target;
+
+	if (argc)
+		evlist__start_workload(evlist);
+
+	/* wait for signal */
+	pause();
+
+	if (reset) {
+		err = disable_sched_schedstat();
+		if (err < 0)
+			goto out_target;
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_event_live,
+					       user_requested_cpus);
+	if (err)
+		goto out_target;
+
+	setup_pager();
+	show_schedstat_data();
+	free_schedstat();
+out_target:
+	free(target);
+	return err;
+}
+
 static bool schedstat_events_exposed(void)
 {
 	/*
@@ -4640,7 +4725,7 @@ int cmd_sched(int argc, const char **argv)
 						     stats_usage, 0);
 			return perf_sched__schedstat_report(&sched);
 		}
-		usage_with_options(stats_usage, stats_options);
+		return perf_sched__schedstat_live(&sched, argc, argv);
 	} else {
 		usage_with_options(sched_usage, sched_options);
 	}
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] perf sched stats: Add record and rawdump support
  2024-09-16 16:47 ` [PATCH 2/5] perf sched stats: Add record and rawdump support Ravi Bangoria
@ 2024-09-17 10:35   ` James Clark
  2024-09-18  8:52     ` Sapkal, Swapnil
  2024-09-26  6:12   ` Namhyung Kim
  1 sibling, 1 reply; 17+ messages in thread
From: James Clark @ 2024-09-17 10:35 UTC (permalink / raw)
  To: Ravi Bangoria, swapnil.sapkal
  Cc: yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, bristot, adrian.hunter, james.clark, kan.liang,
	gautham.shenoy, kprateek.nayak, juri.lelli, yangjihong, void, tj,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, peterz, mingo, acme, namhyung, irogers



On 16/09/2024 17:47, Ravi Bangoria wrote:
> From: Swapnil Sapkal <swapnil.sapkal@amd.com>
> 
> Define new, perf tool only, sample types and their layouts. Add logic
> to parse /proc/schedstat, convert it to perf sample format and save
> samples to perf.data file with `perf sched stats record` command. Also
> add logic to read perf.data file, interpret schedstat samples and
> print rawdump of samples with `perf script -D`.
> 
> Note that, /proc/schedstat file output is standardized with version
> number. The patch supports v15 but older or newer version can be added
> easily.
> 
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---

[...]

> +int perf_event__synthesize_schedstat(const struct perf_tool *tool,
> +				     perf_event__handler_t process,
> +				     struct perf_cpu_map *user_requested_cpus)
> +{
> +	union perf_event *event = NULL;
> +	size_t line_len = 0;
> +	char *line = NULL;
> +	char bf[BUFSIZ];
> +	__u64 timestamp;
> +	__u16 version;
> +	struct io io;
> +	int ret = -1;
> +	int cpu = -1;
> +	char ch;
> +
> +	io.fd = open("/proc/schedstat", O_RDONLY, 0);

Other parts of the tool use procfs__mountpoint() for /proc. Although it 
can only be in one place so it doesn't actually make a difference for 
this one. Probably worth it for consistency though.

> +	if (io.fd < 0) {
> +		pr_err("Failed to open /proc/schedstat\n");

A hint about CONFIG_SCHEDSTAT would be useful here.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] perf sched: Introduce stats tool
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
                   ` (4 preceding siblings ...)
  2024-09-16 16:47 ` [PATCH 5/5] perf sched stats: Add support for live mode Ravi Bangoria
@ 2024-09-17 10:35 ` James Clark
  2024-09-18 13:19   ` Sapkal, Swapnil
  2024-09-17 10:57 ` Madadi Vineeth Reddy
  6 siblings, 1 reply; 17+ messages in thread
From: James Clark @ 2024-09-17 10:35 UTC (permalink / raw)
  To: Ravi Bangoria, swapnil.sapkal
  Cc: yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, bristot, adrian.hunter, james.clark, kan.liang,
	gautham.shenoy, kprateek.nayak, juri.lelli, yangjihong, void, tj,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, peterz, mingo, acme, namhyung, irogers



On 16/09/2024 17:47, Ravi Bangoria wrote:
> MOTIVATION
> ----------
> 
> Existing `perf sched` is quite exhaustive and provides lot of insights
> into scheduler behavior but it quickly becomes impractical to use for
> long running or scheduler intensive workload. For ex, `perf sched record`
> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
> generates huge 56G perf.data for which perf takes ~137 mins to prepare
> and write it to disk [1].
> 
> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
> and generates samples on a tracepoint hit, `perf sched stats record` takes
> snapshot of the /proc/schedstat file before and after the workload, i.e.
> there is almost zero interference on workload run. Also, it takes very
> minimal time to parse /proc/schedstat, convert it into perf samples and
> save those samples into perf.data file. Result perf.data file is much
> smaller. So, overall `perf sched stats record` is much more light weight
> compare to `perf sched record`.
> 
> We, internally at AMD, have been using this (a variant of this, known as
> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
> of any scheduler code changes[3][4].
> 
> Please note that, this is not a replacement of perf sched record/report.
> The intended users of the new tool are scheduler developers, not regular
> users.
> 
> USAGE
> -----
> 
>    # perf sched stats record
>    # perf sched stats report
> 
> Note: Although `perf sched stats` tool supports workload profiling syntax
> (i.e. -- <workload> ), the recorded profile is still systemwide since the
> /proc/schedstat is a systemwide file.
> 
> HOW TO INTERPRET THE REPORT
> ---------------------------
> 
> The `perf sched stats report` starts with total time profiling was active
> in terms of jiffies:
> 
>    ----------------------------------------------------------------------------------------------------
>    Time elapsed (in jiffies)                                   :       24537
>    ----------------------------------------------------------------------------------------------------
> 
> Next is CPU scheduling statistics. These are simple diffs of
> /proc/schedstat CPU lines along with description. The report also
> prints % relative to base stat.
> 
> In the example below, schedule() left the CPU0 idle 98.19% of the time.
> 16.54% of total try_to_wake_up() was to wakeup local CPU. And, the total
> waittime by tasks on CPU0 is 0.49% of the total runtime by tasks on the
> same CPU.
> 
>    ----------------------------------------------------------------------------------------------------
>    CPU 0
>    ----------------------------------------------------------------------------------------------------
>    sched_yield() count                                         :           0
>    Legacy counter can be ignored                               :           0
>    schedule() called                                           :       17138
>    schedule() left the processor idle                          :       16827 ( 98.19% )
>    try_to_wake_up() was called                                 :         508
>    try_to_wake_up() was called to wake up the local cpu        :          84 ( 16.54% )
>    total runtime by tasks on this processor (in jiffies)       :  2408959243
>    total waittime by tasks on this processor (in jiffies)      :    11731825 ( 0.49% )
>    total timeslices run on this cpu                            :         311
>    ----------------------------------------------------------------------------------------------------
> 
> Next is load balancing statistics. For each of the sched domains
> (eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
> the following three categories:
> 
>    1) Idle Load Balance: Load balancing performed on behalf of a long
>                          idling CPU by some other CPU.
>    2) Busy Load Balance: Load balancing performed when the CPU was busy.
>    3) New Idle Balance : Load balancing performed when a CPU just became
>                          idle.
> 
> Under each of these three categories, sched stats report provides
> different load balancing statistics. Along with direct stats, the
> report also contains derived metrics prefixed with *. Example:
> 
>    ----------------------------------------------------------------------------------------------------
>    CPU 0 DOMAIN SMT CPUS <0, 64>
>    ----------------------------------------- <Category idle> ------------------------------------------
>    load_balance() count on cpu idle                                 :          50   $      490.74 $
>    load_balance() found balanced on cpu idle                        :          42   $      584.21 $
>    load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
>    imbalance sum on cpu idle                                        :           8
>    pull_task() count on cpu idle                                    :           0
>    pull_task() when target task was cache-hot on cpu idle           :           0
>    load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
>    load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
>    *load_balance() success count on cpu idle                        :           0
>    *avg task pulled per successful lb attempt (cpu idle)            :        0.00
>    ----------------------------------------- <Category busy> ------------------------------------------
>    load_balance() count on cpu busy                                 :           2   $    12268.50 $
>    load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
>    load_balance() move task failed on cpu busy                      :           0   $        0.00 $
>    imbalance sum on cpu busy                                        :           0
>    pull_task() count on cpu busy                                    :           0
>    pull_task() when target task was cache-hot on cpu busy           :           0
>    load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
>    load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
>    *load_balance() success count on cpu busy                        :           0
>    *avg task pulled per successful lb attempt (cpu busy)            :        0.00
>    ---------------------------------------- <Category newidle> ----------------------------------------
>    load_balance() count on cpu newly idle                           :         427   $       57.46 $
>    load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
>    load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
>    imbalance sum on cpu newly idle                                  :          48
>    pull_task() count on cpu newly idle                              :           0
>    pull_task() when target task was cache-hot on cpu newly idle     :           0
>    load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
>    load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
>    *load_balance() success count on cpu newly idle                  :           0
>    *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
>    ----------------------------------------------------------------------------------------------------
> 
> Consider following line:
> 
>    load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $
> 
> While profiling was active, the load-balancer found 382 times the load
> needs to be balanced on a newly idle CPU 0. Following value encapsulated
> inside $ is average jiffies between two events (24537 / 382 = 64.23).

This explanation of the $ fields is quite buried. Is there a way of 
making it clearer with a column header in the report? I think even if it 
was documented in the man pages it might not be that useful.

There are also other jiffies fields that don't use $. Maybe if it was 
like this it could be semi self documenting:

----------------------------------------------------------------------
     Time elapsed (in jiffies)               :        $  24537        $ 
  ----------------------------------------------------------------------

------------------<Category newidle> ---------------------------------
   load_balance() count on cpu newly idle    :   427  $     57.46 avg $
----------------------------------------------------------------------


Other than that:

Tested-by: James Clark <james.clark@linaro.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] perf sched: Introduce stats tool
  2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
                   ` (5 preceding siblings ...)
  2024-09-17 10:35 ` [PATCH 0/5] perf sched: Introduce stats tool James Clark
@ 2024-09-17 10:57 ` Madadi Vineeth Reddy
  2024-09-18  8:43   ` Sapkal, Swapnil
  2024-09-18  8:45   ` Sapkal, Swapnil
  6 siblings, 2 replies; 17+ messages in thread
From: Madadi Vineeth Reddy @ 2024-09-17 10:57 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: irogers, namhyung, acme, peterz, swapnil.sapkal, yu.c.chen,
	mark.rutland, alexander.shishkin, jolsa, rostedt, vincent.guittot,
	bristot, adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	mingo, Madadi Vineeth Reddy

Hi Ravi,

On 16/09/24 22:17, Ravi Bangoria wrote:
> MOTIVATION
> ----------
> 
> Existing `perf sched` is quite exhaustive and provides lot of insights
> into scheduler behavior but it quickly becomes impractical to use for
> long running or scheduler intensive workload. For ex, `perf sched record`
> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
> generates huge 56G perf.data for which perf takes ~137 mins to prepare
> and write it to disk [1].
> 
> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
> and generates samples on a tracepoint hit, `perf sched stats record` takes
> snapshot of the /proc/schedstat file before and after the workload, i.e.
> there is almost zero interference on workload run. Also, it takes very
> minimal time to parse /proc/schedstat, convert it into perf samples and
> save those samples into perf.data file. Result perf.data file is much

perf.data file is empty after the record.

Error:
The perf.data data has no samples!

Thanks and Regards
Madadi Vineeth Reddy

> smaller. So, overall `perf sched stats record` is much more light weight
> compare to `perf sched record`.
> 
> We, internally at AMD, have been using this (a variant of this, known as
> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
> of any scheduler code changes[3][4].
> 
> Please note that, this is not a replacement of perf sched record/report.
> The intended users of the new tool are scheduler developers, not regular
> users.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] perf sched: Introduce stats tool
  2024-09-17 10:57 ` Madadi Vineeth Reddy
@ 2024-09-18  8:43   ` Sapkal, Swapnil
  2024-09-18  8:45   ` Sapkal, Swapnil
  1 sibling, 0 replies; 17+ messages in thread
From: Sapkal, Swapnil @ 2024-09-18  8:43 UTC (permalink / raw)
  To: 20240916164722.1838-1-ravi.bangoria, Ravi Bangoria
  Cc: irogers, namhyung, acme, peterz, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	mingo, Madadi Vineeth Reddy

Hi Vineeth,

Thank you for testing the series.

On 9/17/2024 4:27 PM, Madadi Vineeth Reddy wrote:
> Hi Ravi,
> 
> On 16/09/24 22:17, Ravi Bangoria wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides lot of insights
>> into scheduler behavior but it quickly becomes impractical to use for
>> long running or scheduler intensive workload. For ex, `perf sched record`
>> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
>> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
>> generates huge 56G perf.data for which perf takes ~137 mins to prepare
>> and write it to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
>> and generates samples on a tracepoint hit, `perf sched stats record` takes
>> snapshot of the /proc/schedstat file before and after the workload, i.e.
>> there is almost zero interference on workload run. Also, it takes very
>> minimal time to parse /proc/schedstat, convert it into perf samples and
>> save those samples into perf.data file. Result perf.data file is much
> 
> perf.data file is empty after the record.
> 
> Error:
> The perf.data data has no samples!

I am not able to reproduce this error on my system. Can you please share 
`/proc/schedstat` file from your system? What was the base kernel you 
applied this on?

--
Thanks And Regards,
Swapnil
> 
> Thanks and Regards
> Madadi Vineeth Reddy
> 
>> smaller. So, overall `perf sched stats record` is much more light weight
>> compare to `perf sched record`.
>>
>> We, internally at AMD, have been using this (a variant of this, known as
>> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
>> of any scheduler code changes[3][4].
>>
>> Please note that, this is not a replacement of perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] perf sched: Introduce stats tool
  2024-09-17 10:57 ` Madadi Vineeth Reddy
  2024-09-18  8:43   ` Sapkal, Swapnil
@ 2024-09-18  8:45   ` Sapkal, Swapnil
  1 sibling, 0 replies; 17+ messages in thread
From: Sapkal, Swapnil @ 2024-09-18  8:45 UTC (permalink / raw)
  To: 20240916164722.1838-1-ravi.bangoria, Ravi Bangoria
  Cc: irogers, namhyung, acme, peterz, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	mingo, Madadi Vineeth Reddy

Hi Vineeth,

Thank you for testing the series.

On 9/17/2024 4:27 PM, Madadi Vineeth Reddy wrote:
> Hi Ravi,
> 
> On 16/09/24 22:17, Ravi Bangoria wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides lot of insights
>> into scheduler behavior but it quickly becomes impractical to use for
>> long running or scheduler intensive workload. For ex, `perf sched record`
>> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
>> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
>> generates huge 56G perf.data for which perf takes ~137 mins to prepare
>> and write it to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
>> and generates samples on a tracepoint hit, `perf sched stats record` takes
>> snapshot of the /proc/schedstat file before and after the workload, i.e.
>> there is almost zero interference on workload run. Also, it takes very
>> minimal time to parse /proc/schedstat, convert it into perf samples and
>> save those samples into perf.data file. Result perf.data file is much
> 
> perf.data file is empty after the record.
> 
> Error:
> The perf.data data has no samples!

I am not able to reproduce this error on my system. Can you please share 
`/proc/schedstat` file from your system? What was the base kernel you 
applied this on?

--
Thanks And Regards,
Swapnil
> 
> Thanks and Regards
> Madadi Vineeth Reddy
> 
>> smaller. So, overall `perf sched stats record` is much more light weight
>> compare to `perf sched record`.
>>
>> We, internally at AMD, have been using this (a variant of this, known as
>> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
>> of any scheduler code changes[3][4].
>>
>> Please note that, this is not a replacement of perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] perf sched stats: Add record and rawdump support
  2024-09-17 10:35   ` James Clark
@ 2024-09-18  8:52     ` Sapkal, Swapnil
  0 siblings, 0 replies; 17+ messages in thread
From: Sapkal, Swapnil @ 2024-09-18  8:52 UTC (permalink / raw)
  To: James Clark, Ravi Bangoria
  Cc: yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, bristot, adrian.hunter, james.clark, kan.liang,
	gautham.shenoy, kprateek.nayak, juri.lelli, yangjihong, void, tj,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, peterz, mingo, acme, namhyung, irogers

Hi James,

Thanks for the review.

On 9/17/2024 4:05 PM, James Clark wrote:
> 
> 
> On 16/09/2024 17:47, Ravi Bangoria wrote:
>> From: Swapnil Sapkal <swapnil.sapkal@amd.com>
>>
>> Define new, perf tool only, sample types and their layouts. Add logic
>> to parse /proc/schedstat, convert it to perf sample format and save
>> samples to perf.data file with `perf sched stats record` command. Also
>> add logic to read perf.data file, interpret schedstat samples and
>> print rawdump of samples with `perf script -D`.
>>
>> Note that, /proc/schedstat file output is standardized with version
>> number. The patch supports v15 but older or newer version can be added
>> easily.
>>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> ---
> 
> [...]
> 
>> +int perf_event__synthesize_schedstat(const struct perf_tool *tool,
>> +                     perf_event__handler_t process,
>> +                     struct perf_cpu_map *user_requested_cpus)
>> +{
>> +    union perf_event *event = NULL;
>> +    size_t line_len = 0;
>> +    char *line = NULL;
>> +    char bf[BUFSIZ];
>> +    __u64 timestamp;
>> +    __u16 version;
>> +    struct io io;
>> +    int ret = -1;
>> +    int cpu = -1;
>> +    char ch;
>> +
>> +    io.fd = open("/proc/schedstat", O_RDONLY, 0);
> 
> Other parts of the tool use procfs__mountpoint() for /proc. Although it 
> can only be in one place so it doesn't actually make a difference for 
> this one. Probably worth it for consistency though.
> 
Sure, I will update this in the next version.

>> +    if (io.fd < 0) {
>> +        pr_err("Failed to open /proc/schedstat\n");
> 
> A hint about CONFIG_SCHEDSTAT would be useful here.

Sure, I will update.

--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] perf sched: Introduce stats tool
  2024-09-17 10:35 ` [PATCH 0/5] perf sched: Introduce stats tool James Clark
@ 2024-09-18 13:19   ` Sapkal, Swapnil
  0 siblings, 0 replies; 17+ messages in thread
From: Sapkal, Swapnil @ 2024-09-18 13:19 UTC (permalink / raw)
  To: James Clark, Ravi Bangoria
  Cc: yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, bristot, adrian.hunter, james.clark, kan.liang,
	gautham.shenoy, kprateek.nayak, juri.lelli, yangjihong, void, tj,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, peterz, mingo, acme, namhyung, irogers

Hi James,

On 9/17/2024 4:05 PM, James Clark wrote:
> 
> 
> On 16/09/2024 17:47, Ravi Bangoria wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides lot of insights
>> into scheduler behavior but it quickly becomes impractical to use for
>> long running or scheduler intensive workload. For ex, `perf sched record`
>> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
>> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
>> generates huge 56G perf.data for which perf takes ~137 mins to prepare
>> and write it to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
>> and generates samples on a tracepoint hit, `perf sched stats record` 
>> takes
>> snapshot of the /proc/schedstat file before and after the workload, i.e.
>> there is almost zero interference on workload run. Also, it takes very
>> minimal time to parse /proc/schedstat, convert it into perf samples and
>> save those samples into perf.data file. Result perf.data file is much
>> smaller. So, overall `perf sched stats record` is much more light weight
>> compare to `perf sched record`.
>>
>> We, internally at AMD, have been using this (a variant of this, known as
>> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
>> of any scheduler code changes[3][4].
>>
>> Please note that, this is not a replacement of perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
>>
>> USAGE
>> -----
>>
>>    # perf sched stats record
>>    # perf sched stats report
>>
>> Note: Although `perf sched stats` tool supports workload profiling syntax
>> (i.e. -- <workload> ), the recorded profile is still systemwide since the
>> /proc/schedstat is a systemwide file.
>>
>> HOW TO INTERPRET THE REPORT
>> ---------------------------
>>
>> The `perf sched stats report` starts with total time profiling was active
>> in terms of jiffies:
>>
>>    ----------------------------------------------------------------------------------------------------
>>    Time elapsed (in jiffies)                                   :       24537
>>    ----------------------------------------------------------------------------------------------------
>>
>> Next is CPU scheduling statistics. These are simple diffs of
>> /proc/schedstat CPU lines along with description. The report also
>> prints % relative to base stat.
>>
>> In the example below, schedule() left the CPU0 idle 98.19% of the time.
>> 16.54% of total try_to_wake_up() was to wakeup local CPU. And, the total
>> waittime by tasks on CPU0 is 0.49% of the total runtime by tasks on the
>> same CPU.
>>
>>    ----------------------------------------------------------------------------------------------------
>>    CPU 0
>>    ----------------------------------------------------------------------------------------------------
>>    sched_yield() count                                         :           0
>>    Legacy counter can be ignored                               :           0
>>    schedule() called                                           :       17138
>>    schedule() left the processor idle                          :       16827 ( 98.19% )
>>    try_to_wake_up() was called                                 :         508
>>    try_to_wake_up() was called to wake up the local cpu        :          84 ( 16.54% )
>>    total runtime by tasks on this processor (in jiffies)       :  2408959243
>>    total waittime by tasks on this processor (in jiffies)      :    11731825 ( 0.49% )
>>    total timeslices run on this cpu                            :         311
>>    ----------------------------------------------------------------------------------------------------
>>
>> Next is load balancing statistics. For each of the sched domains
>> (eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
>> the following three categories:
>>
>>    1) Idle Load Balance: Load balancing performed on behalf of a long
>>                          idling CPU by some other CPU.
>>    2) Busy Load Balance: Load balancing performed when the CPU was busy.
>>    3) New Idle Balance : Load balancing performed when a CPU just became
>>                          idle.
>>
>> Under each of these three categories, sched stats report provides
>> different load balancing statistics. Along with direct stats, the
>> report also contains derived metrics prefixed with *. Example:
>>
>>    ----------------------------------------------------------------------------------------------------
>>    CPU 0 DOMAIN SMT CPUS <0, 64>
>>    ----------------------------------------- <Category idle> ------------------------------------------
>>    load_balance() count on cpu idle                                 :          50   $      490.74 $
>>    load_balance() found balanced on cpu idle                        :          42   $      584.21 $
>>    load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
>>    imbalance sum on cpu idle                                        :           8
>>    pull_task() count on cpu idle                                    :           0
>>    pull_task() when target task was cache-hot on cpu idle           :           0
>>    load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
>>    load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
>>    *load_balance() success count on cpu idle                        :           0
>>    *avg task pulled per successful lb attempt (cpu idle)            :        0.00
>>    ----------------------------------------- <Category busy> ------------------------------------------
>>    load_balance() count on cpu busy                                 :           2   $    12268.50 $
>>    load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
>>    load_balance() move task failed on cpu busy                      :           0   $        0.00 $
>>    imbalance sum on cpu busy                                        :           0
>>    pull_task() count on cpu busy                                    :           0
>>    pull_task() when target task was cache-hot on cpu busy           :           0
>>    load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
>>    load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
>>    *load_balance() success count on cpu busy                        :           0
>>    *avg task pulled per successful lb attempt (cpu busy)            :        0.00
>>    ---------------------------------------- <Category newidle> ----------------------------------------
>>    load_balance() count on cpu newly idle                           :         427   $       57.46 $
>>    load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
>>    load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
>>    imbalance sum on cpu newly idle                                  :          48
>>    pull_task() count on cpu newly idle                              :           0
>>    pull_task() when target task was cache-hot on cpu newly idle     :           0
>>    load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
>>    load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
>>    *load_balance() success count on cpu newly idle                  :           0
>>    *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
>>    ----------------------------------------------------------------------------------------------------
>>
>> Consider following line:
>>
>>    load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $
>>
>> While profiling was active, the load balancer found, 382 times, that the
>> load needed to be balanced on the newly idle CPU 0. The value enclosed in
>> $ signs is the average number of jiffies between two such events
>> (24537 / 382 = 64.23).
> 
> This explanation of the $ fields is quite buried. Is there a way of 
> making it clearer with a column header in the report? I think even if it 
> was documented in the man pages it might not be that useful.
> 
Thank you for the suggestion. I will add a header in the report to 
explain what each column value represents.

> There are also other jiffies fields that don't use $. Maybe if it was 
> like this it could be semi self documenting:
> 
> ----------------------------------------------------------------------
>      Time elapsed (in jiffies)               :        $  24537        $ 
>   ----------------------------------------------------------------------
> 
> ------------------<Category newidle> ---------------------------------
>    load_balance() count on cpu newly idle    :   427  $     57.46 avg $
> ----------------------------------------------------------------------
> 
> 
> Other than that:
> 
> Tested-by: James Clark <james.clark@linaro.org>

Ack.

--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] perf sched stats: Add record and rawdump support
  2024-09-16 16:47 ` [PATCH 2/5] perf sched stats: Add record and rawdump support Ravi Bangoria
  2024-09-17 10:35   ` James Clark
@ 2024-09-26  6:12   ` Namhyung Kim
  2024-09-27 11:04     ` Sapkal, Swapnil
  1 sibling, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2024-09-26  6:12 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, mingo, acme, irogers, swapnil.sapkal, yu.c.chen,
	mark.rutland, alexander.shishkin, jolsa, rostedt, vincent.guittot,
	bristot, adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

On Mon, Sep 16, 2024 at 04:47:19PM +0000, Ravi Bangoria wrote:
> From: Swapnil Sapkal <swapnil.sapkal@amd.com>
> 
> Define new, perf tool only, sample types and their layouts. Add logic
> to parse /proc/schedstat, convert it to perf sample format and save
> samples to perf.data file with `perf sched stats record` command. Also
> add logic to read perf.data file, interpret schedstat samples and
> print rawdump of samples with `perf script -D`.
> 
> Note that, /proc/schedstat file output is standardized with version
> number. The patch supports v15 but older or newer version can be added
> easily.
> 
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>  tools/lib/perf/Documentation/libperf.txt      |   2 +
>  tools/lib/perf/Makefile                       |   2 +-
>  tools/lib/perf/include/perf/event.h           |  42 +++
>  .../lib/perf/include/perf/schedstat-cpu-v15.h |  13 +
>  .../perf/include/perf/schedstat-domain-v15.h  |  40 +++
>  tools/perf/builtin-inject.c                   |   2 +
>  tools/perf/builtin-sched.c                    | 222 +++++++++++++++-
>  tools/perf/util/event.c                       |  98 +++++++
>  tools/perf/util/event.h                       |   2 +
>  tools/perf/util/session.c                     |  20 ++
>  tools/perf/util/synthetic-events.c            | 249 ++++++++++++++++++
>  tools/perf/util/synthetic-events.h            |   3 +
>  tools/perf/util/tool.c                        |  20 ++
>  tools/perf/util/tool.h                        |   4 +-
>  14 files changed, 716 insertions(+), 3 deletions(-)
>  create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
>  create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h
> 
> diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
> index fcfb9499ef9c..39c78682ad2e 100644
> --- a/tools/lib/perf/Documentation/libperf.txt
> +++ b/tools/lib/perf/Documentation/libperf.txt
> @@ -211,6 +211,8 @@ SYNOPSIS
>    struct perf_record_time_conv;
>    struct perf_record_header_feature;
>    struct perf_record_compressed;
> +  struct perf_record_schedstat_cpu;
> +  struct perf_record_schedstat_domain;
>  --
>  
>  DESCRIPTION
> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
> index 3a9b2140aa04..ebbfea891a6a 100644
> --- a/tools/lib/perf/Makefile
> +++ b/tools/lib/perf/Makefile
> @@ -187,7 +187,7 @@ install_lib: libs
>  		$(call do_install_mkdir,$(libdir_SQ)); \
>  		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>  
> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
>  INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>  
>  INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
> index 37bb7771d914..35be296d68d5 100644
> --- a/tools/lib/perf/include/perf/event.h
> +++ b/tools/lib/perf/include/perf/event.h
> @@ -457,6 +457,44 @@ struct perf_record_compressed {
>  	char			 data[];
>  };
>  
> +struct perf_record_schedstat_cpu_v15 {
> +#define CPU_FIELD(_type, _name, _ver)		_type _name;
> +#include "schedstat-cpu-v15.h"
> +#undef CPU_FIELD
> +};
> +
> +struct perf_record_schedstat_cpu {
> +	struct perf_event_header header;
> +	__u16			 version;
> +	__u64			 timestamp;
> +	__u32			 cpu;

Can you change the layout to minimize the padding?  Probably better to
add an explicit field for the unused bits.


> +	union {
> +		struct perf_record_schedstat_cpu_v15 v15;
> +	};
> +};
> +
> +struct perf_record_schedstat_domain_v15 {
> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
> +#include "schedstat-domain-v15.h"
> +#undef DOMAIN_FIELD
> +};
> +
> +#define DOMAIN_NAME_LEN		16
> +
> +struct perf_record_schedstat_domain {
> +	struct perf_event_header header;
> +	__u16			 version;
> +	__u64			 timestamp;
> +	__u32			 cpu;
> +	__u16			 domain;

Ditto.

> +	char			 name[DOMAIN_NAME_LEN];
> +	union {
> +		struct perf_record_schedstat_domain_v15 v15;
> +	};
> +	__u16			 nr_cpus;
> +	__u8			 cpu_mask[];
> +};
> +
>  enum perf_user_event_type { /* above any possible kernel type */
>  	PERF_RECORD_USER_TYPE_START		= 64,
>  	PERF_RECORD_HEADER_ATTR			= 64,
> @@ -478,6 +516,8 @@ enum perf_user_event_type { /* above any possible kernel type */
>  	PERF_RECORD_HEADER_FEATURE		= 80,
>  	PERF_RECORD_COMPRESSED			= 81,
>  	PERF_RECORD_FINISHED_INIT		= 82,
> +	PERF_RECORD_SCHEDSTAT_CPU		= 83,
> +	PERF_RECORD_SCHEDSTAT_DOMAIN		= 84,
>  	PERF_RECORD_HEADER_MAX
>  };
>  
> @@ -518,6 +558,8 @@ union perf_event {
>  	struct perf_record_time_conv		time_conv;
>  	struct perf_record_header_feature	feat;
>  	struct perf_record_compressed		pack;
> +	struct perf_record_schedstat_cpu	schedstat_cpu;
> +	struct perf_record_schedstat_domain	schedstat_domain;
>  };
>  
>  #endif /* __LIBPERF_EVENT_H */
> diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v15.h b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
> new file mode 100644
> index 000000000000..8e4355ee3705
> --- /dev/null
> +++ b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifdef CPU_FIELD
> +CPU_FIELD(__u32, yld_count, v15)
> +CPU_FIELD(__u32, array_exp, v15)
> +CPU_FIELD(__u32, sched_count, v15)
> +CPU_FIELD(__u32, sched_goidle, v15)
> +CPU_FIELD(__u32, ttwu_count, v15)
> +CPU_FIELD(__u32, ttwu_local, v15)
> +CPU_FIELD(__u64, rq_cpu_time, v15)
> +CPU_FIELD(__u64, run_delay, v15)
> +CPU_FIELD(__u64, pcount, v15)
> +#endif

Can we have a single schedstat.h containing both CPU fields and domain
fields?  You might require users to define the macro always and get rid
of the ifdef condition here.

Also is there any macro magic to handle the version number?  I think you
can have the number only (15; without 'v') and compare with input if
needed..

Thanks,
Namhyung


> diff --git a/tools/lib/perf/include/perf/schedstat-domain-v15.h b/tools/lib/perf/include/perf/schedstat-domain-v15.h
> new file mode 100644
> index 000000000000..422e713d617a
> --- /dev/null
> +++ b/tools/lib/perf/include/perf/schedstat-domain-v15.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifdef DOMAIN_FIELD
> +DOMAIN_FIELD(__u32, idle_lb_count, v15)
> +DOMAIN_FIELD(__u32, idle_lb_balanced, v15)
> +DOMAIN_FIELD(__u32, idle_lb_failed, v15)
> +DOMAIN_FIELD(__u32, idle_lb_imbalance, v15)
> +DOMAIN_FIELD(__u32, idle_lb_gained, v15)
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15)
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15)
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15)
> +DOMAIN_FIELD(__u32, busy_lb_count, v15)
> +DOMAIN_FIELD(__u32, busy_lb_balanced, v15)
> +DOMAIN_FIELD(__u32, busy_lb_failed, v15)
> +DOMAIN_FIELD(__u32, busy_lb_imbalance, v15)
> +DOMAIN_FIELD(__u32, busy_lb_gained, v15)
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15)
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15)
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_count, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_failed, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_gained, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15)
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15)
> +DOMAIN_FIELD(__u32, alb_count, v15)
> +DOMAIN_FIELD(__u32, alb_failed, v15)
> +DOMAIN_FIELD(__u32, alb_pushed, v15)
> +DOMAIN_FIELD(__u32, sbe_count, v15)
> +DOMAIN_FIELD(__u32, sbe_balanced, v15)
> +DOMAIN_FIELD(__u32, sbe_pushed, v15)
> +DOMAIN_FIELD(__u32, sbf_count, v15)
> +DOMAIN_FIELD(__u32, sbf_balanced, v15)
> +DOMAIN_FIELD(__u32, sbf_pushed, v15)
> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v15)
> +DOMAIN_FIELD(__u32, ttwu_move_affine, v15)
> +DOMAIN_FIELD(__u32, ttwu_move_balance, v15)
> +#endif /* DOMAIN_FIELD */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] perf sched stats: Add schedstat v16 support
  2024-09-16 16:47 ` [PATCH 3/5] perf sched stats: Add schedstat v16 support Ravi Bangoria
@ 2024-09-26  6:14   ` Namhyung Kim
  2024-09-27 11:08     ` Sapkal, Swapnil
  0 siblings, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2024-09-26  6:14 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, mingo, acme, irogers, swapnil.sapkal, yu.c.chen,
	mark.rutland, alexander.shishkin, jolsa, rostedt, vincent.guittot,
	bristot, adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

On Mon, Sep 16, 2024 at 04:47:20PM +0000, Ravi Bangoria wrote:
> From: Swapnil Sapkal <swapnil.sapkal@amd.com>
> 
> /proc/schedstat file output is standardized with version number.
> Add support to record and raw dump v16 version layout.

How many differences are there between v15 and v16?  Can we have them in
the same file with a different version number?

Thanks,
Namhyung

> 
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>  tools/lib/perf/Makefile                       |  2 +-
>  tools/lib/perf/include/perf/event.h           | 14 +++++++
>  .../lib/perf/include/perf/schedstat-cpu-v16.h | 13 ++++++
>  .../perf/include/perf/schedstat-domain-v16.h  | 40 +++++++++++++++++++
>  tools/perf/util/event.c                       |  6 +++
>  tools/perf/util/synthetic-events.c            |  6 +++
>  6 files changed, 80 insertions(+), 1 deletion(-)
>  create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v16.h
>  create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v16.h
> 
> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
> index ebbfea891a6a..de0f4ffd9e16 100644
> --- a/tools/lib/perf/Makefile
> +++ b/tools/lib/perf/Makefile
> @@ -187,7 +187,7 @@ install_lib: libs
>  		$(call do_install_mkdir,$(libdir_SQ)); \
>  		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>  
> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h schedstat-cpu-v16.h schedstat-domain-v16.h
>  INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>  
>  INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
> index 35be296d68d5..c332d467c9c9 100644
> --- a/tools/lib/perf/include/perf/event.h
> +++ b/tools/lib/perf/include/perf/event.h
> @@ -463,6 +463,12 @@ struct perf_record_schedstat_cpu_v15 {
>  #undef CPU_FIELD
>  };
>  
> +struct perf_record_schedstat_cpu_v16 {
> +#define CPU_FIELD(_type, _name, _ver)		_type _name;
> +#include "schedstat-cpu-v16.h"
> +#undef CPU_FIELD
> +};
> +
>  struct perf_record_schedstat_cpu {
>  	struct perf_event_header header;
>  	__u16			 version;
> @@ -470,6 +476,7 @@ struct perf_record_schedstat_cpu {
>  	__u32			 cpu;
>  	union {
>  		struct perf_record_schedstat_cpu_v15 v15;
> +		struct perf_record_schedstat_cpu_v16 v16;
>  	};
>  };
>  
> @@ -479,6 +486,12 @@ struct perf_record_schedstat_domain_v15 {
>  #undef DOMAIN_FIELD
>  };
>  
> +struct perf_record_schedstat_domain_v16 {
> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
> +#include "schedstat-domain-v16.h"
> +#undef DOMAIN_FIELD
> +};
> +
>  #define DOMAIN_NAME_LEN		16
>  
>  struct perf_record_schedstat_domain {
> @@ -490,6 +503,7 @@ struct perf_record_schedstat_domain {
>  	char			 name[DOMAIN_NAME_LEN];
>  	union {
>  		struct perf_record_schedstat_domain_v15 v15;
> +		struct perf_record_schedstat_domain_v16 v16;
>  	};
>  	__u16			 nr_cpus;
>  	__u8			 cpu_mask[];
> diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v16.h b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
> new file mode 100644
> index 000000000000..f3a55131a05a
> --- /dev/null
> +++ b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifdef CPU_FIELD
> +CPU_FIELD(__u32, yld_count, v16)
> +CPU_FIELD(__u32, array_exp, v16)
> +CPU_FIELD(__u32, sched_count, v16)
> +CPU_FIELD(__u32, sched_goidle, v16)
> +CPU_FIELD(__u32, ttwu_count, v16)
> +CPU_FIELD(__u32, ttwu_local, v16)
> +CPU_FIELD(__u64, rq_cpu_time, v16)
> +CPU_FIELD(__u64, run_delay, v16)
> +CPU_FIELD(__u64, pcount, v16)
> +#endif /* CPU_FIELD */
> diff --git a/tools/lib/perf/include/perf/schedstat-domain-v16.h b/tools/lib/perf/include/perf/schedstat-domain-v16.h
> new file mode 100644
> index 000000000000..d6ef895c9d32
> --- /dev/null
> +++ b/tools/lib/perf/include/perf/schedstat-domain-v16.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifdef DOMAIN_FIELD
> +DOMAIN_FIELD(__u32, busy_lb_count, v16)
> +DOMAIN_FIELD(__u32, busy_lb_balanced, v16)
> +DOMAIN_FIELD(__u32, busy_lb_failed, v16)
> +DOMAIN_FIELD(__u32, busy_lb_imbalance, v16)
> +DOMAIN_FIELD(__u32, busy_lb_gained, v16)
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16)
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16)
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16)
> +DOMAIN_FIELD(__u32, idle_lb_count, v16)
> +DOMAIN_FIELD(__u32, idle_lb_balanced, v16)
> +DOMAIN_FIELD(__u32, idle_lb_failed, v16)
> +DOMAIN_FIELD(__u32, idle_lb_imbalance, v16)
> +DOMAIN_FIELD(__u32, idle_lb_gained, v16)
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16)
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16)
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_count, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_failed, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_gained, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16)
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16)
> +DOMAIN_FIELD(__u32, alb_count, v16)
> +DOMAIN_FIELD(__u32, alb_failed, v16)
> +DOMAIN_FIELD(__u32, alb_pushed, v16)
> +DOMAIN_FIELD(__u32, sbe_count, v16)
> +DOMAIN_FIELD(__u32, sbe_balanced, v16)
> +DOMAIN_FIELD(__u32, sbe_pushed, v16)
> +DOMAIN_FIELD(__u32, sbf_count, v16)
> +DOMAIN_FIELD(__u32, sbf_balanced, v16)
> +DOMAIN_FIELD(__u32, sbf_pushed, v16)
> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v16)
> +DOMAIN_FIELD(__u32, ttwu_move_affine, v16)
> +DOMAIN_FIELD(__u32, ttwu_move_balance, v16)
> +#endif /* DOMAIN_FIELD */
> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
> index c9bc8237e3fa..d138e4a5787c 100644
> --- a/tools/perf/util/event.c
> +++ b/tools/perf/util/event.c
> @@ -566,6 +566,9 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
>  	if (version == 15) {
>  #include <perf/schedstat-cpu-v15.h>
>  		return size;
> +	} else if (version == 16) {
> +#include <perf/schedstat-cpu-v16.h>
> +		return size;
>  	}
>  #undef CPU_FIELD
>  
> @@ -641,6 +644,9 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
>  	if (version == 15) {
>  #include <perf/schedstat-domain-v15.h>
>  		return size;
> +	} else if (version == 16) {
> +#include <perf/schedstat-domain-v16.h>
> +		return size;
>  	}
>  #undef DOMAIN_FIELD
>  
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index 9d8450b6eda9..73b2492a4cde 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -2546,6 +2546,8 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
>  
>  	if (version == 15) {
>  #include <perf/schedstat-cpu-v15.h>
> +	} else if (version == 16) {
> +#include <perf/schedstat-cpu-v16.h>
>  	}
>  #undef CPU_FIELD
>  
> @@ -2667,6 +2669,8 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  
>  	if (version == 15) {
>  #include <perf/schedstat-domain-v15.h>
> +	} else if (version == 16) {
> +#include <perf/schedstat-domain-v16.h>
>  	}
>  #undef DOMAIN_FIELD
>  
> @@ -2709,6 +2713,8 @@ int perf_event__synthesize_schedstat(const struct perf_tool *tool,
>  
>  	if (!strcmp(line, "version 15\n")) {
>  		version = 15;
> +	} else if (!strcmp(line, "version 16\n")) {
> +		version = 16;
>  	} else {
>  		pr_err("Unsupported /proc/schedstat version: %s", line + 8);
>  		goto out_free_line;
> -- 
> 2.46.0
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/5] perf sched stats: Add record and rawdump support
  2024-09-26  6:12   ` Namhyung Kim
@ 2024-09-27 11:04     ` Sapkal, Swapnil
  0 siblings, 0 replies; 17+ messages in thread
From: Sapkal, Swapnil @ 2024-09-27 11:04 UTC (permalink / raw)
  To: Namhyung Kim, Ravi Bangoria
  Cc: peterz, mingo, acme, irogers, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

Hello Namhyung,

Thank you for reviewing.

On 9/26/2024 11:42 AM, Namhyung Kim wrote:
> On Mon, Sep 16, 2024 at 04:47:19PM +0000, Ravi Bangoria wrote:
>> From: Swapnil Sapkal <swapnil.sapkal@amd.com>
>>
>> Define new, perf tool only, sample types and their layouts. Add logic
>> to parse /proc/schedstat, convert it to perf sample format and save
>> samples to perf.data file with `perf sched stats record` command. Also
>> add logic to read perf.data file, interpret schedstat samples and
>> print rawdump of samples with `perf script -D`.
>>
>> Note that, /proc/schedstat file output is standardized with version
>> number. The patch supports v15 but older or newer version can be added
>> easily.
>>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> ---
>>   tools/lib/perf/Documentation/libperf.txt      |   2 +
>>   tools/lib/perf/Makefile                       |   2 +-
>>   tools/lib/perf/include/perf/event.h           |  42 +++
>>   .../lib/perf/include/perf/schedstat-cpu-v15.h |  13 +
>>   .../perf/include/perf/schedstat-domain-v15.h  |  40 +++
>>   tools/perf/builtin-inject.c                   |   2 +
>>   tools/perf/builtin-sched.c                    | 222 +++++++++++++++-
>>   tools/perf/util/event.c                       |  98 +++++++
>>   tools/perf/util/event.h                       |   2 +
>>   tools/perf/util/session.c                     |  20 ++
>>   tools/perf/util/synthetic-events.c            | 249 ++++++++++++++++++
>>   tools/perf/util/synthetic-events.h            |   3 +
>>   tools/perf/util/tool.c                        |  20 ++
>>   tools/perf/util/tool.h                        |   4 +-
>>   14 files changed, 716 insertions(+), 3 deletions(-)
>>   create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
>>   create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h
>>
>> diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
>> index fcfb9499ef9c..39c78682ad2e 100644
>> --- a/tools/lib/perf/Documentation/libperf.txt
>> +++ b/tools/lib/perf/Documentation/libperf.txt
>> @@ -211,6 +211,8 @@ SYNOPSIS
>>     struct perf_record_time_conv;
>>     struct perf_record_header_feature;
>>     struct perf_record_compressed;
>> +  struct perf_record_schedstat_cpu;
>> +  struct perf_record_schedstat_domain;
>>   --
>>   
>>   DESCRIPTION
>> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
>> index 3a9b2140aa04..ebbfea891a6a 100644
>> --- a/tools/lib/perf/Makefile
>> +++ b/tools/lib/perf/Makefile
>> @@ -187,7 +187,7 @@ install_lib: libs
>>   		$(call do_install_mkdir,$(libdir_SQ)); \
>>   		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>>   
>> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
>> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
>>   INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>>   
>>   INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
>> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
>> index 37bb7771d914..35be296d68d5 100644
>> --- a/tools/lib/perf/include/perf/event.h
>> +++ b/tools/lib/perf/include/perf/event.h
>> @@ -457,6 +457,44 @@ struct perf_record_compressed {
>>   	char			 data[];
>>   };
>>   
>> +struct perf_record_schedstat_cpu_v15 {
>> +#define CPU_FIELD(_type, _name, _ver)		_type _name;
>> +#include "schedstat-cpu-v15.h"
>> +#undef CPU_FIELD
>> +};
>> +
>> +struct perf_record_schedstat_cpu {
>> +	struct perf_event_header header;
>> +	__u16			 version;
>> +	__u64			 timestamp;
>> +	__u32			 cpu;
> 
> Can you change the layout to minimize the padding?  Probably better to
> add an explicit field for the unused bits.
> 
Sure, I will change.
> 
>> +	union {
>> +		struct perf_record_schedstat_cpu_v15 v15;
>> +	};
>> +};
>> +
>> +struct perf_record_schedstat_domain_v15 {
>> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
>> +#include "schedstat-domain-v15.h"
>> +#undef DOMAIN_FIELD
>> +};
>> +
>> +#define DOMAIN_NAME_LEN		16
>> +
>> +struct perf_record_schedstat_domain {
>> +	struct perf_event_header header;
>> +	__u16			 version;
>> +	__u64			 timestamp;
>> +	__u32			 cpu;
>> +	__u16			 domain;
> 
> Ditto.

Ack.
> 
>> +	char			 name[DOMAIN_NAME_LEN];
>> +	union {
>> +		struct perf_record_schedstat_domain_v15 v15;
>> +	};
>> +	__u16			 nr_cpus;
>> +	__u8			 cpu_mask[];
>> +};
>> +
>>   enum perf_user_event_type { /* above any possible kernel type */
>>   	PERF_RECORD_USER_TYPE_START		= 64,
>>   	PERF_RECORD_HEADER_ATTR			= 64,
>> @@ -478,6 +516,8 @@ enum perf_user_event_type { /* above any possible kernel type */
>>   	PERF_RECORD_HEADER_FEATURE		= 80,
>>   	PERF_RECORD_COMPRESSED			= 81,
>>   	PERF_RECORD_FINISHED_INIT		= 82,
>> +	PERF_RECORD_SCHEDSTAT_CPU		= 83,
>> +	PERF_RECORD_SCHEDSTAT_DOMAIN		= 84,
>>   	PERF_RECORD_HEADER_MAX
>>   };
>>   
>> @@ -518,6 +558,8 @@ union perf_event {
>>   	struct perf_record_time_conv		time_conv;
>>   	struct perf_record_header_feature	feat;
>>   	struct perf_record_compressed		pack;
>> +	struct perf_record_schedstat_cpu	schedstat_cpu;
>> +	struct perf_record_schedstat_domain	schedstat_domain;
>>   };
>>   
>>   #endif /* __LIBPERF_EVENT_H */
>> diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v15.h b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
>> new file mode 100644
>> index 000000000000..8e4355ee3705
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
>> @@ -0,0 +1,13 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifdef CPU_FIELD
>> +CPU_FIELD(__u32, yld_count, v15)
>> +CPU_FIELD(__u32, array_exp, v15)
>> +CPU_FIELD(__u32, sched_count, v15)
>> +CPU_FIELD(__u32, sched_goidle, v15)
>> +CPU_FIELD(__u32, ttwu_count, v15)
>> +CPU_FIELD(__u32, ttwu_local, v15)
>> +CPU_FIELD(__u64, rq_cpu_time, v15)
>> +CPU_FIELD(__u64, run_delay, v15)
>> +CPU_FIELD(__u64, pcount, v15)
>> +#endif
> 
> Can we have a single schedstat.h containing both CPU fields and domain
> fields? 

Yes, I think it is possible to have a single schedstat.h for both CPU 
and domain fields. I will think more on this.


> You might require users to define the macro always and get rid
> of the ifdef condition here.
The later patches needed these ifdefs, so I kept them. If we combine
both CPU and domain fields, we will need them.

> Also is there any macro magic to handle the version number?  I think you
> can have the number only (15; without 'v') and compare with input if
> needed..
> 
I will think more on this; if it works out cleaner, I will update it in 
the next version.

> Thanks,
> Namhyung
> 
> 
>> diff --git a/tools/lib/perf/include/perf/schedstat-domain-v15.h b/tools/lib/perf/include/perf/schedstat-domain-v15.h
>> new file mode 100644
>> index 000000000000..422e713d617a
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/schedstat-domain-v15.h
>> @@ -0,0 +1,40 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifdef DOMAIN_FIELD
>> +DOMAIN_FIELD(__u32, idle_lb_count, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_balanced, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_failed, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_gained, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15)
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_count, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_balanced, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_failed, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_gained, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15)
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_count, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_failed, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_gained, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15)
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15)
>> +DOMAIN_FIELD(__u32, alb_count, v15)
>> +DOMAIN_FIELD(__u32, alb_failed, v15)
>> +DOMAIN_FIELD(__u32, alb_pushed, v15)
>> +DOMAIN_FIELD(__u32, sbe_count, v15)
>> +DOMAIN_FIELD(__u32, sbe_balanced, v15)
>> +DOMAIN_FIELD(__u32, sbe_pushed, v15)
>> +DOMAIN_FIELD(__u32, sbf_count, v15)
>> +DOMAIN_FIELD(__u32, sbf_balanced, v15)
>> +DOMAIN_FIELD(__u32, sbf_pushed, v15)
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v15)
>> +DOMAIN_FIELD(__u32, ttwu_move_affine, v15)
>> +DOMAIN_FIELD(__u32, ttwu_move_balance, v15)
>> +#endif /* DOMAIN_FIELD */
--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] perf sched stats: Add schedstat v16 support
  2024-09-26  6:14   ` Namhyung Kim
@ 2024-09-27 11:08     ` Sapkal, Swapnil
  0 siblings, 0 replies; 17+ messages in thread
From: Sapkal, Swapnil @ 2024-09-27 11:08 UTC (permalink / raw)
  To: Namhyung Kim, Ravi Bangoria
  Cc: peterz, mingo, acme, irogers, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot, bristot,
	adrian.hunter, james.clark, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

Hello Namhyung,

On 9/26/2024 11:44 AM, Namhyung Kim wrote:
> On Mon, Sep 16, 2024 at 04:47:20PM +0000, Ravi Bangoria wrote:
>> From: Swapnil Sapkal <swapnil.sapkal@amd.com>
>>
>> /proc/schedstat file output is standardized with version number.
>> Add support to record and raw dump v16 version layout.
> 
> How many differences are there between v15 and v16?  Can we have them in
> the same file with a different version number?

There is a difference in the ordering of the domain fields between v15 
and v16: the busy and idle load-balancing fields are interchanged. 
Furthermore, if new fields are added, the parser breaks, and maintaining 
a separate header file per version seemed cleaner.
It would be difficult to define the perf structs if the fields of both 
versions were present in the same header file.
If you have any suggestion on how to handle this, I am all ears.

--
Thanks and Regards,
Swapnil
> 
> Thanks,
> Namhyung
> 
>>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> ---
>>   tools/lib/perf/Makefile                       |  2 +-
>>   tools/lib/perf/include/perf/event.h           | 14 +++++++
>>   .../lib/perf/include/perf/schedstat-cpu-v16.h | 13 ++++++
>>   .../perf/include/perf/schedstat-domain-v16.h  | 40 +++++++++++++++++++
>>   tools/perf/util/event.c                       |  6 +++
>>   tools/perf/util/synthetic-events.c            |  6 +++
>>   6 files changed, 80 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v16.h
>>   create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v16.h
>>
>> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
>> index ebbfea891a6a..de0f4ffd9e16 100644
>> --- a/tools/lib/perf/Makefile
>> +++ b/tools/lib/perf/Makefile
>> @@ -187,7 +187,7 @@ install_lib: libs
>>   		$(call do_install_mkdir,$(libdir_SQ)); \
>>   		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>>   
>> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
>> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h schedstat-cpu-v16.h schedstat-domain-v16.h
>>   INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>>   
>>   INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
>> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
>> index 35be296d68d5..c332d467c9c9 100644
>> --- a/tools/lib/perf/include/perf/event.h
>> +++ b/tools/lib/perf/include/perf/event.h
>> @@ -463,6 +463,12 @@ struct perf_record_schedstat_cpu_v15 {
>>   #undef CPU_FIELD
>>   };
>>   
>> +struct perf_record_schedstat_cpu_v16 {
>> +#define CPU_FIELD(_type, _name, _ver)		_type _name;
>> +#include "schedstat-cpu-v16.h"
>> +#undef CPU_FIELD
>> +};
>> +
>>   struct perf_record_schedstat_cpu {
>>   	struct perf_event_header header;
>>   	__u16			 version;
>> @@ -470,6 +476,7 @@ struct perf_record_schedstat_cpu {
>>   	__u32			 cpu;
>>   	union {
>>   		struct perf_record_schedstat_cpu_v15 v15;
>> +		struct perf_record_schedstat_cpu_v16 v16;
>>   	};
>>   };
>>   
>> @@ -479,6 +486,12 @@ struct perf_record_schedstat_domain_v15 {
>>   #undef DOMAIN_FIELD
>>   };
>>   
>> +struct perf_record_schedstat_domain_v16 {
>> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name;
>> +#include "schedstat-domain-v16.h"
>> +#undef DOMAIN_FIELD
>> +};
>> +
>>   #define DOMAIN_NAME_LEN		16
>>   
>>   struct perf_record_schedstat_domain {
>> @@ -490,6 +503,7 @@ struct perf_record_schedstat_domain {
>>   	char			 name[DOMAIN_NAME_LEN];
>>   	union {
>>   		struct perf_record_schedstat_domain_v15 v15;
>> +		struct perf_record_schedstat_domain_v16 v16;
>>   	};
>>   	__u16			 nr_cpus;
>>   	__u8			 cpu_mask[];
>> diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v16.h b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
>> new file mode 100644
>> index 000000000000..f3a55131a05a
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/schedstat-cpu-v16.h
>> @@ -0,0 +1,13 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifdef CPU_FIELD
>> +CPU_FIELD(__u32, yld_count, v16)
>> +CPU_FIELD(__u32, array_exp, v16)
>> +CPU_FIELD(__u32, sched_count, v16)
>> +CPU_FIELD(__u32, sched_goidle, v16)
>> +CPU_FIELD(__u32, ttwu_count, v16)
>> +CPU_FIELD(__u32, ttwu_local, v16)
>> +CPU_FIELD(__u64, rq_cpu_time, v16)
>> +CPU_FIELD(__u64, run_delay, v16)
>> +CPU_FIELD(__u64, pcount, v16)
>> +#endif /* CPU_FIELD */
>> diff --git a/tools/lib/perf/include/perf/schedstat-domain-v16.h b/tools/lib/perf/include/perf/schedstat-domain-v16.h
>> new file mode 100644
>> index 000000000000..d6ef895c9d32
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/schedstat-domain-v16.h
>> @@ -0,0 +1,40 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifdef DOMAIN_FIELD
>> +DOMAIN_FIELD(__u32, busy_lb_count, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_balanced, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_failed, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_gained, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16)
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_count, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_balanced, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_failed, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_gained, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16)
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_count, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_failed, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_gained, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16)
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16)
>> +DOMAIN_FIELD(__u32, alb_count, v16)
>> +DOMAIN_FIELD(__u32, alb_failed, v16)
>> +DOMAIN_FIELD(__u32, alb_pushed, v16)
>> +DOMAIN_FIELD(__u32, sbe_count, v16)
>> +DOMAIN_FIELD(__u32, sbe_balanced, v16)
>> +DOMAIN_FIELD(__u32, sbe_pushed, v16)
>> +DOMAIN_FIELD(__u32, sbf_count, v16)
>> +DOMAIN_FIELD(__u32, sbf_balanced, v16)
>> +DOMAIN_FIELD(__u32, sbf_pushed, v16)
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v16)
>> +DOMAIN_FIELD(__u32, ttwu_move_affine, v16)
>> +DOMAIN_FIELD(__u32, ttwu_move_balance, v16)
>> +#endif /* DOMAIN_FIELD */
>> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
>> index c9bc8237e3fa..d138e4a5787c 100644
>> --- a/tools/perf/util/event.c
>> +++ b/tools/perf/util/event.c
>> @@ -566,6 +566,9 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
>>   	if (version == 15) {
>>   #include <perf/schedstat-cpu-v15.h>
>>   		return size;
>> +	} else if (version == 16) {
>> +#include <perf/schedstat-cpu-v16.h>
>> +		return size;
>>   	}
>>   #undef CPU_FIELD
>>   
>> @@ -641,6 +644,9 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
>>   	if (version == 15) {
>>   #include <perf/schedstat-domain-v15.h>
>>   		return size;
>> +	} else if (version == 16) {
>> +#include <perf/schedstat-domain-v16.h>
>> +		return size;
>>   	}
>>   #undef DOMAIN_FIELD
>>   
>> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
>> index 9d8450b6eda9..73b2492a4cde 100644
>> --- a/tools/perf/util/synthetic-events.c
>> +++ b/tools/perf/util/synthetic-events.c
>> @@ -2546,6 +2546,8 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
>>   
>>   	if (version == 15) {
>>   #include <perf/schedstat-cpu-v15.h>
>> +	} else if (version == 16) {
>> +#include <perf/schedstat-cpu-v16.h>
>>   	}
>>   #undef CPU_FIELD
>>   
>> @@ -2667,6 +2669,8 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>>   
>>   	if (version == 15) {
>>   #include <perf/schedstat-domain-v15.h>
>> +	} else if (version == 16) {
>> +#include <perf/schedstat-domain-v16.h>
>>   	}
>>   #undef DOMAIN_FIELD
>>   
>> @@ -2709,6 +2713,8 @@ int perf_event__synthesize_schedstat(const struct perf_tool *tool,
>>   
>>   	if (!strcmp(line, "version 15\n")) {
>>   		version = 15;
>> +	} else if (!strcmp(line, "version 16\n")) {
>> +		version = 16;
>>   	} else {
>>   		pr_err("Unsupported /proc/schedstat version: %s", line + 8);
>>   		goto out_free_line;
>> -- 
>> 2.46.0
>>


end of thread, other threads:[~2024-09-27 11:08 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-16 16:47 [PATCH 0/5] perf sched: Introduce stats tool Ravi Bangoria
2024-09-16 16:47 ` [PATCH 1/5] sched/stats: Print domain name in /proc/schedstat Ravi Bangoria
2024-09-16 16:47 ` [PATCH 2/5] perf sched stats: Add record and rawdump support Ravi Bangoria
2024-09-17 10:35   ` James Clark
2024-09-18  8:52     ` Sapkal, Swapnil
2024-09-26  6:12   ` Namhyung Kim
2024-09-27 11:04     ` Sapkal, Swapnil
2024-09-16 16:47 ` [PATCH 3/5] perf sched stats: Add schedstat v16 support Ravi Bangoria
2024-09-26  6:14   ` Namhyung Kim
2024-09-27 11:08     ` Sapkal, Swapnil
2024-09-16 16:47 ` [PATCH 4/5] perf sched stats: Add support for report subcommand Ravi Bangoria
2024-09-16 16:47 ` [PATCH 5/5] perf sched stats: Add support for live mode Ravi Bangoria
2024-09-17 10:35 ` [PATCH 0/5] perf sched: Introduce stats tool James Clark
2024-09-18 13:19   ` Sapkal, Swapnil
2024-09-17 10:57 ` Madadi Vineeth Reddy
2024-09-18  8:43   ` Sapkal, Swapnil
2024-09-18  8:45   ` Sapkal, Swapnil
