linux-perf-users.vger.kernel.org archive mirror
* [PATCH v3 0/8] perf sched: Introduce stats tool
@ 2025-03-11 12:02 Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

MOTIVATION
----------

The existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long-running or scheduler-intensive workloads. For example, `perf sched record`
has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
generates a huge 56G perf.data for which perf takes ~137 mins to prepare
and write it to disk [1].

Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
and generates samples on each tracepoint hit, `perf sched stats record` takes
a snapshot of the /proc/schedstat file before and after the workload, i.e.
there is almost zero interference with the workload run. Also, it takes very
minimal time to parse /proc/schedstat, convert it into perf samples and
save those samples into the perf.data file. The resulting perf.data file is
much smaller. So, overall, `perf sched stats record` is much more lightweight
compared to `perf sched record`.

We, internally at AMD, have been using this (a variant of this, known as
"sched-scoreboard"[2]) and found it to be very useful for analysing the
impact of scheduler code changes[3][4]. Prateek used v2[5] of this patch
series to report his analysis[6][7].

Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

  # perf sched stats record
  # perf sched stats report
  # perf sched stats diff

Note: Although the `perf sched stats` tool supports the workload profiling
syntax (i.e. -- <workload> ), the recorded profile is still systemwide since
/proc/schedstat is a systemwide file.

HOW TO INTERPRET THE REPORT
---------------------------

The `perf sched stats report` output starts with a description of the
columns present in the report. These column names are given before the CPU
and domain stats to improve the readability of the report.

  ----------------------------------------------------------------------------------------------------
  DESC                    -> Description of the field
  COUNT                   -> Value of the field
  PCT_CHANGE              -> Percent change with corresponding base value
  AVG_JIFFIES             -> Avg time in jiffies between two consecutive occurrences of the event
  ----------------------------------------------------------------------------------------------------

Next is the total profiling time in terms of jiffies:

  ----------------------------------------------------------------------------------------------------
  Time elapsed (in jiffies)                                   :       24537
  ----------------------------------------------------------------------------------------------------

Next come the CPU scheduling statistics. These are simple diffs of the
/proc/schedstat CPU lines along with their descriptions. The report also
prints percentages relative to the base stat.

In the example below, schedule() left CPU0 idle 98.19% of the time.
16.54% of all try_to_wake_up() calls were to wake up the local CPU. And
the total waittime of tasks on CPU0 is 0.49% of the total runtime of
tasks on the same CPU.

  ----------------------------------------------------------------------------------------------------
  CPU 0
  ----------------------------------------------------------------------------------------------------
  DESC                                                                COUNT  PCT_CHANGE
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                         :           0
  Legacy counter can be ignored                               :           0
  schedule() called                                           :       17138
  schedule() left the processor idle                          :       16827  (  98.19% )
  try_to_wake_up() was called                                 :         508
  try_to_wake_up() was called to wake up the local cpu        :          84  (  16.54% )
  total runtime by tasks on this processor (in jiffies)       :  2408959243
  total waittime by tasks on this processor (in jiffies)      :    11731825  (  0.49% )
  total timeslices run on this cpu                            :         311
  ----------------------------------------------------------------------------------------------------
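The percentages in parentheses can be reproduced directly from the raw
counters in the sample report above; this is just illustrative arithmetic,
not part of the tool:

```shell
# Recompute the PCT_CHANGE values from the sample CPU 0 report:
awk 'BEGIN {
	printf "idle:  %.2f%%\n", 16827 / 17138 * 100;         # schedule() left idle / schedule() called
	printf "local: %.2f%%\n", 84 / 508 * 100;              # local wakeups / ttwu count
	printf "wait:  %.2f%%\n", 11731825 / 2408959243 * 100; # waittime / runtime
}'
```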

Next are the load balancing statistics. For each of the sched domains
(eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
the following three categories:

  1) Idle Load Balance: Load balancing performed on behalf of a long
                        idling CPU by some other CPU.
  2) Busy Load Balance: Load balancing performed when the CPU was busy.
  3) New Idle Balance : Load balancing performed when a CPU just became
                        idle.

Under each of these three categories, the sched stats report provides
various load balancing statistics. Along with the direct stats, the
report also contains derived metrics prefixed with *. Example:

  ----------------------------------------------------------------------------------------------------
  CPU 0 DOMAIN SMT CPUS <0, 64>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                     COUNT     AVG_JIFFIES
  ----------------------------------------- <Category idle> ------------------------------------------
  load_balance() count on cpu idle                                 :          50   $      490.74 $
  load_balance() found balanced on cpu idle                        :          42   $      584.21 $
  load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
  imbalance sum on cpu idle                                        :           8
  pull_task() count on cpu idle                                    :           0
  pull_task() when target task was cache-hot on cpu idle           :           0
  load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
  *load_balance() success count on cpu idle                        :           0
  *avg task pulled per successful lb attempt (cpu idle)            :        0.00
  ----------------------------------------- <Category busy> ------------------------------------------
  load_balance() count on cpu busy                                 :           2   $    12268.50 $
  load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
  load_balance() move task failed on cpu busy                      :           0   $        0.00 $
  imbalance sum on cpu busy                                        :           0
  pull_task() count on cpu busy                                    :           0
  pull_task() when target task was cache-hot on cpu busy           :           0
  load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
  *load_balance() success count on cpu busy                        :           0
  *avg task pulled per successful lb attempt (cpu busy)            :        0.00
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         427   $       57.46 $
  load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
  load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
  imbalance sum on cpu newly idle                                  :          48
  pull_task() count on cpu newly idle                              :           0
  pull_task() when target task was cache-hot on cpu newly idle     :           0
  load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
  *load_balance() success count on cpu newly idle                  :           0
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
  ----------------------------------------------------------------------------------------------------
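The derivation of the starred metrics is not spelled out in the report
itself; from the sample numbers they appear to follow the formulas below.
This is my reading of the output, shown only as a sketch, not an
authoritative formula:

```shell
# Assumed derivations (inferred from the sample output above):
#   *success count   = lb count - found balanced - move task failed
#   *avg task pulled = pull_task() count / success count (0 if none)
awk 'BEGIN {
	success = 50 - 42 - 8;                    # <Category idle> above
	pulled  = 0;                              # pull_task() count on cpu idle
	avg     = (success > 0) ? pulled / success : 0;
	printf "%d %.2f\n", success, avg;         # matches the two * lines
}'
```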

Consider the following line:

  load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $

While profiling was active, the load balancer found 382 times that the
load needed to be balanced on the newly idle CPU 0. The following value
enclosed in $ is the average number of jiffies between two such events
(24537 / 382 = 64.23).
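The same arithmetic applies to every $-enclosed value: total elapsed
jiffies divided by the event count. For the line above:

```shell
# AVG_JIFFIES = time elapsed (24537 jiffies) / event count (382):
awk 'BEGIN { printf "%.2f\n", 24537 / 382 }'
```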

Next are the active_load_balance() stats. alb did not trigger while
profiling was active, hence they are all zeros.

  --------------------------------- <Category active_load_balance()> ---------------------------------
  active_load_balance() count                                      :           0
  active_load_balance() move task failed                           :           0
  active_load_balance() successfully moved a task                  :           0
  ----------------------------------------------------------------------------------------------------

Next are the sched_balance_exec() and sched_balance_fork() stats. They are
not used, but we have kept them since the RFC for legacy purposes. Unless
opposed, we plan to remove them in the next revision.

Next are wakeup statistics. For every domain, the report also shows
task-wakeup statistics. Example:

  ------------------------------------------- <Wakeup Info> ------------------------------------------
  try_to_wake_up() awoke a task that last ran on a diff cpu       :       12070
  try_to_wake_up() moved task because cache-cold on own cpu       :        3324
  try_to_wake_up() started passive balancing                      :           0
  ----------------------------------------------------------------------------------------------------

The same set of stats is reported for each CPU and each domain level.

HOW TO INTERPRET THE DIFF
-------------------------

`perf sched stats diff` also starts by explaining the columns present in
the diff. It then shows the difference in elapsed time in terms of
jiffies. The order of the values depends on the order of the input data
files. Example:

  ----------------------------------------------------------------------------------------------------
  Time elapsed (in jiffies)                                        :        2009,       2001
  ----------------------------------------------------------------------------------------------------

Below is a sample showing the difference in CPU and domain stats between
two runs. The third column, with values enclosed in `|...|`, shows the
percent change between the two. The second and fourth columns show the
side-by-side representations of the corresponding fields from `perf sched
stats report`.

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2  PCT_CHANGE  PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |    0.00% |
  Legacy counter can be ignored                                    :           0,          0  |    0.00% |
  schedule() called                                                :      442939,     447305  |    0.99% |
  schedule() left the processor idle                               :      154012,     174657  |   13.40% |  (   34.77,      39.05 )
  try_to_wake_up() was called                                      :      306810,     258076  |  -15.88% |
  try_to_wake_up() was called to wake up the local cpu             :       21313,      14130  |  -33.70% |  (    6.95,       5.48 )
  total runtime by tasks on this processor (in jiffies)            :  6235330010, 5463133934  |  -12.38% |
  total waittime by tasks on this processor (in jiffies)           :  8349785693, 5755097654  |  -31.07% |  (  133.91,     105.34 )
  total timeslices run on this cpu                                 :      288869,     272599  |   -5.63% |
  ----------------------------------------------------------------------------------------------------
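The |...| column follows the usual relative-change formula,
(COUNT2 - COUNT1) / COUNT1 * 100. Checking two rows from the sample above
(illustrative arithmetic only):

```shell
awk 'BEGIN {
	printf "%.2f\n", (447305 - 442939) / 442939 * 100; # schedule() called
	printf "%.2f\n", (258076 - 306810) / 306810 * 100; # try_to_wake_up() was called
}'
```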

Below is the sample of domain stats diff:

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>, DOMAIN SMT CPUS <0, 64>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2  PCT_CHANGE     AVG_JIFFIES1  AVG_JIFFIES2
  ----------------------------------------- <Category busy> ------------------------------------------
  load_balance() count on cpu busy                                 :         154,         80  |  -48.05% |  $       13.05,       25.01 $
  load_balance() found balanced on cpu busy                        :         120,         66  |  -45.00% |  $       16.74,       30.32 $
  load_balance() move task failed on cpu busy                      :           0,          4  |    0.00% |  $        0.00,      500.25 $
  imbalance sum on cpu busy                                        :        1640,        299  |  -81.77% |
  pull_task() count on cpu busy                                    :          55,         18  |  -67.27% |
  pull_task() when target task was cache-hot on cpu busy           :           0,          0  |    0.00% |
  load_balance() failed to find busier queue on cpu busy           :           0,          0  |    0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu busy           :         120,         66  |  -45.00% |  $       16.74,       30.32 $
  *load_balance() success count on cpu busy                        :          34,         10  |  -70.59% |
  *avg task pulled per successful lb attempt (cpu busy)            :        1.62,       1.80  |   11.27% |
  ----------------------------------------- <Category idle> ------------------------------------------
  load_balance() count on cpu idle                                 :         299,        481  |   60.87% |  $        6.72,        4.16 $
  load_balance() found balanced on cpu idle                        :         197,        331  |   68.02% |  $       10.20,        6.05 $
  load_balance() move task failed on cpu idle                      :           1,          2  |  100.00% |  $     2009.00,     1000.50 $
  imbalance sum on cpu idle                                        :         145,        222  |   53.10% |
  pull_task() count on cpu idle                                    :         133,        199  |   49.62% |
  pull_task() when target task was cache-hot on cpu idle           :           0,          0  |    0.00% |
  load_balance() failed to find busier queue on cpu idle           :           0,          0  |    0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu idle           :         197,        331  |   68.02% |  $       10.20,        6.05 $
  *load_balance() success count on cpu idle                        :         101,        148  |   46.53% |
  *avg task pulled per successful lb attempt (cpu idle)            :        1.32,       1.34  |    2.11% |
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :       21791,      15976  |  -26.69% |  $        0.09,        0.13 $
  load_balance() found balanced on cpu newly idle                  :       16226,      12125  |  -25.27% |  $        0.12,        0.17 $
  load_balance() move task failed on cpu newly idle                :         236,         88  |  -62.71% |  $        8.51,       22.74 $
  imbalance sum on cpu newly idle                                  :        6655,       4628  |  -30.46% |
  pull_task() count on cpu newly idle                              :        5329,       3763  |  -29.39% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |    0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |    0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :       12649,       9914  |  -21.62% |  $        0.16,        0.20 $
  *load_balance() success count on cpu newly idle                  :        5329,       3763  |  -29.39% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       1.00  |    0.00% |
  --------------------------------- <Category active_load_balance()> ---------------------------------
  active_load_balance() count                                      :           0,          0  |    0.00% |
  active_load_balance() move task failed                           :           0,          0  |    0.00% |
  active_load_balance() successfully moved a task                  :           0,          0  |    0.00% |
  --------------------------------- <Category sched_balance_exec()> ----------------------------------
  sbe_count is not used                                            :           0,          0  |    0.00% |
  sbe_balanced is not used                                         :           0,          0  |    0.00% |
  sbe_pushed is not used                                           :           0,          0  |    0.00% |
  --------------------------------- <Category sched_balance_fork()> ----------------------------------
  sbf_count is not used                                            :           0,          0  |    0.00% |
  sbf_balanced is not used                                         :           0,          0  |    0.00% |
  sbf_pushed is not used                                           :           0,          0  |    0.00% |
  ------------------------------------------ <Wakeup Info> -------------------------------------------
  try_to_wake_up() awoke a task that last ran on a diff cpu        :       16606,      10214  |  -38.49% |
  try_to_wake_up() moved task because cache-cold on own cpu        :        3184,       2534  |  -20.41% |
  try_to_wake_up() started passive balancing                       :           0,          0  |    0.00% |
  ----------------------------------------------------------------------------------------------------

v2: https://lore.kernel.org/all/20241122084452.1064968-1-swapnil.sapkal@amd.com/
v2->v3:
 - Add perf unit test for basic sched stats functionalities
 - Describe the new tool, its usage and the interpretation of report data
   in the perf-sched man page.
 - Add /proc/schedstat version 17 support.

v1: https://lore.kernel.org/lkml/20240916164722.1838-1-ravi.bangoria@amd.com
v1->v2:
 - Add the support for `perf sched stats diff`
 - Add column header in report for better readability. Use
   procfs__mountpoint for consistency. Add hint for enabling
   CONFIG_SCHEDSTAT if disabled. [James Clark]
 - Use a single header file for both cpu and domain fields. Change
   the layout of structs to minimise the padding. I tried changing
   `v15` to `15` in the header files but it was not giving any
   benefit, so I dropped the idea. [Namhyung Kim]
 - Add tested-by.

RFC: https://lore.kernel.org/r/20240508060427.417-1-ravi.bangoria@amd.com
RFC->v1:
 - [Kernel] Print domain name along with domain number in /proc/schedstat
   file.
 - s/schedstat/stats/ for the subcommand.
 - Record domain name and cpumask details, also show them in report.
 - Add CPU filtering capability at record and report time.
 - Add /proc/schedstat v16 support.
 - Live mode support. Similar to perf stat command, live mode prints the
   sched stats on the stdout.
 - Add pager support in `perf sched stats report` for better scrolling.
 - Some minor cosmetic changes in report output to improve readability.
 - Rebase to latest perf-tools-next/perf-tools-next (1de5b5dcb835).

TODO:
 - perf sched stats records /proc/schedstat, which contains CPU- and
   domain-level scheduler statistics. We are planning to add a taskstat
   tool which reads task stats from procfs and generates a scheduler
   statistics report at task granularity. This will probably be a
   standalone tool, something like `perf sched taskstat record/report`.
 - Except for pre-processor related checkpatch warnings, we have addressed
   most of the other possible warnings.

Patches are prepared on v6.14-rc6 (80e54e84911a).

[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amd.com/
[4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/
[5] https://lore.kernel.org/all/20241122084452.1064968-1-swapnil.sapkal@amd.com/
[6] https://lore.kernel.org/lkml/3170d16e-eb67-4db8-a327-eb8188397fdb@amd.com/
[7] https://lore.kernel.org/lkml/feb31b6e-6457-454c-a4f3-ce8ad96bf8de@amd.com/

Swapnil Sapkal (8):
  perf sched stats: Add record and rawdump support
  perf sched stats: Add schedstat v16 support
  perf sched stats: Add schedstat v17 support
  perf sched stats: Add support for report subcommand
  perf sched stats: Add support for live mode
  perf sched stats: Add support for diff subcommand
  perf sched stats: Add basic perf sched stats test
  perf sched stats: Add details in man page

 tools/lib/perf/Documentation/libperf.txt    |   2 +
 tools/lib/perf/Makefile                     |   2 +-
 tools/lib/perf/include/perf/event.h         |  70 ++
 tools/lib/perf/include/perf/schedstat-v15.h | 142 +++
 tools/lib/perf/include/perf/schedstat-v16.h | 142 +++
 tools/lib/perf/include/perf/schedstat-v17.h | 160 ++++
 tools/perf/Documentation/perf-sched.txt     | 243 ++++-
 tools/perf/builtin-inject.c                 |   2 +
 tools/perf/builtin-sched.c                  | 978 +++++++++++++++++++-
 tools/perf/tests/shell/perf_sched_stats.sh  |  64 ++
 tools/perf/util/event.c                     | 110 +++
 tools/perf/util/event.h                     |   2 +
 tools/perf/util/session.c                   |  20 +
 tools/perf/util/synthetic-events.c          | 260 ++++++
 tools/perf/util/synthetic-events.h          |   3 +
 tools/perf/util/tool.c                      |  20 +
 tools/perf/util/tool.h                      |   4 +-
 17 files changed, 2220 insertions(+), 4 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-v16.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-v17.h
 create mode 100755 tools/perf/tests/shell/perf_sched_stats.sh

-- 
2.43.0



* [PATCH v3 1/8] perf sched stats: Add record and rawdump support
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-11 13:10   ` Markus Elfring
                     ` (2 more replies)
  2025-03-11 12:02 ` [PATCH v3 2/8] perf sched stats: Add schedstat v16 support Swapnil Sapkal
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	James Clark

Define new, perf-tool-only, sample types and their layouts. Add logic
to parse /proc/schedstat, convert it to the perf sample format and save
the samples to a perf.data file with the `perf sched stats record`
command. Also add logic to read the perf.data file, interpret the
schedstat samples and print a rawdump of the samples with `perf script -D`.

Note that the /proc/schedstat file output is versioned. The patch
supports v15, but older or newer versions can be added easily.
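Since the layout is selected by the version line, the first line of
/proc/schedstat tells which format the running kernel exposes. A quick
check (the piped sample input is illustrative; on a live system read
/proc/schedstat directly):

```shell
# Extract the schedstat version from the first line ("version N"):
printf 'version 15\ntimestamp 4300445966\n' |
	awk 'NR == 1 && $1 == "version" { print $2 }'
```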

Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/lib/perf/Documentation/libperf.txt    |   2 +
 tools/lib/perf/Makefile                     |   2 +-
 tools/lib/perf/include/perf/event.h         |  42 ++++
 tools/lib/perf/include/perf/schedstat-v15.h |  52 +++++
 tools/perf/builtin-inject.c                 |   2 +
 tools/perf/builtin-sched.c                  | 226 +++++++++++++++++-
 tools/perf/util/event.c                     |  98 ++++++++
 tools/perf/util/event.h                     |   2 +
 tools/perf/util/session.c                   |  20 ++
 tools/perf/util/synthetic-events.c          | 239 ++++++++++++++++++++
 tools/perf/util/synthetic-events.h          |   3 +
 tools/perf/util/tool.c                      |  20 ++
 tools/perf/util/tool.h                      |   4 +-
 13 files changed, 709 insertions(+), 3 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-v15.h

diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
index 59aabdd3cabf..3f295639903d 100644
--- a/tools/lib/perf/Documentation/libperf.txt
+++ b/tools/lib/perf/Documentation/libperf.txt
@@ -210,6 +210,8 @@ SYNOPSIS
   struct perf_record_time_conv;
   struct perf_record_header_feature;
   struct perf_record_compressed;
+  struct perf_record_schedstat_cpu;
+  struct perf_record_schedstat_domain;
 --
 
 DESCRIPTION
diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
index e9a7ac2c062e..4b60804aa0b6 100644
--- a/tools/lib/perf/Makefile
+++ b/tools/lib/perf/Makefile
@@ -174,7 +174,7 @@ install_lib: libs
 		$(call do_install_mkdir,$(libdir_SQ)); \
 		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
 
-HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
+HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h
 INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
 
 INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 37bb7771d914..189106874063 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -457,6 +457,44 @@ struct perf_record_compressed {
 	char			 data[];
 };
 
+struct perf_record_schedstat_cpu_v15 {
+#define CPU_FIELD(_type, _name, _ver)		_type _name
+#include "schedstat-v15.h"
+#undef CPU_FIELD
+};
+
+struct perf_record_schedstat_cpu {
+	struct perf_event_header header;
+	__u64			 timestamp;
+	union {
+		struct perf_record_schedstat_cpu_v15 v15;
+	};
+	__u32			 cpu;
+	__u16			 version;
+};
+
+struct perf_record_schedstat_domain_v15 {
+#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
+#include "schedstat-v15.h"
+#undef DOMAIN_FIELD
+};
+
+#define DOMAIN_NAME_LEN		16
+
+struct perf_record_schedstat_domain {
+	struct perf_event_header header;
+	__u16			 version;
+	__u64			 timestamp;
+	__u32			 cpu;
+	__u16			 domain;
+	char			 name[DOMAIN_NAME_LEN];
+	union {
+		struct perf_record_schedstat_domain_v15 v15;
+	};
+	__u16			 nr_cpus;
+	__u8			 cpu_mask[];
+};
+
 enum perf_user_event_type { /* above any possible kernel type */
 	PERF_RECORD_USER_TYPE_START		= 64,
 	PERF_RECORD_HEADER_ATTR			= 64,
@@ -478,6 +516,8 @@ enum perf_user_event_type { /* above any possible kernel type */
 	PERF_RECORD_HEADER_FEATURE		= 80,
 	PERF_RECORD_COMPRESSED			= 81,
 	PERF_RECORD_FINISHED_INIT		= 82,
+	PERF_RECORD_SCHEDSTAT_CPU		= 83,
+	PERF_RECORD_SCHEDSTAT_DOMAIN		= 84,
 	PERF_RECORD_HEADER_MAX
 };
 
@@ -518,6 +558,8 @@ union perf_event {
 	struct perf_record_time_conv		time_conv;
 	struct perf_record_header_feature	feat;
 	struct perf_record_compressed		pack;
+	struct perf_record_schedstat_cpu	schedstat_cpu;
+	struct perf_record_schedstat_domain	schedstat_domain;
 };
 
 #endif /* __LIBPERF_EVENT_H */
diff --git a/tools/lib/perf/include/perf/schedstat-v15.h b/tools/lib/perf/include/perf/schedstat-v15.h
new file mode 100644
index 000000000000..43f8060c5337
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-v15.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CPU_FIELD
+CPU_FIELD(__u32, yld_count, v15);
+CPU_FIELD(__u32, array_exp, v15);
+CPU_FIELD(__u32, sched_count, v15);
+CPU_FIELD(__u32, sched_goidle, v15);
+CPU_FIELD(__u32, ttwu_count, v15);
+CPU_FIELD(__u32, ttwu_local, v15);
+CPU_FIELD(__u64, rq_cpu_time, v15);
+CPU_FIELD(__u64, run_delay, v15);
+CPU_FIELD(__u64, pcount, v15);
+#endif
+
+#ifdef DOMAIN_FIELD
+DOMAIN_FIELD(__u32, idle_lb_count, v15);
+DOMAIN_FIELD(__u32, idle_lb_balanced, v15);
+DOMAIN_FIELD(__u32, idle_lb_failed, v15);
+DOMAIN_FIELD(__u32, idle_lb_imbalance, v15);
+DOMAIN_FIELD(__u32, idle_lb_gained, v15);
+DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15);
+DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15);
+DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15);
+DOMAIN_FIELD(__u32, busy_lb_count, v15);
+DOMAIN_FIELD(__u32, busy_lb_balanced, v15);
+DOMAIN_FIELD(__u32, busy_lb_failed, v15);
+DOMAIN_FIELD(__u32, busy_lb_imbalance, v15);
+DOMAIN_FIELD(__u32, busy_lb_gained, v15);
+DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15);
+DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15);
+DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15);
+DOMAIN_FIELD(__u32, newidle_lb_count, v15);
+DOMAIN_FIELD(__u32, newidle_lb_balanced, v15);
+DOMAIN_FIELD(__u32, newidle_lb_failed, v15);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15);
+DOMAIN_FIELD(__u32, newidle_lb_gained, v15);
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15);
+DOMAIN_FIELD(__u32, alb_count, v15);
+DOMAIN_FIELD(__u32, alb_failed, v15);
+DOMAIN_FIELD(__u32, alb_pushed, v15);
+DOMAIN_FIELD(__u32, sbe_count, v15);
+DOMAIN_FIELD(__u32, sbe_balanced, v15);
+DOMAIN_FIELD(__u32, sbe_pushed, v15);
+DOMAIN_FIELD(__u32, sbf_count, v15);
+DOMAIN_FIELD(__u32, sbf_balanced, v15);
+DOMAIN_FIELD(__u32, sbf_pushed, v15);
+DOMAIN_FIELD(__u32, ttwu_wake_remote, v15);
+DOMAIN_FIELD(__u32, ttwu_move_affine, v15);
+DOMAIN_FIELD(__u32, ttwu_move_balance, v15);
+#endif
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 11e49cafa3af..af1add2abf72 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -2530,6 +2530,8 @@ int cmd_inject(int argc, const char **argv)
 	inject.tool.finished_init	= perf_event__repipe_op2_synth;
 	inject.tool.compressed		= perf_event__repipe_op4_synth;
 	inject.tool.auxtrace		= perf_event__repipe_auxtrace;
+	inject.tool.schedstat_cpu	= perf_event__repipe_op2_synth;
+	inject.tool.schedstat_domain	= perf_event__repipe_op2_synth;
 	inject.tool.dont_split_sample_group = true;
 	inject.session = __perf_session__new(&data, &inject.tool,
 					     /*trace_event_repipe=*/inject.output.is_pipe);
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 26ece6e9bfd1..1c3b56013164 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -28,6 +28,8 @@
 #include "util/debug.h"
 #include "util/event.h"
 #include "util/util.h"
+#include "util/synthetic-events.h"
+#include "util/target.h"
 
 #include <linux/kernel.h>
 #include <linux/log2.h>
@@ -55,6 +57,7 @@
 #define MAX_PRIO		140
 
 static const char *cpu_list;
+static struct perf_cpu_map *user_requested_cpus;
 static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 
 struct sched_atom;
@@ -236,6 +239,9 @@ struct perf_sched {
 	volatile bool   thread_funcs_exit;
 	const char	*prio_str;
 	DECLARE_BITMAP(prio_bitmap, MAX_PRIO);
+
+	struct perf_session *session;
+	struct perf_data *data;
 };
 
 /* per thread run time data */
@@ -3670,6 +3676,199 @@ static void setup_sorting(struct perf_sched *sched, const struct option *options
 	sort_dimension__add("pid", &sched->cmp_pid);
 }
 
+static int process_synthesized_schedstat_event(const struct perf_tool *tool,
+					       union perf_event *event,
+					       struct perf_sample *sample __maybe_unused,
+					       struct machine *machine __maybe_unused)
+{
+	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
+
+	if (perf_data__write(sched->data, event, event->header.size) <= 0) {
+		pr_err("failed to write perf data, error: %m\n");
+		return -1;
+	}
+
+	sched->session->header.data_size += event->header.size;
+	return 0;
+}
+
+static void sighandler(int sig __maybe_unused)
+{
+}
+
+static int enable_sched_schedstats(int *reset)
+{
+	char path[PATH_MAX];
+	FILE *fp;
+	char ch;
+
+	snprintf(path, PATH_MAX, "%s/sys/kernel/sched_schedstats", procfs__mountpoint());
+	fp = fopen(path, "w+");
+	if (!fp) {
+		pr_err("Failed to open %s\n", path);
+		return -1;
+	}
+
+	ch = getc(fp);
+	if (ch == '0') {
+		*reset = 1;
+		rewind(fp);
+		putc('1', fp);
+	}
+	fclose(fp);
+	return 0;
+}
+
+static int disable_sched_schedstat(void)
+{
+	char path[PATH_MAX];
+	FILE *fp;
+
+	snprintf(path, PATH_MAX, "%s/sys/kernel/sched_schedstats", procfs__mountpoint());
+	fp = fopen(path, "w");
+	if (!fp) {
+		pr_err("Failed to open %s\n", path);
+		return -1;
+	}
+
+	putc('0', fp);
+	fclose(fp);
+	return 0;
+}
+
+/* Output file name (perf.data by default) used only by the stats subcommand. */
+const char *output_name;
+
+static int perf_sched__schedstat_record(struct perf_sched *sched,
+					int argc, const char **argv)
+{
+	struct perf_session *session;
+	struct evlist *evlist;
+	struct target *target;
+	int reset = 0;
+	int err = 0;
+	int fd;
+	struct perf_data data = {
+		.path  = output_name,
+		.mode  = PERF_DATA_MODE_WRITE,
+	};
+
+	signal(SIGINT, sighandler);
+	signal(SIGCHLD, sighandler);
+	signal(SIGTERM, sighandler);
+
+	evlist = evlist__new();
+	if (!evlist)
+		return -ENOMEM;
+
+	session = perf_session__new(&data, &sched->tool);
+	if (IS_ERR(session)) {
+		pr_err("Perf session creation failed.\n");
+		evlist__delete(evlist);
+		return PTR_ERR(session);
+	}
+
+	session->evlist = evlist;
+
+	sched->session = session;
+	sched->data = &data;
+
+	fd = perf_data__fd(&data);
+
+	/*
+	 * Capture all important metadata about the system. Although it is
+	 * not used by the `perf sched stats` tool directly, it provides
+	 * useful information about the profiled environment.
+	 */
+	perf_header__set_feat(&session->header, HEADER_HOSTNAME);
+	perf_header__set_feat(&session->header, HEADER_OSRELEASE);
+	perf_header__set_feat(&session->header, HEADER_VERSION);
+	perf_header__set_feat(&session->header, HEADER_ARCH);
+	perf_header__set_feat(&session->header, HEADER_NRCPUS);
+	perf_header__set_feat(&session->header, HEADER_CPUDESC);
+	perf_header__set_feat(&session->header, HEADER_CPUID);
+	perf_header__set_feat(&session->header, HEADER_TOTAL_MEM);
+	perf_header__set_feat(&session->header, HEADER_CMDLINE);
+	perf_header__set_feat(&session->header, HEADER_CPU_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_NUMA_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_CACHE);
+	perf_header__set_feat(&session->header, HEADER_MEM_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_CPU_PMU_CAPS);
+	perf_header__set_feat(&session->header, HEADER_HYBRID_TOPOLOGY);
+	perf_header__set_feat(&session->header, HEADER_PMU_CAPS);
+
+	err = perf_session__write_header(session, evlist, fd, false);
+	if (err < 0)
+		goto out;
+
+	/*
+	 * `perf sched stats` does not support process profiling (-p pid)
+	 * since the /proc/schedstat file contains only cpu-specific data.
+	 * Hence, a profile target is either a set of cpus or system-wide,
+	 * never a process. Note that, although `-- <workload>` is supported,
+	 * the profile data is still cpu/system-wide.
+	 */
+	target = zalloc(sizeof(struct target));
+	if (!target) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	if (cpu_list)
+		target->cpu_list = cpu_list;
+	else
+		target->system_wide = true;
+
+	if (argc) {
+		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
+		if (err)
+			goto out_target;
+	}
+
+	if (cpu_list) {
+		user_requested_cpus = perf_cpu_map__new(cpu_list);
+		if (!user_requested_cpus) {
+			err = -ENOMEM;
+			goto out_target;
+		}
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_schedstat_event,
+					       user_requested_cpus);
+	if (err < 0)
+		goto out_target;
+
+	err = enable_sched_schedstats(&reset);
+	if (err < 0)
+		goto out_target;
+
+	if (argc)
+		evlist__start_workload(evlist);
+
+	/* wait for signal */
+	pause();
+
+	if (reset) {
+		err = disable_sched_schedstat();
+		if (err < 0)
+			goto out_target;
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_schedstat_event,
+					       user_requested_cpus);
+	if (err < 0)
+		goto out_target;
+
+	err = perf_session__write_header(session, evlist, fd, true);
+
+out_target:
+	free(target);
+out:
+	if (!err)
+		fprintf(stderr, "[ perf sched stats: Wrote samples to %s ]\n", data.path);
+	else
+		fprintf(stderr, "[ perf sched stats: Failed !! ]\n");
+
+	close(fd);
+	perf_session__delete(session);
+
+	return err;
+}
+
 static bool schedstat_events_exposed(void)
 {
 	/*
@@ -3846,6 +4045,12 @@ int cmd_sched(int argc, const char **argv)
 	OPT_BOOLEAN('P', "pre-migrations", &sched.pre_migrations, "Show pre-migration wait time"),
 	OPT_PARENT(sched_options)
 	};
+	const struct option stats_options[] = {
+	OPT_STRING('o', "output", &output_name, "file",
+		   "output file name for `stats record`"),
+	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
+	OPT_END()
+	};
 
 	const char * const latency_usage[] = {
 		"perf sched latency [<options>]",
@@ -3863,9 +4068,13 @@ int cmd_sched(int argc, const char **argv)
 		"perf sched timehist [<options>]",
 		NULL
 	};
+	const char *stats_usage[] = {
+		"perf sched stats {record} [<options>]",
+		NULL
+	};
 	const char *const sched_subcommands[] = { "record", "latency", "map",
 						  "replay", "script",
-						  "timehist", NULL };
+						  "timehist", "stats", NULL };
 	const char *sched_usage[] = {
 		NULL,
 		NULL
@@ -3961,6 +4170,21 @@ int cmd_sched(int argc, const char **argv)
 			return ret;
 
 		return perf_sched__timehist(&sched);
+	} else if (!strcmp(argv[0], "stats")) {
+		const char *const stats_subcommands[] = {"record", NULL};
+
+		argc = parse_options_subcommand(argc, argv, stats_options,
+						stats_subcommands,
+						stats_usage,
+						PARSE_OPT_STOP_AT_NON_OPTION);
+
+		if (argv[0] && !strcmp(argv[0], "record")) {
+			if (argc)
+				argc = parse_options(argc, argv, stats_options,
+						     stats_usage, 0);
+			return perf_sched__schedstat_record(&sched, argc, argv);
+		}
+		usage_with_options(stats_usage, stats_options);
 	} else {
 		usage_with_options(sched_usage, sched_options);
 	}
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d1917..0f863d38abe2 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -77,6 +77,8 @@ static const char *perf_event__names[] = {
 	[PERF_RECORD_HEADER_FEATURE]		= "FEATURE",
 	[PERF_RECORD_COMPRESSED]		= "COMPRESSED",
 	[PERF_RECORD_FINISHED_INIT]		= "FINISHED_INIT",
+	[PERF_RECORD_SCHEDSTAT_CPU]		= "SCHEDSTAT_CPU",
+	[PERF_RECORD_SCHEDSTAT_DOMAIN]		= "SCHEDSTAT_DOMAIN",
 };
 
 const char *perf_event__name(unsigned int id)
@@ -550,6 +552,102 @@ size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *ma
 	return ret;
 }
 
+size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
+{
+	struct perf_record_schedstat_cpu *cs = &event->schedstat_cpu;
+	__u16 version = cs->version;
+	size_t size = 0;
+
+	size = fprintf(fp, "\ncpu%u ", cs->cpu);
+
+#define CPU_FIELD(_type, _name, _ver)						\
+	size += fprintf(fp, "%" PRIu64 " ", (__u64)cs->_ver._name)
+
+	if (version == 15) {
+#include <perf/schedstat-v15.h>
+		return size;
+	}
+#undef CPU_FIELD
+
+	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
+		       event->schedstat_cpu.version);
+}
+
+size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
+{
+	struct perf_record_schedstat_domain *ds = &event->schedstat_domain;
+	__u16 version = ds->version;
+	size_t cpu_mask_len_2;
+	size_t cpu_mask_len;
+	size_t size = 0;
+	char *cpu_mask;
+	int idx;
+	int i, j;
+	bool low;
+
+	if (ds->name[0])
+		size = fprintf(fp, "\ndomain%u:%s ", ds->domain, ds->name);
+	else
+		size = fprintf(fp, "\ndomain%u ", ds->domain);
+
+	cpu_mask_len = ((ds->nr_cpus + 3) >> 2);
+	cpu_mask_len_2 = cpu_mask_len + ((cpu_mask_len - 1) / 8);
+
+	cpu_mask = zalloc(cpu_mask_len_2 + 1);
+	if (!cpu_mask)
+		return fprintf(fp, "Cannot allocate memory for cpumask\n");
+
+	idx = ((ds->nr_cpus + 7) >> 3) - 1;
+
+	i = cpu_mask_len_2 - 1;
+
+	low = true;
+	j = 1;
+	while (i >= 0) {
+		__u8 m;
+
+		if (low)
+			m = ds->cpu_mask[idx] & 0xf;
+		else
+			m = (ds->cpu_mask[idx] & 0xf0) >> 4;
+
+		if (m <= 9)
+			m += '0';
+		else
+			m = m + 'a' - 10;
+
+		cpu_mask[i] = m;
+
+		if (j == 8 && i != 0) {
+			cpu_mask[i - 1] = ',';
+			j = 0;
+			i--;
+		}
+
+		if (!low)
+			idx--;
+		low = !low;
+		i--;
+		j++;
+	}
+	size += fprintf(fp, "%s ", cpu_mask);
+	free(cpu_mask);
+
+#define DOMAIN_FIELD(_type, _name, _ver)					\
+	size += fprintf(fp, "%" PRIu64 " ", (__u64)ds->_ver._name)
+
+	if (version == 15) {
+#include <perf/schedstat-v15.h>
+		return size;
+	}
+#undef DOMAIN_FIELD
+
+	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
+		       event->schedstat_domain.version);
+}
+
 size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp)
 {
 	size_t ret = fprintf(fp, "PERF_RECORD_%s",
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 2744c54f404e..333f2405cd5a 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -361,6 +361,8 @@ size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *machine,FILE *fp);
+size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp);
 
 int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index c06e3020a976..bcffee2b7239 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -692,6 +692,20 @@ static void perf_event__time_conv_swap(union perf_event *event,
 	}
 }
 
+static void
+perf_event__schedstat_cpu_swap(union perf_event *event __maybe_unused,
+			       bool sample_id_all __maybe_unused)
+{
+	/* FIXME */
+}
+
+static void
+perf_event__schedstat_domain_swap(union perf_event *event __maybe_unused,
+				  bool sample_id_all __maybe_unused)
+{
+	/* FIXME */
+}
+
 typedef void (*perf_event__swap_op)(union perf_event *event,
 				    bool sample_id_all);
 
@@ -730,6 +744,8 @@ static perf_event__swap_op perf_event__swap_ops[] = {
 	[PERF_RECORD_STAT_ROUND]	  = perf_event__stat_round_swap,
 	[PERF_RECORD_EVENT_UPDATE]	  = perf_event__event_update_swap,
 	[PERF_RECORD_TIME_CONV]		  = perf_event__time_conv_swap,
+	[PERF_RECORD_SCHEDSTAT_CPU]	  = perf_event__schedstat_cpu_swap,
+	[PERF_RECORD_SCHEDSTAT_DOMAIN]	  = perf_event__schedstat_domain_swap,
 	[PERF_RECORD_HEADER_MAX]	  = NULL,
 };
 
@@ -1455,6 +1471,10 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 		return err;
 	case PERF_RECORD_FINISHED_INIT:
 		return tool->finished_init(session, event);
+	case PERF_RECORD_SCHEDSTAT_CPU:
+		return tool->schedstat_cpu(session, event);
+	case PERF_RECORD_SCHEDSTAT_DOMAIN:
+		return tool->schedstat_domain(session, event);
 	default:
 		return -EINVAL;
 	}
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 6923b0d5efed..f928f07bea15 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2511,3 +2511,242 @@ int parse_synth_opt(char *synth)
 
 	return ret;
 }
+
+static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version,
+						    __u64 *cpu, __u64 timestamp)
+{
+	struct perf_record_schedstat_cpu *cs;
+	union perf_event *event;
+	size_t size;
+	char ch;
+
+	size = sizeof(struct perf_record_schedstat_cpu);
+	size = PERF_ALIGN(size, sizeof(u64));
+	event = zalloc(size);
+
+	if (!event)
+		return NULL;
+
+	cs = &event->schedstat_cpu;
+	cs->header.type = PERF_RECORD_SCHEDSTAT_CPU;
+	cs->header.size = size;
+	cs->timestamp = timestamp;
+
+	if (io__get_char(io) != 'p' || io__get_char(io) != 'u')
+		goto out_cpu;
+
+	if (io__get_dec(io, (__u64 *)cpu) != ' ')
+		goto out_cpu;
+
+#define CPU_FIELD(_type, _name, _ver)					\
+	do {								\
+		__u64 _tmp;						\
+		ch = io__get_dec(io, &_tmp);				\
+		if (ch != ' ' && ch != '\n')				\
+			goto out_cpu;					\
+		cs->_ver._name = _tmp;					\
+	} while (0)
+
+	if (version == 15) {
+#include <perf/schedstat-v15.h>
+	}
+#undef CPU_FIELD
+
+	cs->cpu = *cpu;
+	cs->version = version;
+
+	return event;
+out_cpu:
+	free(event);
+	return NULL;
+}
+
+static size_t schedstat_sanitize_cpumask(char *cpu_mask, size_t cpu_mask_len)
+{
+	char *dst = cpu_mask;
+	char *src = cpu_mask;
+	int i = 0;
+
+	for ( ; src < cpu_mask + cpu_mask_len; dst++, src++) {
+		while (*src == ',')
+			src++;
+
+		*dst = *src;
+	}
+
+	for ( ; dst < src; dst++, i++)
+		*dst = '\0';
+
+	return cpu_mask_len - i;
+}
+
+static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 version,
+						       __u64 cpu, __u64 timestamp)
+{
+	struct perf_env env = { .total_mem = 0, };
+	int nr_cpus_avail = perf_env__nr_cpus_avail(&env);
+	struct perf_record_schedstat_domain *ds;
+	union perf_event *event;
+	char *d_name = NULL;
+	size_t cpu_mask_len = 0;
+	char *cpu_mask = NULL;
+	__u64 d_num;
+	size_t size;
+	int i = 0;
+	bool low;
+	char ch;
+	int idx;
+
+	if (io__get_char(io) != 'o' || io__get_char(io) != 'm' || io__get_char(io) != 'a' ||
+	    io__get_char(io) != 'i' || io__get_char(io) != 'n')
+		return NULL;
+
+	ch = io__get_dec(io, &d_num);
+
+	if (io__getdelim(io, &cpu_mask, &cpu_mask_len, ' ') < 0 || !cpu_mask_len)
+		goto out;
+
+	cpu_mask[cpu_mask_len - 1] = '\0';
+	cpu_mask_len--;
+	cpu_mask_len = schedstat_sanitize_cpumask(cpu_mask, cpu_mask_len);
+
+	size = sizeof(struct perf_record_schedstat_domain) + ((nr_cpus_avail + 7) >> 3);
+	size = PERF_ALIGN(size, sizeof(u64));
+	event = zalloc(size);
+
+	if (!event)
+		goto out_cpu_mask;
+
+	ds = &event->schedstat_domain;
+	ds->header.type = PERF_RECORD_SCHEDSTAT_DOMAIN;
+	ds->header.size = size;
+	ds->version = version;
+	ds->timestamp = timestamp;
+	if (d_name)
+		strncpy(ds->name, d_name, DOMAIN_NAME_LEN - 1);
+	ds->domain = d_num;
+	ds->nr_cpus = nr_cpus_avail;
+
+	idx = ((nr_cpus_avail + 7) >> 3) - 1;
+	low = true;
+	for (i = cpu_mask_len - 1; i >= 0 && idx >= 0; i--) {
+		char mask = cpu_mask[i];
+
+		if (mask >= '0' && mask <= '9')
+			mask -= '0';
+		else if (mask >= 'a' && mask <= 'f')
+			mask = mask - 'a' + 10;
+		else if (mask >= 'A' && mask <= 'F')
+			mask = mask - 'A' + 10;
+
+		if (low) {
+			ds->cpu_mask[idx] = mask;
+		} else {
+			ds->cpu_mask[idx] |= (mask << 4);
+			idx--;
+		}
+		low = !low;
+	}
+
+	free(cpu_mask);
+
+#define DOMAIN_FIELD(_type, _name, _ver)				\
+	do {								\
+		__u64 _tmp;						\
+		ch = io__get_dec(io, &_tmp);				\
+		if (ch != ' ' && ch != '\n')				\
+			goto out_domain;				\
+		ds->_ver._name = _tmp;					\
+	} while (0)
+
+	if (version == 15) {
+#include <perf/schedstat-v15.h>
+	}
+#undef DOMAIN_FIELD
+
+	ds->cpu = cpu;
+	return event;
+
+out_domain:
+	free(event);
+out_cpu_mask:
+	free(cpu_mask);
+out:
+	return NULL;
+}
+
+int perf_event__synthesize_schedstat(const struct perf_tool *tool,
+				     perf_event__handler_t process,
+				     struct perf_cpu_map *user_requested_cpus)
+{
+	char *line = NULL, path[PATH_MAX];
+	union perf_event *event = NULL;
+	size_t line_len = 0;
+	char bf[BUFSIZ];
+	__u64 timestamp;
+	__u64 cpu = -1;
+	__u16 version;
+	struct io io;
+	int ret = -1;
+	char ch;
+
+	snprintf(path, PATH_MAX, "%s/schedstat", procfs__mountpoint());
+	io.fd = open(path, O_RDONLY, 0);
+	if (io.fd < 0) {
+		pr_err("Failed to open %s. Possibly CONFIG_SCHEDSTAT is disabled.\n", path);
+		return -1;
+	}
+	io__init(&io, io.fd, bf, sizeof(bf));
+
+	if (io__getline(&io, &line, &line_len) < 0 || !line_len)
+		goto out;
+
+	if (!strcmp(line, "version 15\n")) {
+		version = 15;
+	} else {
+		pr_err("Unsupported %s version: %s", path, line + 8);
+		goto out_free_line;
+	}
+
+	if (io__getline(&io, &line, &line_len) < 0 || !line_len)
+		goto out_free_line;
+	timestamp = atol(line + 10);
+
+	/*
+	 * FIXME: Can be optimized a bit by not synthesizing domain samples
+	 * for filtered out cpus.
+	 */
+	for (ch = io__get_char(&io); !io.eof; ch = io__get_char(&io)) {
+		struct perf_cpu this_cpu;
+
+		event = NULL;
+		if (ch == 'c') {
+			event = __synthesize_schedstat_cpu(&io, version,
+							   &cpu, timestamp);
+		} else if (ch == 'd') {
+			event = __synthesize_schedstat_domain(&io, version,
+							      cpu, timestamp);
+		}
+		if (!event)
+			goto out_free_line;
+
+		this_cpu.cpu = cpu;
+
+		if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu)) {
+			free(event);
+			continue;
+		}
+
+		if (process(tool, event, NULL, NULL) < 0) {
+			free(event);
+			goto out_free_line;
+		}
+
+		free(event);
+	}
+
+	ret = 0;
+
+out_free_line:
+	free(line);
+out:
+	close(io.fd);
+	return ret;
+}
diff --git a/tools/perf/util/synthetic-events.h b/tools/perf/util/synthetic-events.h
index b9c936b5cfeb..eab914c238df 100644
--- a/tools/perf/util/synthetic-events.h
+++ b/tools/perf/util/synthetic-events.h
@@ -141,4 +141,7 @@ int perf_event__synthesize_for_pipe(const struct perf_tool *tool,
 				    struct perf_data *data,
 				    perf_event__handler_t process);
 
+int perf_event__synthesize_schedstat(const struct perf_tool *tool,
+				     perf_event__handler_t process,
+				     struct perf_cpu_map *user_requested_cpu);
 #endif // __PERF_SYNTHETIC_EVENTS_H
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index 3b7f390f26eb..9f81d720735f 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -230,6 +230,24 @@ static int perf_session__process_compressed_event_stub(struct perf_session *sess
 	return 0;
 }
 
+static int process_schedstat_cpu_stub(struct perf_session *perf_session __maybe_unused,
+				      union perf_event *event)
+{
+	if (dump_trace)
+		perf_event__fprintf_schedstat_cpu(event, stdout);
+	dump_printf(": unhandled!\n");
+	return 0;
+}
+
+static int process_schedstat_domain_stub(struct perf_session *perf_session __maybe_unused,
+					 union perf_event *event)
+{
+	if (dump_trace)
+		perf_event__fprintf_schedstat_domain(event, stdout);
+	dump_printf(": unhandled!\n");
+	return 0;
+}
+
 void perf_tool__init(struct perf_tool *tool, bool ordered_events)
 {
 	tool->ordered_events = ordered_events;
@@ -286,6 +304,8 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
 	tool->compressed = perf_session__process_compressed_event_stub;
 #endif
 	tool->finished_init = process_event_op2_stub;
+	tool->schedstat_cpu = process_schedstat_cpu_stub;
+	tool->schedstat_domain = process_schedstat_domain_stub;
 }
 
 bool perf_tool__compressed_is_stub(const struct perf_tool *tool)
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index db1c7642b0d1..d289a5396b01 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -77,7 +77,9 @@ struct perf_tool {
 			stat,
 			stat_round,
 			feature,
-			finished_init;
+			finished_init,
+			schedstat_cpu,
+			schedstat_domain;
 	event_op4	compressed;
 	event_op3	auxtrace;
 	bool		ordered_events;
-- 
2.43.0



* [PATCH v3 2/8] perf sched stats: Add schedstat v16 support
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 3/8] perf sched stats: Add schedstat v17 support Swapnil Sapkal
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	James Clark

The /proc/schedstat file output is standardized with a version number
on its first line. Add support to record and raw dump the v16 layout.

Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/lib/perf/Makefile                     |  2 +-
 tools/lib/perf/include/perf/event.h         | 14 ++++++
 tools/lib/perf/include/perf/schedstat-v16.h | 52 +++++++++++++++++++++
 tools/perf/util/event.c                     |  6 +++
 tools/perf/util/synthetic-events.c          |  6 +++
 5 files changed, 79 insertions(+), 1 deletion(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-v16.h

diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
index 4b60804aa0b6..d0506a13a97f 100644
--- a/tools/lib/perf/Makefile
+++ b/tools/lib/perf/Makefile
@@ -174,7 +174,7 @@ install_lib: libs
 		$(call do_install_mkdir,$(libdir_SQ)); \
 		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
 
-HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h
+HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h
 INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
 
 INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 189106874063..8ef70799e070 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -463,11 +463,18 @@ struct perf_record_schedstat_cpu_v15 {
 #undef CPU_FIELD
 };
 
+struct perf_record_schedstat_cpu_v16 {
+#define CPU_FIELD(_type, _name, _ver)		_type _name
+#include "schedstat-v16.h"
+#undef CPU_FIELD
+};
+
 struct perf_record_schedstat_cpu {
 	struct perf_event_header header;
 	__u64			 timestamp;
 	union {
 		struct perf_record_schedstat_cpu_v15 v15;
+		struct perf_record_schedstat_cpu_v16 v16;
 	};
 	__u32			 cpu;
 	__u16			 version;
@@ -479,6 +486,12 @@ struct perf_record_schedstat_domain_v15 {
 #undef DOMAIN_FIELD
 };
 
+struct perf_record_schedstat_domain_v16 {
+#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
+#include "schedstat-v16.h"
+#undef DOMAIN_FIELD
+};
+
 #define DOMAIN_NAME_LEN		16
 
 struct perf_record_schedstat_domain {
@@ -490,6 +503,7 @@ struct perf_record_schedstat_domain {
 	char			 name[DOMAIN_NAME_LEN];
 	union {
 		struct perf_record_schedstat_domain_v15 v15;
+		struct perf_record_schedstat_domain_v16 v16;
 	};
 	__u16			 nr_cpus;
 	__u8			 cpu_mask[];
diff --git a/tools/lib/perf/include/perf/schedstat-v16.h b/tools/lib/perf/include/perf/schedstat-v16.h
new file mode 100644
index 000000000000..d6a4691b2fd5
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-v16.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CPU_FIELD
+CPU_FIELD(__u32, yld_count, v16);
+CPU_FIELD(__u32, array_exp, v16);
+CPU_FIELD(__u32, sched_count, v16);
+CPU_FIELD(__u32, sched_goidle, v16);
+CPU_FIELD(__u32, ttwu_count, v16);
+CPU_FIELD(__u32, ttwu_local, v16);
+CPU_FIELD(__u64, rq_cpu_time, v16);
+CPU_FIELD(__u64, run_delay, v16);
+CPU_FIELD(__u64, pcount, v16);
+#endif
+
+#ifdef DOMAIN_FIELD
+DOMAIN_FIELD(__u32, busy_lb_count, v16);
+DOMAIN_FIELD(__u32, busy_lb_balanced, v16);
+DOMAIN_FIELD(__u32, busy_lb_failed, v16);
+DOMAIN_FIELD(__u32, busy_lb_imbalance, v16);
+DOMAIN_FIELD(__u32, busy_lb_gained, v16);
+DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16);
+DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16);
+DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16);
+DOMAIN_FIELD(__u32, idle_lb_count, v16);
+DOMAIN_FIELD(__u32, idle_lb_balanced, v16);
+DOMAIN_FIELD(__u32, idle_lb_failed, v16);
+DOMAIN_FIELD(__u32, idle_lb_imbalance, v16);
+DOMAIN_FIELD(__u32, idle_lb_gained, v16);
+DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16);
+DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16);
+DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16);
+DOMAIN_FIELD(__u32, newidle_lb_count, v16);
+DOMAIN_FIELD(__u32, newidle_lb_balanced, v16);
+DOMAIN_FIELD(__u32, newidle_lb_failed, v16);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16);
+DOMAIN_FIELD(__u32, newidle_lb_gained, v16);
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16);
+DOMAIN_FIELD(__u32, alb_count, v16);
+DOMAIN_FIELD(__u32, alb_failed, v16);
+DOMAIN_FIELD(__u32, alb_pushed, v16);
+DOMAIN_FIELD(__u32, sbe_count, v16);
+DOMAIN_FIELD(__u32, sbe_balanced, v16);
+DOMAIN_FIELD(__u32, sbe_pushed, v16);
+DOMAIN_FIELD(__u32, sbf_count, v16);
+DOMAIN_FIELD(__u32, sbf_balanced, v16);
+DOMAIN_FIELD(__u32, sbf_pushed, v16);
+DOMAIN_FIELD(__u32, ttwu_wake_remote, v16);
+DOMAIN_FIELD(__u32, ttwu_move_affine, v16);
+DOMAIN_FIELD(__u32, ttwu_move_balance, v16);
+#endif
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0f863d38abe2..64f81e7b7f70 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -566,6 +566,9 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
 	if (version == 15) {
 #include <perf/schedstat-v15.h>
 		return size;
+	} else if (version == 16) {
+#include <perf/schedstat-v16.h>
+		return size;
 	}
 #undef CPU_FIELD
 
@@ -641,6 +644,9 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
 	if (version == 15) {
 #include <perf/schedstat-v15.h>
 		return size;
+	} else if (version == 16) {
+#include <perf/schedstat-v16.h>
+		return size;
 	}
 #undef DOMAIN_FIELD
 
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index f928f07bea15..e9dc1e14cfea 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2549,6 +2549,8 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
 
 	if (version == 15) {
 #include <perf/schedstat-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-v16.h>
 	}
 #undef CPU_FIELD
 
@@ -2661,6 +2663,8 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 
 	if (version == 15) {
 #include <perf/schedstat-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-v16.h>
 	}
 #undef DOMAIN_FIELD
 
@@ -2703,6 +2707,8 @@ int perf_event__synthesize_schedstat(const struct perf_tool *tool,
 
 	if (!strcmp(line, "version 15\n")) {
 		version = 15;
+	} else if (!strcmp(line, "version 16\n")) {
+		version = 16;
 	} else {
 		pr_err("Unsupported %s version: %s", path, line + 8);
 		goto out_free_line;
-- 
2.43.0



* [PATCH v3 3/8] perf sched stats: Add schedstat v17 support
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 2/8] perf sched stats: Add schedstat v16 support Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-15  2:27   ` Namhyung Kim
  2025-03-11 12:02 ` [PATCH v3 4/8] perf sched stats: Add support for report subcommand Swapnil Sapkal
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

The /proc/schedstat file output is standardized with a version number
on its first line. Add support to record and raw dump the v17 layout.

Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/lib/perf/Makefile                     |  2 +-
 tools/lib/perf/include/perf/event.h         | 14 +++++
 tools/lib/perf/include/perf/schedstat-v17.h | 61 +++++++++++++++++++++
 tools/perf/util/event.c                     |  6 ++
 tools/perf/util/synthetic-events.c          | 15 +++++
 5 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-v17.h

diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
index d0506a13a97f..30712ce8b6b1 100644
--- a/tools/lib/perf/Makefile
+++ b/tools/lib/perf/Makefile
@@ -174,7 +174,7 @@ install_lib: libs
 		$(call do_install_mkdir,$(libdir_SQ)); \
 		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
 
-HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h
+HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h schedstat-v17.h
 INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
 
 INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 8ef70799e070..0d1983ad9a41 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -469,12 +469,19 @@ struct perf_record_schedstat_cpu_v16 {
 #undef CPU_FIELD
 };
 
+struct perf_record_schedstat_cpu_v17 {
+#define CPU_FIELD(_type, _name, _ver)		_type _name
+#include "schedstat-v17.h"
+#undef CPU_FIELD
+};
+
 struct perf_record_schedstat_cpu {
 	struct perf_event_header header;
 	__u64			 timestamp;
 	union {
 		struct perf_record_schedstat_cpu_v15 v15;
 		struct perf_record_schedstat_cpu_v16 v16;
+		struct perf_record_schedstat_cpu_v17 v17;
 	};
 	__u32			 cpu;
 	__u16			 version;
@@ -492,6 +499,12 @@ struct perf_record_schedstat_domain_v16 {
 #undef DOMAIN_FIELD
 };
 
+struct perf_record_schedstat_domain_v17 {
+#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
+#include "schedstat-v17.h"
+#undef DOMAIN_FIELD
+};
+
 #define DOMAIN_NAME_LEN		16
 
 struct perf_record_schedstat_domain {
@@ -504,6 +517,7 @@ struct perf_record_schedstat_domain {
 	union {
 		struct perf_record_schedstat_domain_v15 v15;
 		struct perf_record_schedstat_domain_v16 v16;
+		struct perf_record_schedstat_domain_v17 v17;
 	};
 	__u16			 nr_cpus;
 	__u8			 cpu_mask[];
diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
new file mode 100644
index 000000000000..851d4f1f4ecb
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-v17.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CPU_FIELD
+CPU_FIELD(__u32, yld_count, v17);
+CPU_FIELD(__u32, array_exp, v17);
+CPU_FIELD(__u32, sched_count, v17);
+CPU_FIELD(__u32, sched_goidle, v17);
+CPU_FIELD(__u32, ttwu_count, v17);
+CPU_FIELD(__u32, ttwu_local, v17);
+CPU_FIELD(__u64, rq_cpu_time, v17);
+CPU_FIELD(__u64, run_delay, v17);
+CPU_FIELD(__u64, pcount, v17);
+#endif
+
+#ifdef DOMAIN_FIELD
+DOMAIN_FIELD(__u32, busy_lb_count, v17);
+DOMAIN_FIELD(__u32, busy_lb_balanced, v17);
+DOMAIN_FIELD(__u32, busy_lb_failed, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_load, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_util, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_task, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit, v17);
+DOMAIN_FIELD(__u32, busy_lb_gained, v17);
+DOMAIN_FIELD(__u32, busy_lb_hot_gained, v17);
+DOMAIN_FIELD(__u32, busy_lb_nobusyq, v17);
+DOMAIN_FIELD(__u32, busy_lb_nobusyg, v17);
+DOMAIN_FIELD(__u32, idle_lb_count, v17);
+DOMAIN_FIELD(__u32, idle_lb_balanced, v17);
+DOMAIN_FIELD(__u32, idle_lb_failed, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_load, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_util, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_task, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit, v17);
+DOMAIN_FIELD(__u32, idle_lb_gained, v17);
+DOMAIN_FIELD(__u32, idle_lb_hot_gained, v17);
+DOMAIN_FIELD(__u32, idle_lb_nobusyq, v17);
+DOMAIN_FIELD(__u32, idle_lb_nobusyg, v17);
+DOMAIN_FIELD(__u32, newidle_lb_count, v17);
+DOMAIN_FIELD(__u32, newidle_lb_balanced, v17);
+DOMAIN_FIELD(__u32, newidle_lb_failed, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_load, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_util, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_task, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit, v17);
+DOMAIN_FIELD(__u32, newidle_lb_gained, v17);
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v17);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v17);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v17);
+DOMAIN_FIELD(__u32, alb_count, v17);
+DOMAIN_FIELD(__u32, alb_failed, v17);
+DOMAIN_FIELD(__u32, alb_pushed, v17);
+DOMAIN_FIELD(__u32, sbe_count, v17);
+DOMAIN_FIELD(__u32, sbe_balanced, v17);
+DOMAIN_FIELD(__u32, sbe_pushed, v17);
+DOMAIN_FIELD(__u32, sbf_count, v17);
+DOMAIN_FIELD(__u32, sbf_balanced, v17);
+DOMAIN_FIELD(__u32, sbf_pushed, v17);
+DOMAIN_FIELD(__u32, ttwu_wake_remote, v17);
+DOMAIN_FIELD(__u32, ttwu_move_affine, v17);
+DOMAIN_FIELD(__u32, ttwu_move_balance, v17);
+#endif
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 64f81e7b7f70..d09c3c99ab48 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -569,6 +569,9 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
 	} else if (version == 16) {
 #include <perf/schedstat-v16.h>
 		return size;
+	} else if (version == 17) {
+#include <perf/schedstat-v17.h>
+		return size;
 	}
 #undef CPU_FIELD
 
@@ -647,6 +650,9 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
 	} else if (version == 16) {
 #include <perf/schedstat-v16.h>
 		return size;
+	} else if (version == 17) {
+#include <perf/schedstat-v17.h>
+		return size;
 	}
 #undef DOMAIN_FIELD
 
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index e9dc1e14cfea..fad0c472f297 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2551,6 +2551,8 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
 #include <perf/schedstat-v15.h>
 	} else if (version == 16) {
 #include <perf/schedstat-v16.h>
+	} else if (version == 17) {
+#include <perf/schedstat-v17.h>
 	}
 #undef CPU_FIELD
 
@@ -2589,6 +2591,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 	int nr_cpus_avail = perf_env__nr_cpus_avail(&env);
 	struct perf_record_schedstat_domain *ds;
 	union perf_event *event;
+	size_t d_name_len = 0;
 	char *d_name = NULL;
 	size_t cpu_mask_len = 0;
 	char *cpu_mask = NULL;
@@ -2604,6 +2607,12 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 		return NULL;
 
 	ch = io__get_dec(io, &d_num);
+	if (version >= 17) {
+		if (io__getdelim(io, &d_name, &d_name_len, ' ') < 0 || !d_name_len)
+			return NULL;
+		d_name[d_name_len - 1] = '\0';
+		d_name_len--;
+	}
 
 	if (io__getdelim(io, &cpu_mask, &cpu_mask_len, ' ') < 0 || !cpu_mask_len)
 		goto out;
@@ -2650,6 +2659,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 		low = !low;
 	}
 
+	free(d_name);
 	free(cpu_mask);
 
 #define DOMAIN_FIELD(_type, _name, _ver)				\
@@ -2665,6 +2675,8 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 #include <perf/schedstat-v15.h>
 	} else if (version == 16) {
 #include <perf/schedstat-v16.h>
+	} else if (version == 17) {
+#include <perf/schedstat-v17.h>
 	}
 #undef DOMAIN_FIELD
 
@@ -2676,6 +2688,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 out_cpu_mask:
 	free(cpu_mask);
 out:
+	free(d_name);
 	return NULL;
 }
 
@@ -2709,6 +2722,8 @@ int perf_event__synthesize_schedstat(const struct perf_tool *tool,
 		version = 15;
 	} else if (!strcmp(line, "version 16\n")) {
 		version = 16;
+	} else if (!strcmp(line, "version 17\n")) {
+		version = 17;
 	} else {
 		pr_err("Unsupported %s version: %s", path, line + 8);
 		goto out_free_line;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 4/8] perf sched stats: Add support for report subcommand
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
                   ` (2 preceding siblings ...)
  2025-03-11 12:02 ` [PATCH v3 3/8] perf sched stats: Add schedstat v17 support Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-15  4:39   ` Namhyung Kim
  2025-05-20 10:36   ` Peter Zijlstra
  2025-03-11 12:02 ` [PATCH v3 5/8] perf sched stats: Add support for live mode Swapnil Sapkal
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	James Clark

`perf sched stats record` captures two sets of samples. For a workload
profile, the first set is taken right before the workload starts and the
second right after it finishes. For a systemwide profile, the first set
is taken at the beginning of the profile and the second on receiving
SIGINT.

Add a `perf sched stats report` subcommand that reads both sets of
samples, computes the diff and renders a final report. The final report
prints scheduler stats at CPU granularity as well as at sched domain
granularity.

Example usage:

  # perf sched stats record
  # perf sched stats report

Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/lib/perf/include/perf/event.h         |  12 +-
 tools/lib/perf/include/perf/schedstat-v15.h | 180 +++++--
 tools/lib/perf/include/perf/schedstat-v16.h | 182 +++++--
 tools/lib/perf/include/perf/schedstat-v17.h | 209 +++++---
 tools/perf/builtin-sched.c                  | 504 +++++++++++++++++++-
 tools/perf/util/event.c                     |   4 +-
 tools/perf/util/synthetic-events.c          |   4 +-
 7 files changed, 938 insertions(+), 157 deletions(-)

diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 0d1983ad9a41..5e2c56c9b038 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -458,19 +458,19 @@ struct perf_record_compressed {
 };
 
 struct perf_record_schedstat_cpu_v15 {
-#define CPU_FIELD(_type, _name, _ver)		_type _name
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
 #include "schedstat-v15.h"
 #undef CPU_FIELD
 };
 
 struct perf_record_schedstat_cpu_v16 {
-#define CPU_FIELD(_type, _name, _ver)		_type _name
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
 #include "schedstat-v16.h"
 #undef CPU_FIELD
 };
 
 struct perf_record_schedstat_cpu_v17 {
-#define CPU_FIELD(_type, _name, _ver)		_type _name
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
 #include "schedstat-v17.h"
 #undef CPU_FIELD
 };
@@ -488,19 +488,19 @@ struct perf_record_schedstat_cpu {
 };
 
 struct perf_record_schedstat_domain_v15 {
-#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
 #include "schedstat-v15.h"
 #undef DOMAIN_FIELD
 };
 
 struct perf_record_schedstat_domain_v16 {
-#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
 #include "schedstat-v16.h"
 #undef DOMAIN_FIELD
 };
 
 struct perf_record_schedstat_domain_v17 {
-#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
 #include "schedstat-v17.h"
 #undef DOMAIN_FIELD
 };
diff --git a/tools/lib/perf/include/perf/schedstat-v15.h b/tools/lib/perf/include/perf/schedstat-v15.h
index 43f8060c5337..011411ac0f7e 100644
--- a/tools/lib/perf/include/perf/schedstat-v15.h
+++ b/tools/lib/perf/include/perf/schedstat-v15.h
@@ -1,52 +1,142 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef CPU_FIELD
-CPU_FIELD(__u32, yld_count, v15);
-CPU_FIELD(__u32, array_exp, v15);
-CPU_FIELD(__u32, sched_count, v15);
-CPU_FIELD(__u32, sched_goidle, v15);
-CPU_FIELD(__u32, ttwu_count, v15);
-CPU_FIELD(__u32, ttwu_local, v15);
-CPU_FIELD(__u64, rq_cpu_time, v15);
-CPU_FIELD(__u64, run_delay, v15);
-CPU_FIELD(__u64, pcount, v15);
+CPU_FIELD(__u32, yld_count, "sched_yield() count",
+	  "%11u", false, yld_count, v15);
+CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
+	  "%11u", false, array_exp, v15);
+CPU_FIELD(__u32, sched_count, "schedule() called",
+	  "%11u", false, sched_count, v15);
+CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
+	  "%11u", true, sched_count, v15);
+CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
+	  "%11u", false, ttwu_count, v15);
+CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
+	  "%11u", true, ttwu_count, v15);
+CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
+	  "%11llu", false, rq_cpu_time, v15);
+CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
+	  "%11llu", true, rq_cpu_time, v15);
+CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
+	  "%11llu", false, pcount, v15);
 #endif
 
 #ifdef DOMAIN_FIELD
-DOMAIN_FIELD(__u32, idle_lb_count, v15);
-DOMAIN_FIELD(__u32, idle_lb_balanced, v15);
-DOMAIN_FIELD(__u32, idle_lb_failed, v15);
-DOMAIN_FIELD(__u32, idle_lb_imbalance, v15);
-DOMAIN_FIELD(__u32, idle_lb_gained, v15);
-DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15);
-DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15);
-DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15);
-DOMAIN_FIELD(__u32, busy_lb_count, v15);
-DOMAIN_FIELD(__u32, busy_lb_balanced, v15);
-DOMAIN_FIELD(__u32, busy_lb_failed, v15);
-DOMAIN_FIELD(__u32, busy_lb_imbalance, v15);
-DOMAIN_FIELD(__u32, busy_lb_gained, v15);
-DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15);
-DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15);
-DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15);
-DOMAIN_FIELD(__u32, newidle_lb_count, v15);
-DOMAIN_FIELD(__u32, newidle_lb_balanced, v15);
-DOMAIN_FIELD(__u32, newidle_lb_failed, v15);
-DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15);
-DOMAIN_FIELD(__u32, newidle_lb_gained, v15);
-DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15);
-DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15);
-DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15);
-DOMAIN_FIELD(__u32, alb_count, v15);
-DOMAIN_FIELD(__u32, alb_failed, v15);
-DOMAIN_FIELD(__u32, alb_pushed, v15);
-DOMAIN_FIELD(__u32, sbe_count, v15);
-DOMAIN_FIELD(__u32, sbe_balanced, v15);
-DOMAIN_FIELD(__u32, sbe_pushed, v15);
-DOMAIN_FIELD(__u32, sbf_count, v15);
-DOMAIN_FIELD(__u32, sbf_balanced, v15);
-DOMAIN_FIELD(__u32, sbf_pushed, v15);
-DOMAIN_FIELD(__u32, ttwu_wake_remote, v15);
-DOMAIN_FIELD(__u32, ttwu_move_affine, v15);
-DOMAIN_FIELD(__u32, ttwu_move_balance, v15);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category idle> ");
 #endif
+DOMAIN_FIELD(__u32, idle_lb_count,
+	     "load_balance() count on cpu idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, idle_lb_balanced,
+	     "load_balance() found balanced on cpu idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, idle_lb_failed,
+	     "load_balance() move task failed on cpu idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, idle_lb_imbalance,
+	     "imbalance sum on cpu idle", "%11u", false, v15);
+DOMAIN_FIELD(__u32, idle_lb_gained,
+	     "pull_task() count on cpu idle", "%11u", false, v15);
+DOMAIN_FIELD(__u32, idle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v15);
+DOMAIN_FIELD(__u32, idle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, idle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v15);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v15);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v15);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category busy> ");
+#endif
+DOMAIN_FIELD(__u32, busy_lb_count,
+	     "load_balance() count on cpu busy", "%11u", true, v15);
+DOMAIN_FIELD(__u32, busy_lb_balanced,
+	     "load_balance() found balanced on cpu busy", "%11u", true, v15);
+DOMAIN_FIELD(__u32, busy_lb_failed,
+	     "load_balance() move task failed on cpu busy", "%11u", true, v15);
+DOMAIN_FIELD(__u32, busy_lb_imbalance,
+	     "imbalance sum on cpu busy", "%11u", false, v15);
+DOMAIN_FIELD(__u32, busy_lb_gained,
+	     "pull_task() count on cpu busy", "%11u", false, v15);
+DOMAIN_FIELD(__u32, busy_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v15);
+DOMAIN_FIELD(__u32, busy_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v15);
+DOMAIN_FIELD(__u32, busy_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v15);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v15);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v15);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category newidle> ");
+#endif
+DOMAIN_FIELD(__u32, newidle_lb_count,
+	     "load_balance() count on cpu newly idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, newidle_lb_balanced,
+	     "load_balance() found balanced on cpu newly idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, newidle_lb_failed,
+	     "load_balance() move task failed on cpu newly idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance,
+	     "imbalance sum on cpu newly idle", "%11u", false, v15);
+DOMAIN_FIELD(__u32, newidle_lb_gained,
+	     "pull_task() count on cpu newly idle", "%11u", false, v15);
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v15);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v15);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v15);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v15);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v15);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category active_load_balance()> ");
+#endif
+DOMAIN_FIELD(__u32, alb_count,
+	     "active_load_balance() count", "%11u", false, v15);
+DOMAIN_FIELD(__u32, alb_failed,
+	     "active_load_balance() move task failed", "%11u", false, v15);
+DOMAIN_FIELD(__u32, alb_pushed,
+	     "active_load_balance() successfully moved a task", "%11u", false, v15);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
+#endif
+DOMAIN_FIELD(__u32, sbe_count,
+	     "sbe_count is not used", "%11u", false, v15);
+DOMAIN_FIELD(__u32, sbe_balanced,
+	     "sbe_balanced is not used", "%11u", false, v15);
+DOMAIN_FIELD(__u32, sbe_pushed,
+	     "sbe_pushed is not used", "%11u", false, v15);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
+#endif
+DOMAIN_FIELD(__u32, sbf_count,
+	     "sbf_count is not used", "%11u", false, v15);
+DOMAIN_FIELD(__u32, sbf_balanced,
+	     "sbf_balanced is not used", "%11u", false, v15);
+DOMAIN_FIELD(__u32, sbf_pushed,
+	     "sbf_pushed is not used", "%11u", false, v15);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Wakeup Info> ");
+#endif
+DOMAIN_FIELD(__u32, ttwu_wake_remote,
+	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v15);
+DOMAIN_FIELD(__u32, ttwu_move_affine,
+	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v15);
+DOMAIN_FIELD(__u32, ttwu_move_balance,
+	     "try_to_wake_up() started passive balancing", "%11u", false, v15);
+#endif /* DOMAIN_FIELD */
diff --git a/tools/lib/perf/include/perf/schedstat-v16.h b/tools/lib/perf/include/perf/schedstat-v16.h
index d6a4691b2fd5..5ba53bd7d61a 100644
--- a/tools/lib/perf/include/perf/schedstat-v16.h
+++ b/tools/lib/perf/include/perf/schedstat-v16.h
@@ -1,52 +1,142 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef CPU_FIELD
-CPU_FIELD(__u32, yld_count, v16);
-CPU_FIELD(__u32, array_exp, v16);
-CPU_FIELD(__u32, sched_count, v16);
-CPU_FIELD(__u32, sched_goidle, v16);
-CPU_FIELD(__u32, ttwu_count, v16);
-CPU_FIELD(__u32, ttwu_local, v16);
-CPU_FIELD(__u64, rq_cpu_time, v16);
-CPU_FIELD(__u64, run_delay, v16);
-CPU_FIELD(__u64, pcount, v16);
-#endif
+CPU_FIELD(__u32, yld_count, "sched_yield() count",
+	  "%11u", false, yld_count, v16);
+CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
+	  "%11u", false, array_exp, v16);
+CPU_FIELD(__u32, sched_count, "schedule() called",
+	  "%11u", false, sched_count, v16);
+CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
+	  "%11u", true, sched_count, v16);
+CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
+	  "%11u", false, ttwu_count, v16);
+CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
+	  "%11u", true, ttwu_count, v16);
+CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
+	  "%11llu", false, rq_cpu_time, v16);
+CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
+	  "%11llu", true, rq_cpu_time, v16);
+CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
+	  "%11llu", false, pcount, v16);
+#endif /* CPU_FIELD */
 
 #ifdef DOMAIN_FIELD
-DOMAIN_FIELD(__u32, busy_lb_count, v16);
-DOMAIN_FIELD(__u32, busy_lb_balanced, v16);
-DOMAIN_FIELD(__u32, busy_lb_failed, v16);
-DOMAIN_FIELD(__u32, busy_lb_imbalance, v16);
-DOMAIN_FIELD(__u32, busy_lb_gained, v16);
-DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16);
-DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16);
-DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16);
-DOMAIN_FIELD(__u32, idle_lb_count, v16);
-DOMAIN_FIELD(__u32, idle_lb_balanced, v16);
-DOMAIN_FIELD(__u32, idle_lb_failed, v16);
-DOMAIN_FIELD(__u32, idle_lb_imbalance, v16);
-DOMAIN_FIELD(__u32, idle_lb_gained, v16);
-DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16);
-DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16);
-DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16);
-DOMAIN_FIELD(__u32, newidle_lb_count, v16);
-DOMAIN_FIELD(__u32, newidle_lb_balanced, v16);
-DOMAIN_FIELD(__u32, newidle_lb_failed, v16);
-DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16);
-DOMAIN_FIELD(__u32, newidle_lb_gained, v16);
-DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16);
-DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16);
-DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16);
-DOMAIN_FIELD(__u32, alb_count, v16);
-DOMAIN_FIELD(__u32, alb_failed, v16);
-DOMAIN_FIELD(__u32, alb_pushed, v16);
-DOMAIN_FIELD(__u32, sbe_count, v16);
-DOMAIN_FIELD(__u32, sbe_balanced, v16);
-DOMAIN_FIELD(__u32, sbe_pushed, v16);
-DOMAIN_FIELD(__u32, sbf_count, v16);
-DOMAIN_FIELD(__u32, sbf_balanced, v16);
-DOMAIN_FIELD(__u32, sbf_pushed, v16);
-DOMAIN_FIELD(__u32, ttwu_wake_remote, v16);
-DOMAIN_FIELD(__u32, ttwu_move_affine, v16);
-DOMAIN_FIELD(__u32, ttwu_move_balance, v16);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category busy> ");
+#endif
+DOMAIN_FIELD(__u32, busy_lb_count,
+	     "load_balance() count on cpu busy", "%11u", true, v16);
+DOMAIN_FIELD(__u32, busy_lb_balanced,
+	     "load_balance() found balanced on cpu busy", "%11u", true, v16);
+DOMAIN_FIELD(__u32, busy_lb_failed,
+	     "load_balance() move task failed on cpu busy", "%11u", true, v16);
+DOMAIN_FIELD(__u32, busy_lb_imbalance,
+	     "imbalance sum on cpu busy", "%11u", false, v16);
+DOMAIN_FIELD(__u32, busy_lb_gained,
+	     "pull_task() count on cpu busy", "%11u", false, v16);
+DOMAIN_FIELD(__u32, busy_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v16);
+DOMAIN_FIELD(__u32, busy_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v16);
+DOMAIN_FIELD(__u32, busy_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v16);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v16);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v16);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category idle> ");
+#endif
+DOMAIN_FIELD(__u32, idle_lb_count,
+	     "load_balance() count on cpu idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, idle_lb_balanced,
+	     "load_balance() found balanced on cpu idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, idle_lb_failed,
+	     "load_balance() move task failed on cpu idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, idle_lb_imbalance,
+	     "imbalance sum on cpu idle", "%11u", false, v16);
+DOMAIN_FIELD(__u32, idle_lb_gained,
+	     "pull_task() count on cpu idle", "%11u", false, v16);
+DOMAIN_FIELD(__u32, idle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v16);
+DOMAIN_FIELD(__u32, idle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, idle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v16);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v16);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v16);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category newidle> ");
+#endif
+DOMAIN_FIELD(__u32, newidle_lb_count,
+	     "load_balance() count on cpu newly idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, newidle_lb_balanced,
+	     "load_balance() found balanced on cpu newly idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, newidle_lb_failed,
+	     "load_balance() move task failed on cpu newly idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance,
+	     "imbalance sum on cpu newly idle", "%11u", false, v16);
+DOMAIN_FIELD(__u32, newidle_lb_gained,
+	     "pull_task() count on cpu newly idle", "%11u", false, v16);
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v16);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v16);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v16);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v16);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v16);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category active_load_balance()> ");
+#endif
+DOMAIN_FIELD(__u32, alb_count,
+	     "active_load_balance() count", "%11u", false, v16);
+DOMAIN_FIELD(__u32, alb_failed,
+	     "active_load_balance() move task failed", "%11u", false, v16);
+DOMAIN_FIELD(__u32, alb_pushed,
+	     "active_load_balance() successfully moved a task", "%11u", false, v16);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
+#endif
+DOMAIN_FIELD(__u32, sbe_count,
+	     "sbe_count is not used", "%11u", false, v16);
+DOMAIN_FIELD(__u32, sbe_balanced,
+	     "sbe_balanced is not used", "%11u", false, v16);
+DOMAIN_FIELD(__u32, sbe_pushed,
+	     "sbe_pushed is not used", "%11u", false, v16);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
+#endif
+DOMAIN_FIELD(__u32, sbf_count,
+	     "sbf_count is not used", "%11u", false, v16);
+DOMAIN_FIELD(__u32, sbf_balanced,
+	     "sbf_balanced is not used", "%11u", false, v16);
+DOMAIN_FIELD(__u32, sbf_pushed,
+	     "sbf_pushed is not used", "%11u", false, v16);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Wakeup Info> ");
 #endif
+DOMAIN_FIELD(__u32, ttwu_wake_remote,
+	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v16);
+DOMAIN_FIELD(__u32, ttwu_move_affine,
+	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v16);
+DOMAIN_FIELD(__u32, ttwu_move_balance,
+	     "try_to_wake_up() started passive balancing", "%11u", false, v16);
+#endif /* DOMAIN_FIELD */
diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
index 851d4f1f4ecb..00009bd5f006 100644
--- a/tools/lib/perf/include/perf/schedstat-v17.h
+++ b/tools/lib/perf/include/perf/schedstat-v17.h
@@ -1,61 +1,160 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #ifdef CPU_FIELD
-CPU_FIELD(__u32, yld_count, v17);
-CPU_FIELD(__u32, array_exp, v17);
-CPU_FIELD(__u32, sched_count, v17);
-CPU_FIELD(__u32, sched_goidle, v17);
-CPU_FIELD(__u32, ttwu_count, v17);
-CPU_FIELD(__u32, ttwu_local, v17);
-CPU_FIELD(__u64, rq_cpu_time, v17);
-CPU_FIELD(__u64, run_delay, v17);
-CPU_FIELD(__u64, pcount, v17);
-#endif
+CPU_FIELD(__u32, yld_count, "sched_yield() count",
+	  "%11u", false, yld_count, v17);
+CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
+	  "%11u", false, array_exp, v17);
+CPU_FIELD(__u32, sched_count, "schedule() called",
+	  "%11u", false, sched_count, v17);
+CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
+	  "%11u", true, sched_count, v17);
+CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
+	  "%11u", false, ttwu_count, v17);
+CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
+	  "%11u", true, ttwu_count, v17);
+CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
+	  "%11llu", false, rq_cpu_time, v17);
+CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
+	  "%11llu", true, rq_cpu_time, v17);
+CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
+	  "%11llu", false, pcount, v17);
+#endif /* CPU_FIELD */
 
 #ifdef DOMAIN_FIELD
-DOMAIN_FIELD(__u32, busy_lb_count, v17);
-DOMAIN_FIELD(__u32, busy_lb_balanced, v17);
-DOMAIN_FIELD(__u32, busy_lb_failed, v17);
-DOMAIN_FIELD(__u32, busy_lb_imbalance_load, v17);
-DOMAIN_FIELD(__u32, busy_lb_imbalance_util, v17);
-DOMAIN_FIELD(__u32, busy_lb_imbalance_task, v17);
-DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit, v17);
-DOMAIN_FIELD(__u32, busy_lb_gained, v17);
-DOMAIN_FIELD(__u32, busy_lb_hot_gained, v17);
-DOMAIN_FIELD(__u32, busy_lb_nobusyq, v17);
-DOMAIN_FIELD(__u32, busy_lb_nobusyg, v17);
-DOMAIN_FIELD(__u32, idle_lb_count, v17);
-DOMAIN_FIELD(__u32, idle_lb_balanced, v17);
-DOMAIN_FIELD(__u32, idle_lb_failed, v17);
-DOMAIN_FIELD(__u32, idle_lb_imbalance_load, v17);
-DOMAIN_FIELD(__u32, idle_lb_imbalance_util, v17);
-DOMAIN_FIELD(__u32, idle_lb_imbalance_task, v17);
-DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit, v17);
-DOMAIN_FIELD(__u32, idle_lb_gained, v17);
-DOMAIN_FIELD(__u32, idle_lb_hot_gained, v17);
-DOMAIN_FIELD(__u32, idle_lb_nobusyq, v17);
-DOMAIN_FIELD(__u32, idle_lb_nobusyg, v17);
-DOMAIN_FIELD(__u32, newidle_lb_count, v17);
-DOMAIN_FIELD(__u32, newidle_lb_balanced, v17);
-DOMAIN_FIELD(__u32, newidle_lb_failed, v17);
-DOMAIN_FIELD(__u32, newidle_lb_imbalance_load, v17);
-DOMAIN_FIELD(__u32, newidle_lb_imbalance_util, v17);
-DOMAIN_FIELD(__u32, newidle_lb_imbalance_task, v17);
-DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit, v17);
-DOMAIN_FIELD(__u32, newidle_lb_gained, v17);
-DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v17);
-DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v17);
-DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v17);
-DOMAIN_FIELD(__u32, alb_count, v17);
-DOMAIN_FIELD(__u32, alb_failed, v17);
-DOMAIN_FIELD(__u32, alb_pushed, v17);
-DOMAIN_FIELD(__u32, sbe_count, v17);
-DOMAIN_FIELD(__u32, sbe_balanced, v17);
-DOMAIN_FIELD(__u32, sbe_pushed, v17);
-DOMAIN_FIELD(__u32, sbf_count, v17);
-DOMAIN_FIELD(__u32, sbf_balanced, v17);
-DOMAIN_FIELD(__u32, sbf_pushed, v17);
-DOMAIN_FIELD(__u32, ttwu_wake_remote, v17);
-DOMAIN_FIELD(__u32, ttwu_move_affine, v17);
-DOMAIN_FIELD(__u32, ttwu_move_balance, v17);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category busy> ");
+#endif
+DOMAIN_FIELD(__u32, busy_lb_count,
+	     "load_balance() count on cpu busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_balanced,
+	     "load_balance() found balanced on cpu busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_failed,
+	     "load_balance() move task failed on cpu busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_load,
+	     "imbalance in load on cpu busy", "%11u", false, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_util,
+	     "imbalance in utilization on cpu busy", "%11u", false, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_task,
+	     "imbalance in number of tasks on cpu busy", "%11u", false, v17);
+DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit,
+	     "imbalance in misfit tasks on cpu busy", "%11u", false, v17);
+DOMAIN_FIELD(__u32, busy_lb_gained,
+	     "pull_task() count on cpu busy", "%11u", false, v17);
+DOMAIN_FIELD(__u32, busy_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v17);
+DOMAIN_FIELD(__u32, busy_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v17);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v17);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
+		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v17);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category idle> ");
+#endif
+DOMAIN_FIELD(__u32, idle_lb_count,
+	     "load_balance() count on cpu idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_balanced,
+	     "load_balance() found balanced on cpu idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_failed,
+	     "load_balance() move task failed on cpu idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_load,
+	     "imbalance in load on cpu idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_util,
+	     "imbalance in utilization on cpu idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_task,
+	     "imbalance in number of tasks on cpu idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit,
+	     "imbalance in misfit tasks on cpu idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, idle_lb_gained,
+	     "pull_task() count on cpu idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, idle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, idle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v17);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v17);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
+		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v17);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category newidle> ");
+#endif
+DOMAIN_FIELD(__u32, newidle_lb_count,
+	     "load_balance() count on cpu newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_balanced,
+	     "load_balance() found balanced on cpu newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_failed,
+	     "load_balance() move task failed on cpu newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_load,
+	     "imbalance in load on cpu newly idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_util,
+	     "imbalance in utilization on cpu newly idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_task,
+	     "imbalance in number of tasks on cpu newly idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit,
+	     "imbalance in misfit tasks on cpu newly idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, newidle_lb_gained,
+	     "pull_task() count on cpu newly idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
+	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v17);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
+	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
+	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v17);
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v17);
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
+		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v17);
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category active_load_balance()> ");
+#endif
+DOMAIN_FIELD(__u32, alb_count,
+	     "active_load_balance() count", "%11u", false, v17);
+DOMAIN_FIELD(__u32, alb_failed,
+	     "active_load_balance() move task failed", "%11u", false, v17);
+DOMAIN_FIELD(__u32, alb_pushed,
+	     "active_load_balance() successfully moved a task", "%11u", false, v17);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
+#endif
+DOMAIN_FIELD(__u32, sbe_count,
+	     "sbe_count is not used", "%11u", false, v17);
+DOMAIN_FIELD(__u32, sbe_balanced,
+	     "sbe_balanced is not used", "%11u", false, v17);
+DOMAIN_FIELD(__u32, sbe_pushed,
+	     "sbe_pushed is not used", "%11u", false, v17);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
+#endif
+DOMAIN_FIELD(__u32, sbf_count,
+	     "sbf_count is not used", "%11u", false, v17);
+DOMAIN_FIELD(__u32, sbf_balanced,
+	     "sbf_balanced is not used", "%11u", false, v17);
+DOMAIN_FIELD(__u32, sbf_pushed,
+	     "sbf_pushed is not used", "%11u", false, v17);
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY(" <Wakeup Info> ");
 #endif
+DOMAIN_FIELD(__u32, ttwu_wake_remote,
+	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v17);
+DOMAIN_FIELD(__u32, ttwu_move_affine,
+	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v17);
+DOMAIN_FIELD(__u32, ttwu_move_balance,
+	     "try_to_wake_up() started passive balancing", "%11u", false, v17);
+#endif /* DOMAIN_FIELD */
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 1c3b56013164..e2e7dbc4f0aa 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -3869,6 +3869,501 @@ static int perf_sched__schedstat_record(struct perf_sched *sched,
 	return err;
 }
 
+struct schedstat_domain {
+	struct perf_record_schedstat_domain *domain_data;
+	struct schedstat_domain *next;
+};
+
+struct schedstat_cpu {
+	struct perf_record_schedstat_cpu *cpu_data;
+	struct schedstat_domain *domain_head;
+	struct schedstat_cpu *next;
+};
+
+struct schedstat_cpu *cpu_head = NULL, *cpu_tail = NULL, *cpu_second_pass = NULL;
+struct schedstat_domain *domain_tail = NULL, *domain_second_pass = NULL;
+bool after_workload_flag;
+
+static void store_schedtstat_cpu_diff(struct schedstat_cpu *after_workload)
+{
+	struct perf_record_schedstat_cpu *before = cpu_second_pass->cpu_data;
+	struct perf_record_schedstat_cpu *after = after_workload->cpu_data;
+	__u16 version = after_workload->cpu_data->version;
+
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
+	(before->_ver._name = after->_ver._name - before->_ver._name)
+
+	if (version == 15) {
+#include <perf/schedstat-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-v16.h>
+	} else if (version == 17) {
+#include <perf/schedstat-v17.h>
+	}
+
+#undef CPU_FIELD
+}
+
+static void store_schedstat_domain_diff(struct schedstat_domain *after_workload)
+{
+	struct perf_record_schedstat_domain *before = domain_second_pass->domain_data;
+	struct perf_record_schedstat_domain *after = after_workload->domain_data;
+	__u16 version = after_workload->domain_data->version;
+
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
+	(before->_ver._name = after->_ver._name - before->_ver._name)
+
+	if (version == 15) {
+#include <perf/schedstat-v15.h>
+	} else if (version == 16) {
+#include <perf/schedstat-v16.h>
+	} else if (version == 17) {
+#include <perf/schedstat-v17.h>
+	}
+#undef DOMAIN_FIELD
+}
+
+static void print_separator(size_t pre_dash_cnt, const char *s, size_t post_dash_cnt)
+{
+	size_t i;
+
+	for (i = 0; i < pre_dash_cnt; ++i)
+		printf("-");
+
+	printf("%s", s);
+
+	for (i = 0; i < post_dash_cnt; ++i)
+		printf("-");
+
+	printf("\n");
+}
+
+static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
+{
+	printf("%-65s %12s %12s\n", "DESC", "COUNT", "PCT_CHANGE");
+	print_separator(100, "", 0);
+
+#define CALC_PCT(_x, _y)	((_y) ? ((double)(_x) / (_y)) * 100 : 0.0)
+
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
+	do {									\
+		printf("%-65s: " _format, _desc, cs->_ver._name);		\
+		if (_is_pct) {							\
+			printf("  ( %8.2lf%% )",				\
+			       CALC_PCT(cs->_ver._name, cs->_ver._pct_of));	\
+		}								\
+		printf("\n");							\
+	} while (0)
+
+	if (cs->version == 15) {
+#include <perf/schedstat-v15.h>
+	} else if (cs->version == 16) {
+#include <perf/schedstat-v16.h>
+	} else if (cs->version == 17) {
+#include <perf/schedstat-v17.h>
+	}
+
+#undef CPU_FIELD
+#undef CALC_PCT
+}
+
+static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
+				      __u64 jiffies)
+{
+	printf("%-65s %12s %14s\n", "DESC", "COUNT", "AVG_JIFFIES");
+
+#define DOMAIN_CATEGORY(_desc)							\
+	do {									\
+		size_t _len = strlen(_desc);					\
+		size_t _pre_dash_cnt = (100 - _len) / 2;			\
+		size_t _post_dash_cnt = 100 - _len - _pre_dash_cnt;		\
+		print_separator(_pre_dash_cnt, _desc, _post_dash_cnt);		\
+	} while (0)
+
+#define CALC_AVG(_x, _y)	((_y) ? (long double)(_x) / (_y) : 0.0)
+
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
+	do {									\
+		printf("%-65s: " _format, _desc, ds->_ver._name);		\
+		if (_is_jiffies) {						\
+			printf("  $ %11.2Lf $",					\
+			       CALC_AVG(jiffies, ds->_ver._name));		\
+		}								\
+		printf("\n");							\
+	} while (0)
+
+#define DERIVED_CNT_FIELD(_desc, _format, _x, _y, _z, _ver)			\
+	printf("*%-64s: " _format "\n", _desc,					\
+	       (ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))
+
+#define DERIVED_AVG_FIELD(_desc, _format, _x, _y, _z, _w, _ver)			\
+	printf("*%-64s: " _format "\n", _desc, CALC_AVG(ds->_ver._w,		\
+	       ((ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))))
+
+	if (ds->version == 15) {
+#include <perf/schedstat-v15.h>
+	} else if (ds->version == 16) {
+#include <perf/schedstat-v16.h>
+	} else if (ds->version == 17) {
+#include <perf/schedstat-v17.h>
+	}
+
+#undef DERIVED_AVG_FIELD
+#undef DERIVED_CNT_FIELD
+#undef DOMAIN_FIELD
+#undef CALC_AVG
+#undef DOMAIN_CATEGORY
+}
+
+static void print_domain_cpu_list(struct perf_record_schedstat_domain *ds)
+{
+	char bin[16][5] = {"0000", "0001", "0010", "0011",
+			   "0100", "0101", "0110", "0111",
+			   "1000", "1001", "1010", "1011",
+			   "1100", "1101", "1110", "1111"};
+	bool print_flag = false, low = true;
+	int cpu = 0, start, end, idx;
+
+	idx = ((ds->nr_cpus + 7) >> 3) - 1;
+
+	printf("<");
+	while (idx >= 0) {
+		__u8 index;
+
+		if (low)
+			index = ds->cpu_mask[idx] & 0xf;
+		else
+			index = (ds->cpu_mask[idx--] & 0xf0) >> 4;
+
+		for (int i = 3; i >= 0; i--) {
+			if (!print_flag && bin[index][i] == '1') {
+				start = cpu;
+				print_flag = true;
+			} else if (print_flag && bin[index][i] == '0') {
+				end = cpu - 1;
+				print_flag = false;
+				if (start == end)
+					printf("%d, ", start);
+				else
+					printf("%d-%d, ", start, end);
+			}
+			cpu++;
+		}
+
+		low = !low;
+	}
+
+	if (print_flag) {
+		if (start == cpu - 1)
+			printf("%d, ", start);
+		else
+			printf("%d-%d, ", start, cpu - 1);
+	}
+	printf("\b\b>\n");
+}
+
+static void summarize_schedstat_cpu(struct schedstat_cpu *summary_cpu,
+				    struct schedstat_cpu *cptr,
+				    int cnt, bool is_last)
+{
+	struct perf_record_schedstat_cpu *summary_cs = summary_cpu->cpu_data,
+					 *temp_cs = cptr->cpu_data;
+
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
+	do {									\
+		summary_cs->_ver._name += temp_cs->_ver._name;			\
+		if (is_last)							\
+			summary_cs->_ver._name /= cnt;				\
+	} while (0)
+
+	if (cptr->cpu_data->version == 15) {
+#include <perf/schedstat-v15.h>
+	} else if (cptr->cpu_data->version == 16) {
+#include <perf/schedstat-v16.h>
+	} else if (cptr->cpu_data->version == 17) {
+#include <perf/schedstat-v17.h>
+	}
+#undef CPU_FIELD
+}
+
+static void summarize_schedstat_domain(struct schedstat_domain *summary_domain,
+				       struct schedstat_domain *dptr,
+				       int cnt, bool is_last)
+{
+	struct perf_record_schedstat_domain *summary_ds = summary_domain->domain_data,
+					    *temp_ds = dptr->domain_data;
+
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
+	do {									\
+		summary_ds->_ver._name += temp_ds->_ver._name;			\
+		if (is_last)							\
+			summary_ds->_ver._name /= cnt;				\
+	} while (0)
+
+	if (dptr->domain_data->version == 15) {
+#include <perf/schedstat-v15.h>
+	} else if (dptr->domain_data->version == 16) {
+#include <perf/schedstat-v16.h>
+	} else if (dptr->domain_data->version == 17) {
+#include <perf/schedstat-v17.h>
+	}
+#undef DOMAIN_FIELD
+}
+
+static void get_all_cpu_stats(struct schedstat_cpu **cptr)
+{
+	struct schedstat_domain *dptr = NULL, *tdptr = NULL, *dtail = NULL;
+	struct schedstat_cpu *tcptr = *cptr, *summary_head = NULL;
+	struct perf_record_schedstat_domain *ds = NULL;
+	struct perf_record_schedstat_cpu *cs = NULL;
+	bool is_last = false;
+	int cnt = 0;
+
+	if (tcptr) {
+		summary_head = zalloc(sizeof(*summary_head));
+		summary_head->cpu_data = zalloc(sizeof(*cs));
+		memcpy(summary_head->cpu_data, tcptr->cpu_data, sizeof(*cs));
+		summary_head->next = NULL;
+		summary_head->domain_head = NULL;
+		dptr = tcptr->domain_head;
+
+		while (dptr) {
+			size_t cpu_mask_size = (dptr->domain_data->nr_cpus + 7) >> 3;
+
+			tdptr = zalloc(sizeof(*tdptr));
+			tdptr->domain_data = zalloc(sizeof(*ds) + cpu_mask_size);
+			memcpy(tdptr->domain_data, dptr->domain_data, sizeof(*ds) + cpu_mask_size);
+
+			tdptr->next = NULL;
+			if (!dtail) {
+				summary_head->domain_head = tdptr;
+				dtail = tdptr;
+			} else {
+				dtail->next = tdptr;
+				dtail = dtail->next;
+			}
+			dptr = dptr->next;
+		}
+	}
+
+	tcptr = (*cptr)->next;
+	while (tcptr) {
+		if (!tcptr->next)
+			is_last = true;
+
+		cnt++;
+		summarize_schedstat_cpu(summary_head, tcptr, cnt, is_last);
+		tdptr = summary_head->domain_head;
+		dptr = tcptr->domain_head;
+
+		while (tdptr) {
+			summarize_schedstat_domain(tdptr, dptr, cnt, is_last);
+			tdptr = tdptr->next;
+			dptr = dptr->next;
+		}
+		tcptr = tcptr->next;
+	}
+
+	tcptr = *cptr;
+	summary_head->next = tcptr;
+	*cptr = summary_head;
+}
+
+/* FIXME: The code fails (segfaults) when one or more cpus are offline. */
+static void show_schedstat_data(struct schedstat_cpu *cptr)
+{
+	struct perf_record_schedstat_domain *ds = NULL;
+	struct perf_record_schedstat_cpu *cs = NULL;
+	__u64 jiffies = cptr->cpu_data->timestamp;
+	struct schedstat_domain *dptr = NULL;
+	bool is_summary = true;
+
+	printf("Columns description\n");
+	print_separator(100, "", 0);
+	printf("DESC\t\t\t-> Description of the field\n");
+	printf("COUNT\t\t\t-> Value of the field\n");
+	printf("PCT_CHANGE\t\t-> Percent change with corresponding base value\n");
+	printf("AVG_JIFFIES\t\t-> Avg time in jiffies between two consecutive occurrences of an event\n");
+
+	print_separator(100, "", 0);
+	printf("Time elapsed (in jiffies)                                        : %11llu\n",
+	       jiffies);
+	print_separator(100, "", 0);
+
+	get_all_cpu_stats(&cptr);
+
+	while (cptr) {
+		cs = cptr->cpu_data;
+		printf("\n");
+		print_separator(100, "", 0);
+		if (is_summary)
+			printf("CPU <ALL CPUS SUMMARY>\n");
+		else
+			printf("CPU %d\n", cs->cpu);
+
+		print_separator(100, "", 0);
+		print_cpu_stats(cs);
+		print_separator(100, "", 0);
+
+		dptr = cptr->domain_head;
+
+		while (dptr) {
+			ds = dptr->domain_data;
+			if (is_summary)
+				if (ds->name[0])
+					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %s\n", ds->name);
+				else
+					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %d\n", ds->domain);
+			else {
+				if (ds->name[0])
+					printf("CPU %d, DOMAIN %s CPUS ", cs->cpu, ds->name);
+				else
+					printf("CPU %d, DOMAIN %d CPUS ", cs->cpu, ds->domain);
+
+				print_domain_cpu_list(ds);
+			}
+			print_separator(100, "", 0);
+			print_domain_stats(ds, jiffies);
+			print_separator(100, "", 0);
+
+			dptr = dptr->next;
+		}
+		is_summary = false;
+		cptr = cptr->next;
+	}
+}
+
+static int perf_sched__process_schedstat(struct perf_session *session __maybe_unused,
+					 union perf_event *event)
+{
+	struct perf_cpu this_cpu;
+	static __u32 initial_cpu;
+
+	switch (event->header.type) {
+	case PERF_RECORD_SCHEDSTAT_CPU:
+		this_cpu.cpu = event->schedstat_cpu.cpu;
+		break;
+	case PERF_RECORD_SCHEDSTAT_DOMAIN:
+		this_cpu.cpu = event->schedstat_domain.cpu;
+		break;
+	default:
+		return 0;
+	}
+
+	if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu))
+		return 0;
+
+	if (event->header.type == PERF_RECORD_SCHEDSTAT_CPU) {
+		struct schedstat_cpu *temp = zalloc(sizeof(struct schedstat_cpu));
+
+		temp->cpu_data = zalloc(sizeof(struct perf_record_schedstat_cpu));
+		memcpy(temp->cpu_data, &event->schedstat_cpu,
+		       sizeof(struct perf_record_schedstat_cpu));
+		temp->next = NULL;
+		temp->domain_head = NULL;
+
+		if (cpu_head && temp->cpu_data->cpu == initial_cpu)
+			after_workload_flag = true;
+
+		if (!after_workload_flag) {
+			if (!cpu_head) {
+				initial_cpu = temp->cpu_data->cpu;
+				cpu_head = temp;
+			} else
+				cpu_tail->next = temp;
+
+			cpu_tail = temp;
+		} else {
+			if (temp->cpu_data->cpu == initial_cpu) {
+				cpu_second_pass = cpu_head;
+				cpu_head->cpu_data->timestamp =
+					temp->cpu_data->timestamp - cpu_second_pass->cpu_data->timestamp;
+			} else {
+				cpu_second_pass = cpu_second_pass->next;
+			}
+			domain_second_pass = cpu_second_pass->domain_head;
+			store_schedtstat_cpu_diff(temp);
+		}
+	} else if (event->header.type == PERF_RECORD_SCHEDSTAT_DOMAIN) {
+		size_t cpu_mask_size = (event->schedstat_domain.nr_cpus + 7) >> 3;
+		struct schedstat_domain *temp = zalloc(sizeof(struct schedstat_domain));
+
+		temp->domain_data = zalloc(sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
+		memcpy(temp->domain_data, &event->schedstat_domain,
+		       sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
+		temp->next = NULL;
+
+		if (!after_workload_flag) {
+			if (cpu_tail->domain_head == NULL) {
+				cpu_tail->domain_head = temp;
+				domain_tail = temp;
+			} else {
+				domain_tail->next = temp;
+				domain_tail = temp;
+			}
+		} else {
+			store_schedstat_domain_diff(temp);
+			domain_second_pass = domain_second_pass->next;
+		}
+	}
+
+	return 0;
+}
+
+static void free_schedstat(struct schedstat_cpu *cptr)
+{
+	struct schedstat_domain *dptr = NULL, *tmp_dptr;
+	struct schedstat_cpu *tmp_cptr;
+
+	while (cptr) {
+		tmp_cptr = cptr;
+		dptr = cptr->domain_head;
+
+		while (dptr) {
+			tmp_dptr = dptr;
+			dptr = dptr->next;
+			free(tmp_dptr);
+		}
+		cptr = cptr->next;
+		free(tmp_cptr);
+	}
+}
+
+static int perf_sched__schedstat_report(struct perf_sched *sched)
+{
+	struct perf_session *session;
+	struct perf_data data = {
+		.path  = input_name,
+		.mode  = PERF_DATA_MODE_READ,
+	};
+	int err;
+
+	if (cpu_list) {
+		user_requested_cpus = perf_cpu_map__new(cpu_list);
+		if (!user_requested_cpus)
+			return -EINVAL;
+	}
+
+	sched->tool.schedstat_cpu = perf_sched__process_schedstat;
+	sched->tool.schedstat_domain = perf_sched__process_schedstat;
+
+	session = perf_session__new(&data, &sched->tool);
+	if (IS_ERR(session)) {
+		pr_err("Perf session creation failed.\n");
+		return PTR_ERR(session);
+	}
+
+	err = perf_session__process_events(session);
+
+	perf_session__delete(session);
+	if (!err) {
+		setup_pager();
+		show_schedstat_data(cpu_head);
+		free_schedstat(cpu_head);
+	}
+	return err;
+}
+
 static bool schedstat_events_exposed(void)
 {
 	/*
@@ -4046,6 +4541,8 @@ int cmd_sched(int argc, const char **argv)
 	OPT_PARENT(sched_options)
 	};
 	const struct option stats_options[] = {
+	OPT_STRING('i', "input", &input_name, "file",
+		   "`stats report` with input filename"),
 	OPT_STRING('o', "output", &output_name, "file",
 		   "`stats record` with output filename"),
 	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
@@ -4171,7 +4668,7 @@ int cmd_sched(int argc, const char **argv)
 
 		return perf_sched__timehist(&sched);
 	} else if (!strcmp(argv[0], "stats")) {
-		const char *const stats_subcommands[] = {"record", NULL};
+		const char *const stats_subcommands[] = {"record", "report", NULL};
 
 		argc = parse_options_subcommand(argc, argv, stats_options,
 						stats_subcommands,
@@ -4183,6 +4680,11 @@ int cmd_sched(int argc, const char **argv)
 				argc = parse_options(argc, argv, stats_options,
 						     stats_usage, 0);
 			return perf_sched__schedstat_record(&sched, argc, argv);
+		} else if (argv[0] && !strcmp(argv[0], "report")) {
+			if (argc)
+				argc = parse_options(argc, argv, stats_options,
+						     stats_usage, 0);
+			return perf_sched__schedstat_report(&sched);
 		}
 		usage_with_options(stats_usage, stats_options);
 	} else {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index d09c3c99ab48..4071bd95192d 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -560,7 +560,7 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
 
 	size = fprintf(fp, "\ncpu%u ", cs->cpu);
 
-#define CPU_FIELD(_type, _name, _ver)						\
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
 	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)cs->_ver._name)
 
 	if (version == 15) {
@@ -641,7 +641,7 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
 	size += fprintf(fp, "%s ", cpu_mask);
 	free(cpu_mask);
 
-#define DOMAIN_FIELD(_type, _name, _ver)					\
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
 	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)ds->_ver._name)
 
 	if (version == 15) {
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index fad0c472f297..495ed8433c0c 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2538,7 +2538,7 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
 	if (io__get_dec(io, (__u64 *)cpu) != ' ')
 		goto out_cpu;
 
-#define CPU_FIELD(_type, _name, _ver)					\
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
 	do {								\
 		__u64 _tmp;						\
 		ch = io__get_dec(io, &_tmp);				\
@@ -2662,7 +2662,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
 	free(d_name);
 	free(cpu_mask);
 
-#define DOMAIN_FIELD(_type, _name, _ver)				\
+#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
 	do {								\
 		__u64 _tmp;						\
 		ch = io__get_dec(io, &_tmp);				\
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 5/8] perf sched stats: Add support for live mode
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
                   ` (3 preceding siblings ...)
  2025-03-11 12:02 ` [PATCH v3 4/8] perf sched stats: Add support for report subcommand Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-15  4:46   ` Namhyung Kim
  2025-03-11 12:02 ` [PATCH v3 6/8] perf sched stats: Add support for diff subcommand Swapnil Sapkal
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das,
	James Clark

The live mode works similarly to a simple `perf stat` command: it
profiles the target and prints the results on the terminal as soon as
the target finishes.

Example usage:

  # perf sched stats -- sleep 10

Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/perf/builtin-sched.c | 87 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index e2e7dbc4f0aa..9813e25b54b8 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -4364,6 +4364,91 @@ static int perf_sched__schedstat_report(struct perf_sched *sched)
 	return err;
 }
 
+static int process_synthesized_event_live(const struct perf_tool *tool __maybe_unused,
+					  union perf_event *event,
+					  struct perf_sample *sample __maybe_unused,
+					  struct machine *machine __maybe_unused)
+{
+	return perf_sched__process_schedstat(NULL, event);
+}
+
+static int perf_sched__schedstat_live(struct perf_sched *sched,
+				      int argc, const char **argv)
+{
+	struct evlist *evlist;
+	struct target *target;
+	int reset = 0;
+	int err = 0;
+
+	signal(SIGINT, sighandler);
+	signal(SIGCHLD, sighandler);
+	signal(SIGTERM, sighandler);
+
+	evlist = evlist__new();
+	if (!evlist)
+		return -ENOMEM;
+
+	/*
+	 * `perf sched schedstat` does not support workload profiling (-p pid)
+	 * since the /proc/schedstat file contains cpu-specific data only.
+	 * Hence, a profile target is either a set of cpus or systemwide, never
+	 * a process. Note that, although `-- <workload>` is supported, profile
+	 * data are still cpu/systemwide.
+	 */
+	target = zalloc(sizeof(struct target));
+	if (cpu_list)
+		target->cpu_list = cpu_list;
+	else
+		target->system_wide = true;
+
+	if (argc) {
+		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
+		if (err)
+			goto out_target;
+	}
+
+	if (cpu_list) {
+		user_requested_cpus = perf_cpu_map__new(cpu_list);
+		if (!user_requested_cpus)
+			goto out_target;
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_event_live,
+					       user_requested_cpus);
+	if (err < 0)
+		goto out_target;
+
+	err = enable_sched_schedstats(&reset);
+	if (err < 0)
+		goto out_target;
+
+	if (argc)
+		evlist__start_workload(evlist);
+
+	/* wait for signal */
+	pause();
+
+	if (reset) {
+		err = disable_sched_schedstat();
+		if (err < 0)
+			goto out_target;
+	}
+
+	err = perf_event__synthesize_schedstat(&(sched->tool),
+					       process_synthesized_event_live,
+					       user_requested_cpus);
+	if (err)
+		goto out_target;
+
+	setup_pager();
+	show_schedstat_data(cpu_head);
+	free_schedstat(cpu_head);
+out_target:
+	free(target);
+	return err;
+}
+
 static bool schedstat_events_exposed(void)
 {
 	/*
@@ -4686,7 +4771,7 @@ int cmd_sched(int argc, const char **argv)
 						     stats_usage, 0);
 			return perf_sched__schedstat_report(&sched);
 		}
-		usage_with_options(stats_usage, stats_options);
+		return perf_sched__schedstat_live(&sched, argc, argv);
 	} else {
 		usage_with_options(sched_usage, sched_options);
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 6/8] perf sched stats: Add support for diff subcommand
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
                   ` (4 preceding siblings ...)
  2025-03-11 12:02 ` [PATCH v3 5/8] perf sched stats: Add support for live mode Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 7/8] perf sched stats: Add basic perf sched stats test Swapnil Sapkal
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

The `perf sched stats diff` subcommand takes two perf.data files as
input and prints the diff between them. The default inputs to this
subcommand are perf.data.old and perf.data.

Example usage:

 # perf sched stats diff sample1.data sample2.data

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/perf/builtin-sched.c | 277 +++++++++++++++++++++++++++++--------
 1 file changed, 221 insertions(+), 56 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9813e25b54b8..bd86cc73e156 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -3938,28 +3938,44 @@ static void print_separator(size_t pre_dash_cnt, const char *s, size_t post_dash
 	printf("\n");
 }
 
-static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
+#define PCT_CHNG(_x, _y)        ((_x) ? ((double)((double)(_y) - (_x)) / (_x)) * 100 : 0.0)
+static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs1,
+				   struct perf_record_schedstat_cpu *cs2)
 {
-	printf("%-65s %12s %12s\n", "DESC", "COUNT", "PCT_CHANGE");
+	printf("%-65s ", "DESC");
+	if (!cs2)
+		printf("%12s %12s", "COUNT", "PCT_CHANGE");
+	else
+		printf("%12s %11s %12s %14s %10s", "COUNT1", "COUNT2", "PCT_CHANGE",
+		       "PCT_CHANGE1", "PCT_CHANGE2");
+
+	printf("\n");
 	print_separator(100, "", 0);
 
 #define CALC_PCT(_x, _y)	((_y) ? ((double)(_x) / (_y)) * 100 : 0.0)
-
-#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
-	do {									\
-		printf("%-65s: " _format, _desc, cs->_ver._name);		\
-		if (_is_pct) {							\
-			printf("  ( %8.2lf%% )",				\
-			       CALC_PCT(cs->_ver._name, cs->_ver._pct_of));	\
-		}								\
-		printf("\n");							\
+#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)			\
+	do {										\
+		printf("%-65s: " _format, _desc, cs1->_ver._name);			\
+		if (!cs2) {								\
+			if (_is_pct)							\
+				printf("  ( %8.2lf%% )",				\
+				       CALC_PCT(cs1->_ver._name, cs1->_ver._pct_of));	\
+		} else {								\
+			printf("," _format "  | %8.2lf%% |", cs2->_ver._name,		\
+			       PCT_CHNG(cs1->_ver._name, cs2->_ver._name));		\
+			if (_is_pct)							\
+				printf("  ( %8.2lf%%,  %8.2lf%% )",			\
+				       CALC_PCT(cs1->_ver._name, cs1->_ver._pct_of),	\
+				       CALC_PCT(cs2->_ver._name, cs2->_ver._pct_of));	\
+		}									\
+		printf("\n");								\
 	} while (0)
 
-	if (cs->version == 15) {
+	if (cs1->version == 15) {
 #include <perf/schedstat-v15.h>
-	} else if (cs->version == 16) {
+	} else if (cs1->version == 16) {
 #include <perf/schedstat-v16.h>
-	} else if (cs->version == 17) {
+	} else if (cs1->version == 17) {
 #include <perf/schedstat-v17.h>
 	}
 
@@ -3967,10 +3983,17 @@ static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
 #undef CALC_PCT
 }
 
-static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
-				      __u64 jiffies)
+static inline void print_domain_stats(struct perf_record_schedstat_domain *ds1,
+				      struct perf_record_schedstat_domain *ds2,
+				      __u64 jiffies1, __u64 jiffies2)
 {
-	printf("%-65s %12s %14s\n", "DESC", "COUNT", "AVG_JIFFIES");
+	printf("%-65s ", "DESC");
+	if (!ds2)
+		printf("%12s %14s", "COUNT", "AVG_JIFFIES");
+	else
+		printf("%12s %11s %12s %16s %12s", "COUNT1", "COUNT2", "PCT_CHANGE",
+		       "AVG_JIFFIES1", "AVG_JIFFIES2");
+	printf("\n");
 
 #define DOMAIN_CATEGORY(_desc)							\
 	do {									\
@@ -3984,27 +4007,54 @@ static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
 
 #define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
 	do {									\
-		printf("%-65s: " _format, _desc, ds->_ver._name);		\
-		if (_is_jiffies) {						\
-			printf("  $ %11.2Lf $",					\
-			       CALC_AVG(jiffies, ds->_ver._name));		\
+		printf("%-65s: " _format, _desc, ds1->_ver._name);		\
+		if (!ds2) {							\
+			if (_is_jiffies)					\
+				printf("  $ %11.2Lf $",				\
+				       CALC_AVG(jiffies1, ds1->_ver._name));	\
+		} else {							\
+			printf("," _format "  | %8.2lf%% |", ds2->_ver._name,	\
+			       PCT_CHNG(ds1->_ver._name, ds2->_ver._name));	\
+			if (_is_jiffies)					\
+				printf("  $ %11.2Lf, %11.2Lf $",		\
+				       CALC_AVG(jiffies1, ds1->_ver._name),	\
+				       CALC_AVG(jiffies2, ds2->_ver._name));	\
 		}								\
 		printf("\n");							\
 	} while (0)
 
 #define DERIVED_CNT_FIELD(_desc, _format, _x, _y, _z, _ver)			\
-	printf("*%-64s: " _format "\n", _desc,					\
-	       (ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))
+	do {									\
+		__u32 t1 = ds1->_ver._x - ds1->_ver._y - ds1->_ver._z;		\
+		printf("*%-64s: " _format, _desc, t1);				\
+		if (ds2) {							\
+			__u32 t2 = ds2->_ver._x - ds2->_ver._y - ds2->_ver._z;	\
+			printf("," _format "  | %8.2lf%% |", t2,		\
+			       PCT_CHNG(t1, t2));				\
+		}								\
+		printf("\n");							\
+	} while (0)
 
 #define DERIVED_AVG_FIELD(_desc, _format, _x, _y, _z, _w, _ver)			\
-	printf("*%-64s: " _format "\n", _desc, CALC_AVG(ds->_ver._w,		\
-	       ((ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))))
+	do {									\
+		__u32 t1 = ds1->_ver._x - ds1->_ver._y - ds1->_ver._z;		\
+		printf("*%-64s: " _format, _desc,				\
+		       CALC_AVG(ds1->_ver._w, t1));				\
+		if (ds2) {							\
+			__u32 t2 = ds2->_ver._x - ds2->_ver._y - ds2->_ver._z;	\
+			printf("," _format "  | %8.2Lf%% |",			\
+			       CALC_AVG(ds2->_ver._w, t2),			\
+			       PCT_CHNG(CALC_AVG(ds1->_ver._w, t1),		\
+					CALC_AVG(ds2->_ver._w, t2)));		\
+		}								\
+		printf("\n");							\
+	} while (0)
 
-	if (ds->version == 15) {
+	if (ds1->version == 15) {
 #include <perf/schedstat-v15.h>
-	} else if (ds->version == 16) {
+	} else if (ds1->version == 16) {
 #include <perf/schedstat-v16.h>
-	} else if (ds->version == 17) {
+	} else if (ds1->version == 17) {
 #include <perf/schedstat-v17.h>
 	}
 
@@ -4014,6 +4064,7 @@ static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
 #undef CALC_AVG
 #undef DOMAIN_CATEGORY
 }
+#undef PCT_CHNG
 
 static void print_domain_cpu_list(struct perf_record_schedstat_domain *ds)
 {
@@ -4169,13 +4220,13 @@ static void get_all_cpu_stats(struct schedstat_cpu **cptr)
 	*cptr = summary_head;
 }
 
-/* FIXME: The code fails (segfaults) when one or ore cpus are offline. */
-static void show_schedstat_data(struct schedstat_cpu *cptr)
+static void show_schedstat_data(struct schedstat_cpu *cptr1, struct schedstat_cpu *cptr2,
+				bool summary_only)
 {
-	struct perf_record_schedstat_domain *ds = NULL;
-	struct perf_record_schedstat_cpu *cs = NULL;
-	__u64 jiffies = cptr->cpu_data->timestamp;
-	struct schedstat_domain *dptr = NULL;
+	struct perf_record_schedstat_domain *ds1 = NULL, *ds2 = NULL;
+	struct perf_record_schedstat_cpu *cs1 = NULL, *cs2 = NULL;
+	struct schedstat_domain *dptr1 = NULL, *dptr2 = NULL;
+	__u64 jiffies1 = 0, jiffies2 = 0;
 	bool is_summary = true;
 
 	printf("Columns description\n");
@@ -4186,50 +4237,83 @@ static void show_schedstat_data(struct schedstat_cpu *cptr)
 	printf("AVG_JIFFIES\t\t-> Avg time in jiffies between two consecutive occurrence of event\n");
 
 	print_separator(100, "", 0);
-	printf("Time elapsed (in jiffies)                                        : %11llu\n",
-	       jiffies);
+	printf("Time elapsed (in jiffies)                                        : ");
+	jiffies1 = cptr1->cpu_data->timestamp;
+	printf("%11llu", jiffies1);
+	if (cptr2) {
+		jiffies2 = cptr2->cpu_data->timestamp;
+		printf(",%11llu", jiffies2);
+	}
+	printf("\n");
+
 	print_separator(100, "", 0);
 
-	get_all_cpu_stats(&cptr);
+	get_all_cpu_stats(&cptr1);
+	if (cptr2)
+		get_all_cpu_stats(&cptr2);
+
+	while (cptr1) {
+		cs1 = cptr1->cpu_data;
+		if (cptr2) {
+			cs2 = cptr2->cpu_data;
+			dptr2 = cptr2->domain_head;
+		}
+
+		if (cs2 && cs1->cpu != cs2->cpu) {
+			pr_err("Failed because matching cpus not found for diff\n");
+			return;
+		}
 
-	while (cptr) {
-		cs = cptr->cpu_data;
 		printf("\n");
 		print_separator(100, "", 0);
 		if (is_summary)
 			printf("CPU <ALL CPUS SUMMARY>\n");
 		else
-			printf("CPU %d\n", cs->cpu);
+			printf("CPU %d\n", cs1->cpu);
 
 		print_separator(100, "", 0);
-		print_cpu_stats(cs);
+		print_cpu_stats(cs1, cs2);
 		print_separator(100, "", 0);
 
-		dptr = cptr->domain_head;
+		dptr1 = cptr1->domain_head;
+
+		while (dptr1) {
+			ds1 = dptr1->domain_data;
+
+			if (dptr2)
+				ds2 = dptr2->domain_data;
+
+			if (dptr2 && ds1->domain != ds2->domain) {
+				pr_err("Failed because matching domain not found for diff\n");
+				return;
+			}
 
-		while (dptr) {
-			ds = dptr->domain_data;
 			if (is_summary)
-				if (ds->name[0])
-					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %s\n", ds->name);
+				if (ds1->name[0])
+					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %s\n", ds1->name);
 				else
-					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %d\n", ds->domain);
+					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %d\n", ds1->domain);
 			else {
-				if (ds->name[0])
-					printf("CPU %d, DOMAIN %s CPUS ", cs->cpu, ds->name);
+				if (ds1->name[0])
+					printf("CPU %d, DOMAIN %s CPUS ", cs1->cpu, ds1->name);
 				else
-					printf("CPU %d, DOMAIN %d CPUS ", cs->cpu, ds->domain);
+					printf("CPU %d, DOMAIN %d CPUS ", cs1->cpu, ds1->domain);
 
-				print_domain_cpu_list(ds);
+				print_domain_cpu_list(ds1);
 			}
 			print_separator(100, "", 0);
-			print_domain_stats(ds, jiffies);
+			print_domain_stats(ds1, ds2, jiffies1, jiffies2);
 			print_separator(100, "", 0);
 
-			dptr = dptr->next;
+			dptr1 = dptr1->next;
+			if (dptr2)
+				dptr2 = dptr2->next;
 		}
+		if (summary_only)
+			break;
+
 		is_summary = false;
-		cptr = cptr->next;
+		cptr1 = cptr1->next;
 	}
 }
 
@@ -4358,12 +4442,88 @@ static int perf_sched__schedstat_report(struct perf_sched *sched)
 	perf_session__delete(session);
 	if (!err) {
 		setup_pager();
-		show_schedstat_data(cpu_head);
+		show_schedstat_data(cpu_head, NULL, false);
 		free_schedstat(cpu_head);
 	}
 	return err;
 }
 
+static int perf_sched__schedstat_diff(struct perf_sched *sched,
+				      int argc, const char **argv)
+{
+	struct schedstat_cpu *cpu_head_ses0 = NULL, *cpu_head_ses1 = NULL;
+	struct perf_session *session[2];
+	struct perf_data data[2];
+	int ret = 0, err;
+	static const char *defaults[] = {
+		"perf.data.old",
+		"perf.data",
+	};
+
+	if (argc) {
+		if (argc == 1)
+			defaults[1] = argv[0];
+		else if (argc == 2) {
+			defaults[0] = argv[0];
+			defaults[1] = argv[1];
+		} else {
+			pr_err("perf sched stats diff is not supported with more than 2 files.\n");
+			ret = -EINVAL;
+			goto out_ret;
+		}
+	}
+
+	sched->tool.schedstat_cpu = perf_sched__process_schedstat;
+	sched->tool.schedstat_domain = perf_sched__process_schedstat;
+
+	data[0].path = defaults[0];
+	data[0].mode  = PERF_DATA_MODE_READ;
+	session[0] = perf_session__new(&data[0], &sched->tool);
+	if (IS_ERR(session[0])) {
+		ret = PTR_ERR(session[0]);
+		pr_err("Failed to open %s\n", data[0].path);
+		goto out_delete_ses0;
+	}
+
+	err = perf_session__process_events(session[0]);
+	if (err)
+		goto out_delete_ses0;
+
+	cpu_head_ses0 = cpu_head;
+	after_workload_flag = false;
+	cpu_head = NULL;
+
+	data[1].path = defaults[1];
+	data[1].mode  = PERF_DATA_MODE_READ;
+	session[1] = perf_session__new(&data[1], &sched->tool);
+	if (IS_ERR(session[1])) {
+		ret = PTR_ERR(session[1]);
+		pr_err("Failed to open %s\n", data[1].path);
+		goto out_delete_ses1;
+	}
+
+	ret = perf_session__process_events(session[1]);
+	if (ret)
+		goto out_delete_ses1;
+
+	cpu_head_ses1 = cpu_head;
+	after_workload_flag = false;
+	setup_pager();
+	show_schedstat_data(cpu_head_ses0, cpu_head_ses1, true);
+	free_schedstat(cpu_head_ses0);
+	free_schedstat(cpu_head_ses1);
+
+out_delete_ses1:
+	if (!IS_ERR(session[1]))
+		perf_session__delete(session[1]);
+
+out_delete_ses0:
+	if (!IS_ERR(session[0]))
+		perf_session__delete(session[0]);
+
+out_ret:
+	return ret;
+}
+
 static int process_synthesized_event_live(const struct perf_tool *tool __maybe_unused,
 					  union perf_event *event,
 					  struct perf_sample *sample __maybe_unused,
@@ -4442,7 +4602,7 @@ static int perf_sched__schedstat_live(struct perf_sched *sched,
 		goto out_target;
 
 	setup_pager();
-	show_schedstat_data(cpu_head);
+	show_schedstat_data(cpu_head, NULL, false);
 	free_schedstat(cpu_head);
 out_target:
 	free(target);
@@ -4770,6 +4930,11 @@ int cmd_sched(int argc, const char **argv)
 				argc = parse_options(argc, argv, stats_options,
 						     stats_usage, 0);
 			return perf_sched__schedstat_report(&sched);
+		} else if (argv[0] && !strcmp(argv[0], "diff")) {
+			if (argc)
+				argc = parse_options(argc, argv, stats_options,
+						     stats_usage, 0);
+			return perf_sched__schedstat_diff(&sched, argc, argv);
 		}
 		return perf_sched__schedstat_live(&sched, argc, argv);
 	} else {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 7/8] perf sched stats: Add basic perf sched stats test
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
                   ` (5 preceding siblings ...)
  2025-03-11 12:02 ` [PATCH v3 6/8] perf sched stats: Add support for diff subcommand Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-03-11 12:02 ` [PATCH v3 8/8] perf sched stats: Add details in man page Swapnil Sapkal
  2025-04-10  9:41 ` [PATCH v3 0/8] perf sched: Introduce stats tool Chen, Yu C
  8 siblings, 0 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

Add basic test for perf sched stats {record|report|diff} subcommand.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 tools/perf/tests/shell/perf_sched_stats.sh | 64 ++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100755 tools/perf/tests/shell/perf_sched_stats.sh

diff --git a/tools/perf/tests/shell/perf_sched_stats.sh b/tools/perf/tests/shell/perf_sched_stats.sh
new file mode 100755
index 000000000000..ddc926f50129
--- /dev/null
+++ b/tools/perf/tests/shell/perf_sched_stats.sh
@@ -0,0 +1,64 @@
+#!/bin/sh
+# perf sched stats tests
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+err=0
+test_perf_sched_stats_record() {
+  echo "Basic perf sched stats record test"
+  if ! perf sched stats record true 2>&1 | \
+    grep -F -q "[ perf sched stats: Wrote samples to perf.data ]"
+  then
+    echo "Basic perf sched stats record test [Failed]"
+    err=1
+    return
+  fi
+  echo "Basic perf sched stats record test [Success]"
+}
+
+test_perf_sched_stats_report() {
+  echo "Basic perf sched stats report test"
+  perf sched stats record true > /dev/null
+  if ! perf sched stats report 2>&1 | grep -E -q "Columns description"
+  then
+    echo "Basic perf sched stats report test [Failed]"
+    err=1
+    rm perf.data
+    return
+  fi
+  rm perf.data
+  echo "Basic perf sched stats report test [Success]"
+}
+
+test_perf_sched_stats_live() {
+  echo "Basic perf sched stats live mode test"
+  if ! perf sched stats true 2>&1 | grep -E -q "Columns description"
+  then
+    echo "Basic perf sched stats live mode test [Failed]"
+    err=1
+    return
+  fi
+  echo "Basic perf sched stats live mode test [Success]"
+}
+
+test_perf_sched_stats_diff() {
+  echo "Basic perf sched stats diff test"
+  perf sched stats record true > /dev/null
+  perf sched stats record true > /dev/null
+  if ! perf sched stats diff > /dev/null
+  then
+    echo "Basic perf sched stats diff test [Failed]"
+    err=1
+    rm perf.data.old perf.data
+    return
+  fi
+  rm perf.data.old perf.data
+  echo "Basic perf sched stats diff test [Success]"
+}
+
+test_perf_sched_stats_record
+test_perf_sched_stats_report
+test_perf_sched_stats_live
+test_perf_sched_stats_diff
+exit $err
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 8/8] perf sched stats: Add details in man page
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
                   ` (6 preceding siblings ...)
  2025-03-11 12:02 ` [PATCH v3 7/8] perf sched stats: Add basic perf sched stats test Swapnil Sapkal
@ 2025-03-11 12:02 ` Swapnil Sapkal
  2025-04-10  9:41 ` [PATCH v3 0/8] perf sched: Introduce stats tool Chen, Yu C
  8 siblings, 0 replies; 23+ messages in thread
From: Swapnil Sapkal @ 2025-03-11 12:02 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, james.clark
  Cc: ravi.bangoria, swapnil.sapkal, yu.c.chen, mark.rutland,
	alexander.shishkin, jolsa, rostedt, vincent.guittot,
	adrian.hunter, kan.liang, gautham.shenoy, kprateek.nayak,
	juri.lelli, yangjihong, void, tj, sshegde, linux-kernel,
	linux-perf-users, santosh.shukla, ananth.narayan, sandipan.das

Document perf sched stats purpose, usage examples and guide on
how to interpret the report data in the perf-sched man page.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 tools/perf/Documentation/perf-sched.txt | 243 +++++++++++++++++++++++-
 1 file changed, 242 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-sched.txt b/tools/perf/Documentation/perf-sched.txt
index 6dbbddb6464d..c674d95e2811 100644
--- a/tools/perf/Documentation/perf-sched.txt
+++ b/tools/perf/Documentation/perf-sched.txt
@@ -8,7 +8,7 @@ perf-sched - Tool to trace/measure scheduler properties (latencies)
 SYNOPSIS
 --------
 [verse]
-'perf sched' {record|latency|map|replay|script|timehist}
+'perf sched' {record|latency|map|replay|script|timehist|stats}
 
 DESCRIPTION
 -----------
@@ -80,8 +80,249 @@ There are several variants of 'perf sched':
     
    Times are in msec.usec.
 
+   'perf sched stats {record | report | diff} <command>' to capture schedstat
+   counters, report them, and show the difference between two such reports.
+   schedstat counters are present in the Linux kernel and are exposed through
+   the file ``/proc/schedstat``. These counters are enabled or disabled via the
+   sysctl governed by the file ``/proc/sys/kernel/sched_schedstats``. They
+   account for many scheduler events such as ``schedule()`` calls, load-balancing
+   events and ``try_to_wake_up()`` calls, among others. This is useful for
+   understanding the scheduler behavior for the workload.
+
+   Note: The tool will not give correct results if there is topological
+         reordering or onlining/offlining of cpus between the two snapshots
+         of `/proc/schedstat`.
+
+    Example usage:
+        perf sched stats record -- sleep 1
+        perf sched stats report
+        perf sched stats diff
+
+   A detailed description of the schedstats can be found in the Kernel Documentation:
+   https://www.kernel.org/doc/html/latest/scheduler/sched-stats.html
+
+   The result can be interpreted as follows:
+
+   The `perf sched stats report` starts with a description of the columns
+   present in the report. These column names are given before the cpu and
+   domain stats to improve the readability of the report.
+
+   ----------------------------------------------------------------------------------------------------
+   DESC                    -> Description of the field
+   COUNT                   -> Value of the field
+   PCT_CHANGE              -> Percent change with corresponding base value
+   AVG_JIFFIES             -> Avg time in jiffies between two consecutive occurrence of event
+   ----------------------------------------------------------------------------------------------------
+
+   Next is the total profiling time in terms of jiffies:
+
+   ----------------------------------------------------------------------------------------------------
+   Time elapsed (in jiffies)                                   :       24537
+   ----------------------------------------------------------------------------------------------------
+
+   Next are the CPU scheduling statistics. These are simple diffs of the
+   /proc/schedstat CPU lines along with a description. The report also
+   prints the percentage relative to the base stat.
+
+   In the example below, schedule() left CPU0 idle 98.19% of the time.
+   16.54% of all try_to_wake_up() calls were to wake up the local CPU. And
+   the total waittime of tasks on CPU0 is 0.49% of the total runtime of
+   tasks on the same CPU.
+
+   ----------------------------------------------------------------------------------------------------
+   CPU 0
+   ----------------------------------------------------------------------------------------------------
+   DESC                                                                COUNT  PCT_CHANGE
+   ----------------------------------------------------------------------------------------------------
+   sched_yield() count                                         :           0
+   Legacy counter can be ignored                               :           0
+   schedule() called                                           :       17138
+   schedule() left the processor idle                          :       16827  (  98.19% )
+   try_to_wake_up() was called                                 :         508
+   try_to_wake_up() was called to wake up the local cpu        :          84  (  16.54% )
+   total runtime by tasks on this processor (in jiffies)       :  2408959243
+   total waittime by tasks on this processor (in jiffies)      :    11731825  (  0.49% )
+   total timeslices run on this cpu                            :         311
+   ----------------------------------------------------------------------------------------------------
+
+   Next is load balancing statistics. For each of the sched domains
+   (e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
+   the following three categories:
+
+   1) Idle Load Balance: Load balancing performed on behalf of a long
+                         idling CPU by some other CPU.
+   2) Busy Load Balance: Load balancing performed when the CPU was busy.
+   3) New Idle Balance : Load balancing performed when a CPU just became
+                         idle.
+
+   Under each of these three categories, sched stats report provides
+   different load balancing statistics. Along with direct stats, the
+   report also contains derived metrics prefixed with *. Example:
+
+   ----------------------------------------------------------------------------------------------------
+   CPU 0 DOMAIN SMT CPUS <0, 64>
+   ----------------------------------------------------------------------------------------------------
+   DESC                                                                     COUNT     AVG_JIFFIES
+   ----------------------------------------- <Category idle> ------------------------------------------
+   load_balance() count on cpu idle                                 :          50   $      490.74 $
+   load_balance() found balanced on cpu idle                        :          42   $      584.21 $
+   load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
+   imbalance sum on cpu idle                                        :           8
+   pull_task() count on cpu idle                                    :           0
+   pull_task() when target task was cache-hot on cpu idle           :           0
+   load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
+   load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
+   *load_balance() success count on cpu idle                        :           0
+   *avg task pulled per successful lb attempt (cpu idle)            :        0.00
+   ----------------------------------------- <Category busy> ------------------------------------------
+   load_balance() count on cpu busy                                 :           2   $    12268.50 $
+   load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
+   load_balance() move task failed on cpu busy                      :           0   $        0.00 $
+   imbalance sum on cpu busy                                        :           0
+   pull_task() count on cpu busy                                    :           0
+   pull_task() when target task was cache-hot on cpu busy           :           0
+   load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
+   load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
+   *load_balance() success count on cpu busy                        :           0
+   *avg task pulled per successful lb attempt (cpu busy)            :        0.00
+   ---------------------------------------- <Category newidle> ----------------------------------------
+   load_balance() count on cpu newly idle                           :         427   $       57.46 $
+   load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
+   load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
+   imbalance sum on cpu newly idle                                  :          48
+   pull_task() count on cpu newly idle                              :           0
+   pull_task() when target task was cache-hot on cpu newly idle     :           0
+   load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
+   load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
+   *load_balance() success count on cpu newly idle                  :           0
+   *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
+   ----------------------------------------------------------------------------------------------------
+
+   Consider the following line:
+
+   load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $
+
+   While profiling was active, the load balancer ran 382 times on the newly
+   idle CPU 0 and found the load already balanced. The value enclosed in $
+   is the average number of jiffies between two such events
+   (24537 / 382 = 64.23).
+
+   Next are the active_load_balance() stats. active_load_balance() did not
+   trigger while the profiling was active, hence all counters are 0.
+
+   --------------------------------- <Category active_load_balance()> ---------------------------------
+   active_load_balance() count                                      :           0
+   active_load_balance() move task failed                           :           0
+   active_load_balance() successfully moved a task                  :           0
+   ----------------------------------------------------------------------------------------------------
+
+   Next are the sched_balance_exec() and sched_balance_fork() stats. They
+   are currently unused and were kept in the RFC only for legacy reasons.
+   Unless opposed, we plan to remove them in the next revision.
+
+   Next are wakeup statistics. For every domain, the report also shows
+   task-wakeup statistics. Example:
+
+   ------------------------------------------- <Wakeup Info> ------------------------------------------
+   try_to_wake_up() awoke a task that last ran on a diff cpu       :       12070
+   try_to_wake_up() moved task because cache-cold on own cpu       :        3324
+   try_to_wake_up() started passive balancing                      :           0
+   ----------------------------------------------------------------------------------------------------
+
+   The same set of stats is reported for each CPU and each domain level.
+
+   How to interpret the diff
+   ~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   The `perf sched stats diff` also starts with a description of the columns
+   present in the diff. It then shows the elapsed time in jiffies for both
+   runs. The order of the values depends on the order of the input data
+   files. Example:
+
+   ----------------------------------------------------------------------------------------------------
+   Time elapsed (in jiffies)                                        :        2009,       2001
+   ----------------------------------------------------------------------------------------------------
+
+   Below is a sample representing the difference in the cpu and domain stats
+   of two runs. The values enclosed in `|...|` show the percent change
+   between the two, while the `COUNT1` and `COUNT2` columns show the
+   side-by-side representations of the corresponding fields from `perf sched
+   stats report`.
+
+   ----------------------------------------------------------------------------------------------------
+   CPU <ALL CPUS SUMMARY>
+   ----------------------------------------------------------------------------------------------------
+   DESC                                                                    COUNT1      COUNT2  PCT_CHANGE  PCT_CHANGE1 PCT_CHANGE2
+   ----------------------------------------------------------------------------------------------------
+   sched_yield() count                                              :           0,          0  |    0.00% |
+   Legacy counter can be ignored                                    :           0,          0  |    0.00% |
+   schedule() called                                                :      442939,     447305  |    0.99% |
+   schedule() left the processor idle                               :      154012,     174657  |   13.40% |  (   34.77,      39.05 )
+   try_to_wake_up() was called                                      :      306810,     258076  |  -15.88% |
+   try_to_wake_up() was called to wake up the local cpu             :       21313,      14130  |  -33.70% |  (    6.95,       5.48 )
+   total runtime by tasks on this processor (in jiffies)            :  6235330010, 5463133934  |  -12.38% |
+   total waittime by tasks on this processor (in jiffies)           :  8349785693, 5755097654  |  -31.07% |  (  133.91,     105.34 )
+   total timeslices run on this cpu                                 :      288869,     272599  |   -5.63% |
+   ----------------------------------------------------------------------------------------------------
+
+   Below is a sample of the domain stats diff:
+
+   ----------------------------------------------------------------------------------------------------
+   CPU <ALL CPUS SUMMARY>, DOMAIN SMT CPUS <0, 64>
+   ----------------------------------------------------------------------------------------------------
+   DESC                                                                    COUNT1      COUNT2  PCT_CHANGE     AVG_JIFFIES1  AVG_JIFFIES2
+   ----------------------------------------- <Category busy> ------------------------------------------
+   load_balance() count on cpu busy                                 :         154,         80  |  -48.05% |  $       13.05,       25.01 $
+   load_balance() found balanced on cpu busy                        :         120,         66  |  -45.00% |  $       16.74,       30.32 $
+   load_balance() move task failed on cpu busy                      :           0,          4  |    0.00% |  $        0.00,      500.25 $
+   imbalance sum on cpu busy                                        :        1640,        299  |  -81.77% |
+   pull_task() count on cpu busy                                    :          55,         18  |  -67.27% |
+   pull_task() when target task was cache-hot on cpu busy           :           0,          0  |    0.00% |
+   load_balance() failed to find busier queue on cpu busy           :           0,          0  |    0.00% |  $        0.00,        0.00 $
+   load_balance() failed to find busier group on cpu busy           :         120,         66  |  -45.00% |  $       16.74,       30.32 $
+   *load_balance() success count on cpu busy                        :          34,         10  |  -70.59% |
+   *avg task pulled per successful lb attempt (cpu busy)            :        1.62,       1.80  |   11.27% |
+   ----------------------------------------- <Category idle> ------------------------------------------
+   load_balance() count on cpu idle                                 :         299,        481  |   60.87% |  $        6.72,        4.16 $
+   load_balance() found balanced on cpu idle                        :         197,        331  |   68.02% |  $       10.20,        6.05 $
+   load_balance() move task failed on cpu idle                      :           1,          2  |  100.00% |  $     2009.00,     1000.50 $
+   imbalance sum on cpu idle                                        :         145,        222  |   53.10% |
+   pull_task() count on cpu idle                                    :         133,        199  |   49.62% |
+   pull_task() when target task was cache-hot on cpu idle           :           0,          0  |    0.00% |
+   load_balance() failed to find busier queue on cpu idle           :           0,          0  |    0.00% |  $        0.00,        0.00 $
+   load_balance() failed to find busier group on cpu idle           :         197,        331  |   68.02% |  $       10.20,        6.05 $
+   *load_balance() success count on cpu idle                        :         101,        148  |   46.53% |
+   *avg task pulled per successful lb attempt (cpu idle)            :        1.32,       1.34  |    2.11% |
+   ---------------------------------------- <Category newidle> ----------------------------------------
+   load_balance() count on cpu newly idle                           :       21791,      15976  |  -26.69% |  $        0.09,        0.13 $
+   load_balance() found balanced on cpu newly idle                  :       16226,      12125  |  -25.27% |  $        0.12,        0.17 $
+   load_balance() move task failed on cpu newly idle                :         236,         88  |  -62.71% |  $        8.51,       22.74 $
+   imbalance sum on cpu newly idle                                  :        6655,       4628  |  -30.46% |
+   pull_task() count on cpu newly idle                              :        5329,       3763  |  -29.39% |
+   pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |    0.00% |
+   load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |    0.00% |  $        0.00,        0.00 $
+   load_balance() failed to find busier group on cpu newly idle     :       12649,       9914  |  -21.62% |  $        0.16,        0.20 $
+   *load_balance() success count on cpu newly idle                  :        5329,       3763  |  -29.39% |
+   *avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       1.00  |    0.00% |
+   --------------------------------- <Category active_load_balance()> ---------------------------------
+   active_load_balance() count                                      :           0,          0  |    0.00% |
+   active_load_balance() move task failed                           :           0,          0  |    0.00% |
+   active_load_balance() successfully moved a task                  :           0,          0  |    0.00% |
+   --------------------------------- <Category sched_balance_exec()> ----------------------------------
+   sbe_count is not used                                            :           0,          0  |    0.00% |
+   sbe_balanced is not used                                         :           0,          0  |    0.00% |
+   sbe_pushed is not used                                           :           0,          0  |    0.00% |
+   --------------------------------- <Category sched_balance_fork()> ----------------------------------
+   sbf_count is not used                                            :           0,          0  |    0.00% |
+   sbf_balanced is not used                                         :           0,          0  |    0.00% |
+   sbf_pushed is not used                                           :           0,          0  |    0.00% |
+   ------------------------------------------ <Wakeup Info> -------------------------------------------
+   try_to_wake_up() awoke a task that last ran on a diff cpu        :       16606,      10214  |  -38.49% |
+   try_to_wake_up() moved task because cache-cold on own cpu        :        3184,       2534  |  -20.41% |
+   try_to_wake_up() started passive balancing                       :           0,          0  |    0.00% |
+   ----------------------------------------------------------------------------------------------------
+
 OPTIONS
 -------
+Applicable to {record|latency|map|replay|script}
+
 -i::
 --input=<file>::
         Input file name. (default: perf.data unless stdin is a fifo)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/8] perf sched stats: Add record and rawdump support
  2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
@ 2025-03-11 13:10   ` Markus Elfring
  2025-03-11 16:19   ` Markus Elfring
  2025-03-15  2:24   ` Namhyung Kim
  2 siblings, 0 replies; 23+ messages in thread
From: Markus Elfring @ 2025-03-11 13:10 UTC (permalink / raw)
  To: Swapnil Sapkal, Ravi Bangoria, linux-perf-users
  Cc: LKML, Adrian Hunter, Alexander Shishkin, Ananth Narayan,
	Arnaldo Carvalho de Melo, Chen Yu, David Vernet,
	Gautham R. Shenoy, Ian Rogers, Ingo Molnar, James Clark,
	Jiri Olsa, Juri Lelli, Kan Liang, K Prateek Nayak,
	Mark Rutland, Namhyung Kim, Peter Zijlstra, Sandipan Das,
	Santosh Shukla, Shrikanth Hegde, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Yang Jihong

…
> +++ b/tools/perf/builtin-sched.c
> +static int enable_sched_schedstats(int *reset)
> +{
> +	ch = getc(fp);
> +	if (ch == '0') {
> +		*reset = 1;
> +		rewind(fp);
> +		putc('1', fp);
> +		fclose(fp);
> +	}
> +	return 0;
> +}
…

Is the error detection incomplete so far?
https://cwe.mitre.org/data/definitions/252.html
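
For illustration, a minimal sketch of what more complete error detection
could look like for this helper. This is a hypothetical standalone
variant, not the patch's code: it takes the path as a parameter, and it
opens with "r+" so the demo also behaves sensibly on a regular file
(the original opens the procfs knob with "w+").

```c
#include <stdio.h>

/*
 * Hedged sketch: every stdio return value is checked (cf. CWE-252),
 * and the FILE handle is closed on all paths (the original only calls
 * fclose() when the first character is '0').
 */
static int enable_sched_schedstats_checked(const char *path, int *reset)
{
	FILE *fp = fopen(path, "r+");
	int ch, ret = 0;

	if (!fp) {
		fprintf(stderr, "Failed to open %s\n", path);
		return -1;
	}

	ch = getc(fp);
	if (ch == EOF) {
		ret = -1;			/* read failure detected */
	} else if (ch == '0') {
		*reset = 1;
		rewind(fp);
		if (putc('1', fp) == EOF)
			ret = -1;		/* write failure detected */
	}

	if (fclose(fp) == EOF)
		ret = -1;			/* flush/close failure detected */

	return ret;
}
```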

Regards,
Markus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/8] perf sched stats: Add record and rawdump support
  2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
  2025-03-11 13:10   ` Markus Elfring
@ 2025-03-11 16:19   ` Markus Elfring
  2025-03-15  2:24   ` Namhyung Kim
  2 siblings, 0 replies; 23+ messages in thread
From: Markus Elfring @ 2025-03-11 16:19 UTC (permalink / raw)
  To: Swapnil Sapkal, Ravi Bangoria, linux-perf-users
  Cc: LKML, Adrian Hunter, Alexander Shishkin, Ananth Narayan,
	Arnaldo Carvalho de Melo, Chen Yu, David Vernet,
	Gautham R. Shenoy, Ian Rogers, Ingo Molnar, James Clark,
	Jiri Olsa, Juri Lelli, Kan Liang, K Prateek Nayak,
	Mark Rutland, Namhyung Kim, Peter Zijlstra, Sandipan Das,
	Santosh Shukla, Shrikanth Hegde, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Yang Jihong

…
> +++ b/tools/perf/util/event.c
> +size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
> +{
> +	size_t size = 0;
> +
> +	size = fprintf(fp, "\ncpu%u ", cs->cpu);

I find another code variant more succinct.

+	size_t size = fprintf(fp, "\ncpu%u ", cs->cpu);


Would you like to add more complete error detection and corresponding
exception handling?
https://cwe.mitre.org/data/definitions/252.html
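
A sketch combining both points: declare-and-initialize in one step, and
treat a negative fprintf() return as an error rather than adding it into
the running size. The function name is illustrative, not the patch's.

```c
#include <stdio.h>

/* Hedged sketch: fprintf() returns a negative value on output error,
 * which the original would silently add into `size`. */
static size_t fprintf_cpu_prefix(FILE *fp, unsigned int cpu)
{
	int n = fprintf(fp, "\ncpu%u ", cpu);	/* succinct declare + init */

	return n < 0 ? 0 : (size_t)n;		/* report 0 bytes on error */
}
```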

Regards,
Markus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 1/8] perf sched stats: Add record and rawdump support
  2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
  2025-03-11 13:10   ` Markus Elfring
  2025-03-11 16:19   ` Markus Elfring
@ 2025-03-15  2:24   ` Namhyung Kim
  2025-03-17 13:29     ` Sapkal, Swapnil
  2 siblings, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2025-03-15  2:24 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

Hello,

On Tue, Mar 11, 2025 at 12:02:23PM +0000, Swapnil Sapkal wrote:
> Define new, perf tool only, sample types and their layouts. Add logic
> to parse /proc/schedstat, convert it to perf sample format and save
> samples to perf.data file with `perf sched stats record` command. Also
> add logic to read perf.data file, interpret schedstat samples and
> print rawdump of samples with `perf script -D`.
> 
> Note that the /proc/schedstat file output is standardized with a
> version number. This patch supports v15, but older or newer versions
> can be added easily.
> 
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Tested-by: James Clark <james.clark@linaro.org>
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> ---
>  tools/lib/perf/Documentation/libperf.txt    |   2 +
>  tools/lib/perf/Makefile                     |   2 +-
>  tools/lib/perf/include/perf/event.h         |  42 ++++
>  tools/lib/perf/include/perf/schedstat-v15.h |  52 +++++
>  tools/perf/builtin-inject.c                 |   2 +
>  tools/perf/builtin-sched.c                  | 226 +++++++++++++++++-
>  tools/perf/util/event.c                     |  98 ++++++++
>  tools/perf/util/event.h                     |   2 +
>  tools/perf/util/session.c                   |  20 ++
>  tools/perf/util/synthetic-events.c          | 239 ++++++++++++++++++++
>  tools/perf/util/synthetic-events.h          |   3 +
>  tools/perf/util/tool.c                      |  20 ++
>  tools/perf/util/tool.h                      |   4 +-
>  13 files changed, 709 insertions(+), 3 deletions(-)
>  create mode 100644 tools/lib/perf/include/perf/schedstat-v15.h
> 
> diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
> index 59aabdd3cabf..3f295639903d 100644
> --- a/tools/lib/perf/Documentation/libperf.txt
> +++ b/tools/lib/perf/Documentation/libperf.txt
> @@ -210,6 +210,8 @@ SYNOPSIS
>    struct perf_record_time_conv;
>    struct perf_record_header_feature;
>    struct perf_record_compressed;
> +  struct perf_record_schedstat_cpu;
> +  struct perf_record_schedstat_domain;
>  --
>  
>  DESCRIPTION
> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
> index e9a7ac2c062e..4b60804aa0b6 100644
> --- a/tools/lib/perf/Makefile
> +++ b/tools/lib/perf/Makefile
> @@ -174,7 +174,7 @@ install_lib: libs
>  		$(call do_install_mkdir,$(libdir_SQ)); \
>  		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>  
> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h
>  INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>  
>  INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
> index 37bb7771d914..189106874063 100644
> --- a/tools/lib/perf/include/perf/event.h
> +++ b/tools/lib/perf/include/perf/event.h
> @@ -457,6 +457,44 @@ struct perf_record_compressed {
>  	char			 data[];
>  };
>  
> +struct perf_record_schedstat_cpu_v15 {
> +#define CPU_FIELD(_type, _name, _ver)		_type _name
> +#include "schedstat-v15.h"
> +#undef CPU_FIELD
> +};
> +
> +struct perf_record_schedstat_cpu {
> +	struct perf_event_header header;
> +	__u64			 timestamp;
> +	union {
> +		struct perf_record_schedstat_cpu_v15 v15;
> +	};
> +	__u32			 cpu;
> +	__u16			 version;

Why not put these before the union?  The union will have a variable
size once you add different versions, and then it'd be hard to access
the fields placed after it.  You may want to add explicit padding.

> +};
> +
> +struct perf_record_schedstat_domain_v15 {
> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
> +#include "schedstat-v15.h"
> +#undef DOMAIN_FIELD
> +};
> +
> +#define DOMAIN_NAME_LEN		16
> +
> +struct perf_record_schedstat_domain {
> +	struct perf_event_header header;
> +	__u16			 version;
> +	__u64			 timestamp;
> +	__u32			 cpu;
> +	__u16			 domain;

If this carries similar information to schedstat_cpu, I think it's
better to start with the same layout.  Also, having version before
timestamp would add unnecessary padding.


> +	char			 name[DOMAIN_NAME_LEN];
> +	union {
> +		struct perf_record_schedstat_domain_v15 v15;
> +	};
> +	__u16			 nr_cpus;
> +	__u8			 cpu_mask[];

Does cpu_mask represent the domain membership?  Maybe you can split
that info into a separate record or put it in a header feature, like
we have for topology information.


> +};
> +
>  enum perf_user_event_type { /* above any possible kernel type */
>  	PERF_RECORD_USER_TYPE_START		= 64,
>  	PERF_RECORD_HEADER_ATTR			= 64,
> @@ -478,6 +516,8 @@ enum perf_user_event_type { /* above any possible kernel type */
>  	PERF_RECORD_HEADER_FEATURE		= 80,
>  	PERF_RECORD_COMPRESSED			= 81,
>  	PERF_RECORD_FINISHED_INIT		= 82,
> +	PERF_RECORD_SCHEDSTAT_CPU		= 83,
> +	PERF_RECORD_SCHEDSTAT_DOMAIN		= 84,
>  	PERF_RECORD_HEADER_MAX
>  };
>  
> @@ -518,6 +558,8 @@ union perf_event {
>  	struct perf_record_time_conv		time_conv;
>  	struct perf_record_header_feature	feat;
>  	struct perf_record_compressed		pack;
> +	struct perf_record_schedstat_cpu	schedstat_cpu;
> +	struct perf_record_schedstat_domain	schedstat_domain;
>  };
>  
>  #endif /* __LIBPERF_EVENT_H */
> diff --git a/tools/lib/perf/include/perf/schedstat-v15.h b/tools/lib/perf/include/perf/schedstat-v15.h
> new file mode 100644
> index 000000000000..43f8060c5337
> --- /dev/null
> +++ b/tools/lib/perf/include/perf/schedstat-v15.h
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifdef CPU_FIELD
> +CPU_FIELD(__u32, yld_count, v15);
> +CPU_FIELD(__u32, array_exp, v15);
> +CPU_FIELD(__u32, sched_count, v15);
> +CPU_FIELD(__u32, sched_goidle, v15);
> +CPU_FIELD(__u32, ttwu_count, v15);
> +CPU_FIELD(__u32, ttwu_local, v15);
> +CPU_FIELD(__u64, rq_cpu_time, v15);
> +CPU_FIELD(__u64, run_delay, v15);
> +CPU_FIELD(__u64, pcount, v15);
> +#endif
> +
> +#ifdef DOMAIN_FIELD
> +DOMAIN_FIELD(__u32, idle_lb_count, v15);
> +DOMAIN_FIELD(__u32, idle_lb_balanced, v15);
> +DOMAIN_FIELD(__u32, idle_lb_failed, v15);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance, v15);
> +DOMAIN_FIELD(__u32, idle_lb_gained, v15);
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15);
> +DOMAIN_FIELD(__u32, busy_lb_count, v15);
> +DOMAIN_FIELD(__u32, busy_lb_balanced, v15);
> +DOMAIN_FIELD(__u32, busy_lb_failed, v15);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance, v15);
> +DOMAIN_FIELD(__u32, busy_lb_gained, v15);
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_count, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_failed, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_gained, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15);
> +DOMAIN_FIELD(__u32, alb_count, v15);
> +DOMAIN_FIELD(__u32, alb_failed, v15);
> +DOMAIN_FIELD(__u32, alb_pushed, v15);
> +DOMAIN_FIELD(__u32, sbe_count, v15);
> +DOMAIN_FIELD(__u32, sbe_balanced, v15);
> +DOMAIN_FIELD(__u32, sbe_pushed, v15);
> +DOMAIN_FIELD(__u32, sbf_count, v15);
> +DOMAIN_FIELD(__u32, sbf_balanced, v15);
> +DOMAIN_FIELD(__u32, sbf_pushed, v15);
> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v15);
> +DOMAIN_FIELD(__u32, ttwu_move_affine, v15);
> +DOMAIN_FIELD(__u32, ttwu_move_balance, v15);
> +#endif
> diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
> index 11e49cafa3af..af1add2abf72 100644
> --- a/tools/perf/builtin-inject.c
> +++ b/tools/perf/builtin-inject.c
> @@ -2530,6 +2530,8 @@ int cmd_inject(int argc, const char **argv)
>  	inject.tool.finished_init	= perf_event__repipe_op2_synth;
>  	inject.tool.compressed		= perf_event__repipe_op4_synth;
>  	inject.tool.auxtrace		= perf_event__repipe_auxtrace;
> +	inject.tool.schedstat_cpu	= perf_event__repipe_op2_synth;
> +	inject.tool.schedstat_domain	= perf_event__repipe_op2_synth;
>  	inject.tool.dont_split_sample_group = true;
>  	inject.session = __perf_session__new(&data, &inject.tool,
>  					     /*trace_event_repipe=*/inject.output.is_pipe);
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 26ece6e9bfd1..1c3b56013164 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -28,6 +28,8 @@
>  #include "util/debug.h"
>  #include "util/event.h"
>  #include "util/util.h"
> +#include "util/synthetic-events.h"
> +#include "util/target.h"
>  
>  #include <linux/kernel.h>
>  #include <linux/log2.h>
> @@ -55,6 +57,7 @@
>  #define MAX_PRIO		140
>  
>  static const char *cpu_list;
> +static struct perf_cpu_map *user_requested_cpus;

I guess this can be in evlist.


>  static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
>  
>  struct sched_atom;
> @@ -236,6 +239,9 @@ struct perf_sched {
>  	volatile bool   thread_funcs_exit;
>  	const char	*prio_str;
>  	DECLARE_BITMAP(prio_bitmap, MAX_PRIO);
> +
> +	struct perf_session *session;
> +	struct perf_data *data;
>  };
>  
>  /* per thread run time data */
> @@ -3670,6 +3676,199 @@ static void setup_sorting(struct perf_sched *sched, const struct option *options
>  	sort_dimension__add("pid", &sched->cmp_pid);
>  }
>  
> +static int process_synthesized_schedstat_event(const struct perf_tool *tool,
> +					       union perf_event *event,
> +					       struct perf_sample *sample __maybe_unused,
> +					       struct machine *machine __maybe_unused)
> +{
> +	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
> +
> +	if (perf_data__write(sched->data, event, event->header.size) <= 0) {
> +		pr_err("failed to write perf data, error: %m\n");
> +		return -1;
> +	}
> +
> +	sched->session->header.data_size += event->header.size;
> +	return 0;
> +}
> +
> +static void sighandler(int sig __maybe_unused)
> +{
> +}
> +
> +static int enable_sched_schedstats(int *reset)
> +{
> +	char path[PATH_MAX];
> +	FILE *fp;
> +	char ch;
> +
> +	snprintf(path, PATH_MAX, "%s/sys/kernel/sched_schedstats", procfs__mountpoint());
> +	fp = fopen(path, "w+");
> +	if (!fp) {
> +		pr_err("Failed to open %s\n", path);
> +		return -1;
> +	}
> +
> +	ch = getc(fp);
> +	if (ch == '0') {
> +		*reset = 1;
> +		rewind(fp);
> +		putc('1', fp);
> +		fclose(fp);
> +	}
> +	return 0;
> +}
> +
> +static int disable_sched_schedstat(void)
> +{
> +	char path[PATH_MAX];
> +	FILE *fp;
> +
> +	snprintf(path, PATH_MAX, "%s/sys/kernel/sched_schedstats", procfs__mountpoint());
> +	fp = fopen(path, "w");
> +	if (!fp) {
> +		pr_err("Failed to open %s\n", path);
> +		return -1;
> +	}
> +
> +	putc('0', fp);
> +	fclose(fp);
> +	return 0;
> +}
> +
> +/* perf.data or any other output file name used by stats subcommand (only). */
> +const char *output_name;
> +
> +static int perf_sched__schedstat_record(struct perf_sched *sched,
> +					int argc, const char **argv)
> +{
> +	struct perf_session *session;
> +	struct evlist *evlist;
> +	struct target *target;
> +	int reset = 0;
> +	int err = 0;
> +	int fd;
> +	struct perf_data data = {
> +		.path  = output_name,
> +		.mode  = PERF_DATA_MODE_WRITE,
> +	};
> +
> +	signal(SIGINT, sighandler);
> +	signal(SIGCHLD, sighandler);
> +	signal(SIGTERM, sighandler);
> +
> +	evlist = evlist__new();
> +	if (!evlist)
> +		return -ENOMEM;
> +
> +	session = perf_session__new(&data, &sched->tool);
> +	if (IS_ERR(session)) {
> +		pr_err("Perf session creation failed.\n");

Also need evlist__delete().


> +		return PTR_ERR(session);
> +	}
> +
> +	session->evlist = evlist;
> +
> +	sched->session = session;
> +	sched->data = &data;
> +
> +	fd = perf_data__fd(&data);
> +
> +	/*
> +	 * Capture all important metadata about the system. Although they are
> +	 * not used by `perf sched stats` tool directly, they provide useful
> +	 * information about profiled environment.
> +	 */
> +	perf_header__set_feat(&session->header, HEADER_HOSTNAME);
> +	perf_header__set_feat(&session->header, HEADER_OSRELEASE);
> +	perf_header__set_feat(&session->header, HEADER_VERSION);
> +	perf_header__set_feat(&session->header, HEADER_ARCH);
> +	perf_header__set_feat(&session->header, HEADER_NRCPUS);
> +	perf_header__set_feat(&session->header, HEADER_CPUDESC);
> +	perf_header__set_feat(&session->header, HEADER_CPUID);
> +	perf_header__set_feat(&session->header, HEADER_TOTAL_MEM);
> +	perf_header__set_feat(&session->header, HEADER_CMDLINE);
> +	perf_header__set_feat(&session->header, HEADER_CPU_TOPOLOGY);
> +	perf_header__set_feat(&session->header, HEADER_NUMA_TOPOLOGY);
> +	perf_header__set_feat(&session->header, HEADER_CACHE);
> +	perf_header__set_feat(&session->header, HEADER_MEM_TOPOLOGY);
> +	perf_header__set_feat(&session->header, HEADER_CPU_PMU_CAPS);
> +	perf_header__set_feat(&session->header, HEADER_HYBRID_TOPOLOGY);
> +	perf_header__set_feat(&session->header, HEADER_PMU_CAPS);

Probably you don't need {CPU_,}PMU_CAPS.  Also I wonder if it's possible
to add cpu-domain info here.

> +
> +	err = perf_session__write_header(session, evlist, fd, false);
> +	if (err < 0)
> +		goto out;
> +
> +	/*
> +	 * `perf sched stats` does not support workload profiling (-p pid)
> +	 * since /proc/schedstat file contains cpu specific data only. Hence, a
> +	 * profile target is either set of cpus or systemwide, never a process.
> +	 * Note that, although `-- <workload>` is supported, profile data are
> +	 * still cpu/systemwide.
> +	 */
> +	target = zalloc(sizeof(struct target));

It seems there's no need to allocate the target; just putting it on the
stack would be fine.


> +	if (cpu_list)
> +		target->cpu_list = cpu_list;
> +	else
> +		target->system_wide = true;
> +
> +	if (argc) {
> +		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
> +		if (err)
> +			goto out_target;
> +	}
> +
> +	if (cpu_list) {
> +		user_requested_cpus = perf_cpu_map__new(cpu_list);

Where is this freed?


> +		if (!user_requested_cpus)
> +			goto out_target;
> +	}
> +
> +	err = perf_event__synthesize_schedstat(&(sched->tool),
> +					       process_synthesized_schedstat_event,
> +					       user_requested_cpus);
> +	if (err < 0)
> +		goto out_target;
> +
> +	err = enable_sched_schedstats(&reset);
> +	if (err < 0)
> +		goto out_target;
> +
> +	if (argc)
> +		evlist__start_workload(evlist);
> +
> +	/* wait for signal */
> +	pause();
> +
> +	if (reset) {
> +		err = disable_sched_schedstat();
> +		if (err < 0)
> +			goto out_target;
> +	}
> +
> +	err = perf_event__synthesize_schedstat(&(sched->tool),
> +					       process_synthesized_schedstat_event,
> +					       user_requested_cpus);
> +	if (err < 0)
> +		goto out_target;
> +
> +	err = perf_session__write_header(session, evlist, fd, true);
> +
> +out_target:
> +	free(target);
> +out:
> +	if (!err)
> +		fprintf(stderr, "[ perf sched stats: Wrote samples to %s ]\n", data.path);
> +	else
> +		fprintf(stderr, "[ perf sched stats: Failed !! ]\n");
> +
> +	close(fd);
> +	perf_session__delete(session);

It seems session->evlist is deleted only when the data is in read mode.

> +
> +	return err;
> +}
> +
>  static bool schedstat_events_exposed(void)
>  {
>  	/*
> @@ -3846,6 +4045,12 @@ int cmd_sched(int argc, const char **argv)
>  	OPT_BOOLEAN('P', "pre-migrations", &sched.pre_migrations, "Show pre-migration wait time"),
>  	OPT_PARENT(sched_options)
>  	};
> +	const struct option stats_options[] = {
> +	OPT_STRING('o', "output", &output_name, "file",
> +		   "`stats record` with output filename"),
> +	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
> +	OPT_END()
> +	};
>  
>  	const char * const latency_usage[] = {
>  		"perf sched latency [<options>]",
> @@ -3863,9 +4068,13 @@ int cmd_sched(int argc, const char **argv)
>  		"perf sched timehist [<options>]",
>  		NULL
>  	};
> +	const char *stats_usage[] = {
> +		"perf sched stats {record} [<options>]",
> +		NULL
> +	};
>  	const char *const sched_subcommands[] = { "record", "latency", "map",
>  						  "replay", "script",
> -						  "timehist", NULL };
> +						  "timehist", "stats", NULL };
>  	const char *sched_usage[] = {
>  		NULL,
>  		NULL
> @@ -3961,6 +4170,21 @@ int cmd_sched(int argc, const char **argv)
>  			return ret;
>  
>  		return perf_sched__timehist(&sched);
> +	} else if (!strcmp(argv[0], "stats")) {
> +		const char *const stats_subcommands[] = {"record", NULL};
> +
> +		argc = parse_options_subcommand(argc, argv, stats_options,
> +						stats_subcommands,
> +						stats_usage,
> +						PARSE_OPT_STOP_AT_NON_OPTION);
> +
> +		if (argv[0] && !strcmp(argv[0], "record")) {
> +			if (argc)
> +				argc = parse_options(argc, argv, stats_options,
> +						     stats_usage, 0);
> +			return perf_sched__schedstat_record(&sched, argc, argv);
> +		}
> +		usage_with_options(stats_usage, stats_options);
>  	} else {
>  		usage_with_options(sched_usage, sched_options);
>  	}
> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
> index aac96d5d1917..0f863d38abe2 100644
> --- a/tools/perf/util/event.c
> +++ b/tools/perf/util/event.c
> @@ -77,6 +77,8 @@ static const char *perf_event__names[] = {
>  	[PERF_RECORD_HEADER_FEATURE]		= "FEATURE",
>  	[PERF_RECORD_COMPRESSED]		= "COMPRESSED",
>  	[PERF_RECORD_FINISHED_INIT]		= "FINISHED_INIT",
> +	[PERF_RECORD_SCHEDSTAT_CPU]		= "SCHEDSTAT_CPU",
> +	[PERF_RECORD_SCHEDSTAT_DOMAIN]		= "SCHEDSTAT_DOMAIN",
>  };
>  
>  const char *perf_event__name(unsigned int id)
> @@ -550,6 +552,102 @@ size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *ma
>  	return ret;
>  }
>  
> +size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
> +{
> +	struct perf_record_schedstat_cpu *cs = &event->schedstat_cpu;
> +	__u16 version = cs->version;
> +	size_t size = 0;
> +
> +	size = fprintf(fp, "\ncpu%u ", cs->cpu);
> +
> +#define CPU_FIELD(_type, _name, _ver)						\
> +	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)cs->_ver._name)
> +
> +	if (version == 15) {
> +#include <perf/schedstat-v15.h>
> +		return size;
> +	}
> +#undef CPU_FIELD
> +
> +	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
> +		       event->schedstat_cpu.version);
> +}
> +
> +size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
> +{
> +	struct perf_record_schedstat_domain *ds = &event->schedstat_domain;
> +	__u16 version = ds->version;
> +	size_t cpu_mask_len_2;
> +	size_t cpu_mask_len;
> +	size_t size = 0;
> +	char *cpu_mask;
> +	int idx;
> +	int i, j;
> +	bool low;
> +
> +	if (ds->name[0])
> +		size = fprintf(fp, "\ndomain%u:%s ", ds->domain, ds->name);
> +	else
> +		size = fprintf(fp, "\ndomain%u ", ds->domain);
> +
> +	cpu_mask_len = ((ds->nr_cpus + 3) >> 2);
> +	cpu_mask_len_2 = cpu_mask_len + ((cpu_mask_len - 1) / 8);
> +
> +	cpu_mask = zalloc(cpu_mask_len_2 + 1);
> +	if (!cpu_mask)
> +		return fprintf(fp, "Cannot allocate memory for cpumask\n");
> +
> +	idx = ((ds->nr_cpus + 7) >> 3) - 1;
> +
> +	i = cpu_mask_len_2 - 1;
> +
> +	low = true;
> +	j = 1;
> +	while (i >= 0) {
> +		__u8 m;
> +
> +		if (low)
> +			m = ds->cpu_mask[idx] & 0xf;
> +		else
> +			m = (ds->cpu_mask[idx] & 0xf0) >> 4;
> +
> +		if (m >= 0 && m <= 9)
> +			m += '0';
> +		else if (m >= 0xa && m <= 0xf)
> +			m = m + 'a' - 10;
> +		else if (m >= 0xA && m <= 0xF)
> +			m = m + 'A' - 10;
> +
> +		cpu_mask[i] = m;
> +
> +		if (j == 8 && i != 0) {
> +			cpu_mask[i - 1] = ',';
> +			j = 0;
> +			i--;
> +		}
> +
> +		if (!low)
> +			idx--;
> +		low = !low;
> +		i--;
> +		j++;
> +	}
> +	size += fprintf(fp, "%s ", cpu_mask);
> +	free(cpu_mask);
> +
> +#define DOMAIN_FIELD(_type, _name, _ver)					\
> +	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)ds->_ver._name)
> +
> +	if (version == 15) {
> +#include <perf/schedstat-v15.h>
> +		return size;
> +	}
> +#undef DOMAIN_FIELD
> +
> +	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
> +		       event->schedstat_domain.version);
> +}
> +
>  size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp)
>  {
>  	size_t ret = fprintf(fp, "PERF_RECORD_%s",
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index 2744c54f404e..333f2405cd5a 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -361,6 +361,8 @@ size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp);
>  size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
>  size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
>  size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *machine,FILE *fp);
> +size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp);
> +size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp);
>  size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp);
>  
>  int kallsyms__get_function_start(const char *kallsyms_filename,
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index c06e3020a976..bcffee2b7239 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -692,6 +692,20 @@ static void perf_event__time_conv_swap(union perf_event *event,
>  	}
>  }
>  
> +static void
> +perf_event__schedstat_cpu_swap(union perf_event *event __maybe_unused,
> +			       bool sample_id_all __maybe_unused)
> +{
> +	/* FIXME */
> +}
> +
> +static void
> +perf_event__schedstat_domain_swap(union perf_event *event __maybe_unused,
> +				  bool sample_id_all __maybe_unused)
> +{
> +	/* FIXME */
> +}
> +
>  typedef void (*perf_event__swap_op)(union perf_event *event,
>  				    bool sample_id_all);
>  
> @@ -730,6 +744,8 @@ static perf_event__swap_op perf_event__swap_ops[] = {
>  	[PERF_RECORD_STAT_ROUND]	  = perf_event__stat_round_swap,
>  	[PERF_RECORD_EVENT_UPDATE]	  = perf_event__event_update_swap,
>  	[PERF_RECORD_TIME_CONV]		  = perf_event__time_conv_swap,
> +	[PERF_RECORD_SCHEDSTAT_CPU]	  = perf_event__schedstat_cpu_swap,
> +	[PERF_RECORD_SCHEDSTAT_DOMAIN]	  = perf_event__schedstat_domain_swap,
>  	[PERF_RECORD_HEADER_MAX]	  = NULL,
>  };
>  
> @@ -1455,6 +1471,10 @@ static s64 perf_session__process_user_event(struct perf_session *session,
>  		return err;
>  	case PERF_RECORD_FINISHED_INIT:
>  		return tool->finished_init(session, event);
> +	case PERF_RECORD_SCHEDSTAT_CPU:
> +		return tool->schedstat_cpu(session, event);
> +	case PERF_RECORD_SCHEDSTAT_DOMAIN:
> +		return tool->schedstat_domain(session, event);
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index 6923b0d5efed..f928f07bea15 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -2511,3 +2511,242 @@ int parse_synth_opt(char *synth)
>  
>  	return ret;
>  }
> +
> +static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version,
> +						    __u64 *cpu, __u64 timestamp)
> +{
> +	struct perf_record_schedstat_cpu *cs;
> +	union perf_event *event;
> +	size_t size;
> +	char ch;
> +
> +	size = sizeof(struct perf_record_schedstat_cpu);

I think the kernel code prefers sizeof(*cs) instead.


> +	size = PERF_ALIGN(size, sizeof(u64));
> +	event = zalloc(size);

The size is static; do you really need a dynamic allocation?
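
A minimal sketch of the stack-based alternative. The struct is a
stand-in, and since the real code returns the event to its caller, the
sketch assumes a caller-provided buffer instead of heap ownership:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for perf_record_schedstat_cpu. */
struct schedstat_cpu_sketch {
	uint64_t timestamp;
	uint32_t cpu;
	uint16_t version;
};

/* Caller provides the fixed-size buffer, so there is no heap
 * allocation and no free() needed on the error paths. */
static int fill_schedstat_cpu(struct schedstat_cpu_sketch *cs,
			      uint64_t ts, uint32_t cpu)
{
	memset(cs, 0, sizeof(*cs));	/* replaces zalloc() */
	cs->timestamp = ts;
	cs->cpu = cpu;
	cs->version = 15;
	return 0;
}
```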

Thanks,
Namhyung

> +
> +	if (!event)
> +		return NULL;
> +
> +	cs = &event->schedstat_cpu;
> +	cs->header.type = PERF_RECORD_SCHEDSTAT_CPU;
> +	cs->header.size = size;
> +	cs->timestamp = timestamp;
> +
> +	if (io__get_char(io) != 'p' || io__get_char(io) != 'u')
> +		goto out_cpu;
> +
> +	if (io__get_dec(io, (__u64 *)cpu) != ' ')
> +		goto out_cpu;
> +
> +#define CPU_FIELD(_type, _name, _ver)					\
> +	do {								\
> +		__u64 _tmp;						\
> +		ch = io__get_dec(io, &_tmp);				\
> +		if (ch != ' ' && ch != '\n')				\
> +			goto out_cpu;					\
> +		cs->_ver._name = _tmp;					\
> +	} while (0)
> +
> +	if (version == 15) {
> +#include <perf/schedstat-v15.h>
> +	}
> +#undef CPU_FIELD
> +
> +	cs->cpu = *cpu;
> +	cs->version = version;
> +
> +	return event;
> +out_cpu:
> +	free(event);
> +	return NULL;
> +}
> +
> +static size_t schedstat_sanitize_cpumask(char *cpu_mask, size_t cpu_mask_len)
> +{
> +	char *dst = cpu_mask;
> +	char *src = cpu_mask;
> +	int i = 0;
> +
> +	for ( ; src < cpu_mask + cpu_mask_len; dst++, src++) {
> +		while (*src == ',')
> +			src++;
> +
> +		*dst = *src;
> +	}
> +
> +	for ( ; dst < src; dst++, i++)
> +		*dst = '\0';
> +
> +	return cpu_mask_len - i;
> +}
> +
> +static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 version,
> +						       __u64 cpu, __u64 timestamp)
> +{
> +	struct perf_env env = { .total_mem = 0, };
> +	int nr_cpus_avail = perf_env__nr_cpus_avail(&env);
> +	struct perf_record_schedstat_domain *ds;
> +	union perf_event *event;
> +	char *d_name = NULL;
> +	size_t cpu_mask_len = 0;
> +	char *cpu_mask = NULL;
> +	__u64 d_num;
> +	size_t size;
> +	int i = 0;
> +	bool low;
> +	char ch;
> +	int idx;
> +
> +	if (io__get_char(io) != 'o' || io__get_char(io) != 'm' || io__get_char(io) != 'a' ||
> +	    io__get_char(io) != 'i' || io__get_char(io) != 'n')
> +		return NULL;
> +
> +	ch = io__get_dec(io, &d_num);
> +
> +	if (io__getdelim(io, &cpu_mask, &cpu_mask_len, ' ') < 0 || !cpu_mask_len)
> +		goto out;
> +
> +	cpu_mask[cpu_mask_len - 1] = '\0';
> +	cpu_mask_len--;
> +	cpu_mask_len = schedstat_sanitize_cpumask(cpu_mask, cpu_mask_len);
> +
> +	size = sizeof(struct perf_record_schedstat_domain) + ((nr_cpus_avail + 7) >> 3);
> +	size = PERF_ALIGN(size, sizeof(u64));
> +	event = zalloc(size);
> +
> +	if (!event)
> +		goto out_cpu_mask;
> +
> +	ds = &event->schedstat_domain;
> +	ds->header.type = PERF_RECORD_SCHEDSTAT_DOMAIN;
> +	ds->header.size = size;
> +	ds->version = version;
> +	ds->timestamp = timestamp;
> +	if (d_name)
> +		strncpy(ds->name, d_name, DOMAIN_NAME_LEN - 1);
> +	ds->domain = d_num;
> +	ds->nr_cpus = nr_cpus_avail;
> +
> +	idx = ((nr_cpus_avail + 7) >> 3) - 1;
> +	low = true;
> +	for (i = cpu_mask_len - 1; i >= 0 && idx >= 0; i--) {
> +		char mask = cpu_mask[i];
> +
> +		if (mask >= '0' && mask <= '9')
> +			mask -= '0';
> +		else if (mask >= 'a' && mask <= 'f')
> +			mask = mask - 'a' + 10;
> +		else if (mask >= 'A' && mask <= 'F')
> +			mask = mask - 'A' + 10;
> +
> +		if (low) {
> +			ds->cpu_mask[idx] = mask;
> +		} else {
> +			ds->cpu_mask[idx] |= (mask << 4);
> +			idx--;
> +		}
> +		low = !low;
> +	}
> +
> +	free(cpu_mask);
> +
> +#define DOMAIN_FIELD(_type, _name, _ver)				\
> +	do {								\
> +		__u64 _tmp;						\
> +		ch = io__get_dec(io, &_tmp);				\
> +		if (ch != ' ' && ch != '\n')				\
> +			goto out_domain;				\
> +		ds->_ver._name = _tmp;					\
> +	} while (0)
> +
> +	if (version == 15) {
> +#include <perf/schedstat-v15.h>
> +	}
> +#undef DOMAIN_FIELD
> +
> +	ds->cpu = cpu;
> +	return event;
> +
> +out_domain:
> +	free(event);
> +out_cpu_mask:
> +	free(cpu_mask);
> +out:
> +	return NULL;
> +}
> +
> +int perf_event__synthesize_schedstat(const struct perf_tool *tool,
> +				     perf_event__handler_t process,
> +				     struct perf_cpu_map *user_requested_cpus)
> +{
> +	char *line = NULL, path[PATH_MAX];
> +	union perf_event *event = NULL;
> +	size_t line_len = 0;
> +	char bf[BUFSIZ];
> +	__u64 timestamp;
> +	__u64 cpu = -1;
> +	__u16 version;
> +	struct io io;
> +	int ret = -1;
> +	char ch;
> +
> +	snprintf(path, PATH_MAX, "%s/schedstat", procfs__mountpoint());
> +	io.fd = open(path, O_RDONLY, 0);
> +	if (io.fd < 0) {
> +		pr_err("Failed to open %s. Possibly CONFIG_SCHEDSTATS is disabled.\n", path);
> +		return -1;
> +	}
> +	io__init(&io, io.fd, bf, sizeof(bf));
> +
> +	if (io__getline(&io, &line, &line_len) < 0 || !line_len)
> +		goto out;
> +
> +	if (!strcmp(line, "version 15\n")) {
> +		version = 15;
> +	} else {
> +		pr_err("Unsupported %s version: %s", path, line + 8);
> +		goto out_free_line;
> +	}
> +
> +	if (io__getline(&io, &line, &line_len) < 0 || !line_len)
> +		goto out_free_line;
> +	timestamp = atol(line + 10);
> +
> +	/*
> +	 * FIXME: Can be optimized a bit by not synthesizing domain samples
> +	 * for filtered out cpus.
> +	 */
> +	for (ch = io__get_char(&io); !io.eof; ch = io__get_char(&io)) {
> +		struct perf_cpu this_cpu;
> +
> +		if (ch == 'c') {
> +			event = __synthesize_schedstat_cpu(&io, version,
> +							   &cpu, timestamp);
> +		} else if (ch == 'd') {
> +			event = __synthesize_schedstat_domain(&io, version,
> +							      cpu, timestamp);
> +		}
> +		if (!event)
> +			goto out_free_line;
> +
> +		this_cpu.cpu = cpu;
> +
> +		if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu)) {
> +			free(event);
> +			continue;
> +		}
> +
> +		if (process(tool, event, NULL, NULL) < 0) {
> +			free(event);
> +			goto out_free_line;
> +		}
> +
> +		free(event);
> +	}
> +
> +	ret = 0;
> +
> +out_free_line:
> +	free(line);
> +out:
> +	close(io.fd);
> +	return ret;
> +}
> diff --git a/tools/perf/util/synthetic-events.h b/tools/perf/util/synthetic-events.h
> index b9c936b5cfeb..eab914c238df 100644
> --- a/tools/perf/util/synthetic-events.h
> +++ b/tools/perf/util/synthetic-events.h
> @@ -141,4 +141,7 @@ int perf_event__synthesize_for_pipe(const struct perf_tool *tool,
>  				    struct perf_data *data,
>  				    perf_event__handler_t process);
>  
> +int perf_event__synthesize_schedstat(const struct perf_tool *tool,
> +				     perf_event__handler_t process,
> +				     struct perf_cpu_map *user_requested_cpu);
>  #endif // __PERF_SYNTHETIC_EVENTS_H
> diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
> index 3b7f390f26eb..9f81d720735f 100644
> --- a/tools/perf/util/tool.c
> +++ b/tools/perf/util/tool.c
> @@ -230,6 +230,24 @@ static int perf_session__process_compressed_event_stub(struct perf_session *sess
>  	return 0;
>  }
>  
> +static int process_schedstat_cpu_stub(struct perf_session *perf_session __maybe_unused,
> +				      union perf_event *event)
> +{
> +	if (dump_trace)
> +		perf_event__fprintf_schedstat_cpu(event, stdout);
> +	dump_printf(": unhandled!\n");
> +	return 0;
> +}
> +
> +static int process_schedstat_domain_stub(struct perf_session *perf_session __maybe_unused,
> +					 union perf_event *event)
> +{
> +	if (dump_trace)
> +		perf_event__fprintf_schedstat_domain(event, stdout);
> +	dump_printf(": unhandled!\n");
> +	return 0;
> +}
> +
>  void perf_tool__init(struct perf_tool *tool, bool ordered_events)
>  {
>  	tool->ordered_events = ordered_events;
> @@ -286,6 +304,8 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
>  	tool->compressed = perf_session__process_compressed_event_stub;
>  #endif
>  	tool->finished_init = process_event_op2_stub;
> +	tool->schedstat_cpu = process_schedstat_cpu_stub;
> +	tool->schedstat_domain = process_schedstat_domain_stub;
>  }
>  
>  bool perf_tool__compressed_is_stub(const struct perf_tool *tool)
> diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
> index db1c7642b0d1..d289a5396b01 100644
> --- a/tools/perf/util/tool.h
> +++ b/tools/perf/util/tool.h
> @@ -77,7 +77,9 @@ struct perf_tool {
>  			stat,
>  			stat_round,
>  			feature,
> -			finished_init;
> +			finished_init,
> +			schedstat_cpu,
> +			schedstat_domain;
>  	event_op4	compressed;
>  	event_op3	auxtrace;
>  	bool		ordered_events;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/8] perf sched stats: Add schedstat v17 support
  2025-03-11 12:02 ` [PATCH v3 3/8] perf sched stats: Add schedstat v17 support Swapnil Sapkal
@ 2025-03-15  2:27   ` Namhyung Kim
  2025-03-17 13:32     ` Sapkal, Swapnil
  0 siblings, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2025-03-15  2:27 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das

On Tue, Mar 11, 2025 at 12:02:25PM +0000, Swapnil Sapkal wrote:
> The /proc/schedstat file output is standardized with a version number.
> Add support to record and raw-dump the v17 version layout.
> 
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> ---
>  tools/lib/perf/Makefile                     |  2 +-
>  tools/lib/perf/include/perf/event.h         | 14 +++++
>  tools/lib/perf/include/perf/schedstat-v17.h | 61 +++++++++++++++++++++
>  tools/perf/util/event.c                     |  6 ++
>  tools/perf/util/synthetic-events.c          | 15 +++++
>  5 files changed, 97 insertions(+), 1 deletion(-)
>  create mode 100644 tools/lib/perf/include/perf/schedstat-v17.h
> 
> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
> index d0506a13a97f..30712ce8b6b1 100644
> --- a/tools/lib/perf/Makefile
> +++ b/tools/lib/perf/Makefile
> @@ -174,7 +174,7 @@ install_lib: libs
>  		$(call do_install_mkdir,$(libdir_SQ)); \
>  		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>  
> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h
> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h schedstat-v17.h

Please put them on a separate line, like:

HDRS += schedstat-v15.h schedstat-v16.h schedstat-v17.h

Thanks,
Namhyung


>  INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>  
>  INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
> index 8ef70799e070..0d1983ad9a41 100644
> --- a/tools/lib/perf/include/perf/event.h
> +++ b/tools/lib/perf/include/perf/event.h
> @@ -469,12 +469,19 @@ struct perf_record_schedstat_cpu_v16 {
>  #undef CPU_FIELD
>  };
>  
> +struct perf_record_schedstat_cpu_v17 {
> +#define CPU_FIELD(_type, _name, _ver)		_type _name
> +#include "schedstat-v17.h"
> +#undef CPU_FIELD
> +};
> +
>  struct perf_record_schedstat_cpu {
>  	struct perf_event_header header;
>  	__u64			 timestamp;
>  	union {
>  		struct perf_record_schedstat_cpu_v15 v15;
>  		struct perf_record_schedstat_cpu_v16 v16;
> +		struct perf_record_schedstat_cpu_v17 v17;
>  	};
>  	__u32			 cpu;
>  	__u16			 version;
> @@ -492,6 +499,12 @@ struct perf_record_schedstat_domain_v16 {
>  #undef DOMAIN_FIELD
>  };
>  
> +struct perf_record_schedstat_domain_v17 {
> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
> +#include "schedstat-v17.h"
> +#undef DOMAIN_FIELD
> +};
> +
>  #define DOMAIN_NAME_LEN		16
>  
>  struct perf_record_schedstat_domain {
> @@ -504,6 +517,7 @@ struct perf_record_schedstat_domain {
>  	union {
>  		struct perf_record_schedstat_domain_v15 v15;
>  		struct perf_record_schedstat_domain_v16 v16;
> +		struct perf_record_schedstat_domain_v17 v17;
>  	};
>  	__u16			 nr_cpus;
>  	__u8			 cpu_mask[];
> diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
> new file mode 100644
> index 000000000000..851d4f1f4ecb
> --- /dev/null
> +++ b/tools/lib/perf/include/perf/schedstat-v17.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifdef CPU_FIELD
> +CPU_FIELD(__u32, yld_count, v17);
> +CPU_FIELD(__u32, array_exp, v17);
> +CPU_FIELD(__u32, sched_count, v17);
> +CPU_FIELD(__u32, sched_goidle, v17);
> +CPU_FIELD(__u32, ttwu_count, v17);
> +CPU_FIELD(__u32, ttwu_local, v17);
> +CPU_FIELD(__u64, rq_cpu_time, v17);
> +CPU_FIELD(__u64, run_delay, v17);
> +CPU_FIELD(__u64, pcount, v17);
> +#endif
> +
> +#ifdef DOMAIN_FIELD
> +DOMAIN_FIELD(__u32, busy_lb_count, v17);
> +DOMAIN_FIELD(__u32, busy_lb_balanced, v17);
> +DOMAIN_FIELD(__u32, busy_lb_failed, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_load, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_util, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_task, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit, v17);
> +DOMAIN_FIELD(__u32, busy_lb_gained, v17);
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v17);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v17);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v17);
> +DOMAIN_FIELD(__u32, idle_lb_count, v17);
> +DOMAIN_FIELD(__u32, idle_lb_balanced, v17);
> +DOMAIN_FIELD(__u32, idle_lb_failed, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_load, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_util, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_task, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit, v17);
> +DOMAIN_FIELD(__u32, idle_lb_gained, v17);
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v17);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v17);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_count, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_failed, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_load, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_util, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_task, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_gained, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v17);
> +DOMAIN_FIELD(__u32, alb_count, v17);
> +DOMAIN_FIELD(__u32, alb_failed, v17);
> +DOMAIN_FIELD(__u32, alb_pushed, v17);
> +DOMAIN_FIELD(__u32, sbe_count, v17);
> +DOMAIN_FIELD(__u32, sbe_balanced, v17);
> +DOMAIN_FIELD(__u32, sbe_pushed, v17);
> +DOMAIN_FIELD(__u32, sbf_count, v17);
> +DOMAIN_FIELD(__u32, sbf_balanced, v17);
> +DOMAIN_FIELD(__u32, sbf_pushed, v17);
> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v17);
> +DOMAIN_FIELD(__u32, ttwu_move_affine, v17);
> +DOMAIN_FIELD(__u32, ttwu_move_balance, v17);
> +#endif
> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
> index 64f81e7b7f70..d09c3c99ab48 100644
> --- a/tools/perf/util/event.c
> +++ b/tools/perf/util/event.c
> @@ -569,6 +569,9 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
>  	} else if (version == 16) {
>  #include <perf/schedstat-v16.h>
>  		return size;
> +	} else if (version == 17) {
> +#include <perf/schedstat-v17.h>
> +		return size;
>  	}
>  #undef CPU_FIELD
>  
> @@ -647,6 +650,9 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
>  	} else if (version == 16) {
>  #include <perf/schedstat-v16.h>
>  		return size;
> +	} else if (version == 17) {
> +#include <perf/schedstat-v17.h>
> +		return size;
>  	}
>  #undef DOMAIN_FIELD
>  
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index e9dc1e14cfea..fad0c472f297 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -2551,6 +2551,8 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
>  #include <perf/schedstat-v15.h>
>  	} else if (version == 16) {
>  #include <perf/schedstat-v16.h>
> +	} else if (version == 17) {
> +#include <perf/schedstat-v17.h>
>  	}
>  #undef CPU_FIELD
>  
> @@ -2589,6 +2591,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  	int nr_cpus_avail = perf_env__nr_cpus_avail(&env);
>  	struct perf_record_schedstat_domain *ds;
>  	union perf_event *event;
> +	size_t d_name_len = 0;
>  	char *d_name = NULL;
>  	size_t cpu_mask_len = 0;
>  	char *cpu_mask = NULL;
> @@ -2604,6 +2607,12 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  		return NULL;
>  
>  	ch = io__get_dec(io, &d_num);
> +	if (version >= 17) {
> +		if (io__getdelim(io, &d_name, &d_name_len, ' ') < 0 || !d_name_len)
> +			return NULL;
> +		d_name[d_name_len - 1] = '\0';
> +		d_name_len--;
> +	}
>  
>  	if (io__getdelim(io, &cpu_mask, &cpu_mask_len, ' ') < 0 || !cpu_mask_len)
>  		goto out;
> @@ -2650,6 +2659,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  		low = !low;
>  	}
>  
> +	free(d_name);
>  	free(cpu_mask);
>  
>  #define DOMAIN_FIELD(_type, _name, _ver)				\
> @@ -2665,6 +2675,8 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  #include <perf/schedstat-v15.h>
>  	} else if (version == 16) {
>  #include <perf/schedstat-v16.h>
> +	} else if (version == 17) {
> +#include <perf/schedstat-v17.h>
>  	}
>  #undef DOMAIN_FIELD
>  
> @@ -2676,6 +2688,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  out_cpu_mask:
>  	free(cpu_mask);
>  out:
> +	free(d_name);
>  	return NULL;
>  }
>  
> @@ -2709,6 +2722,8 @@ int perf_event__synthesize_schedstat(const struct perf_tool *tool,
>  		version = 15;
>  	} else if (!strcmp(line, "version 16\n")) {
>  		version = 16;
> +	} else if (!strcmp(line, "version 17\n")) {
> +		version = 17;
>  	} else {
>  		pr_err("Unsupported %s version: %s", path, line + 8);
>  		goto out_free_line;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/8] perf sched stats: Add support for report subcommand
  2025-03-11 12:02 ` [PATCH v3 4/8] perf sched stats: Add support for report subcommand Swapnil Sapkal
@ 2025-03-15  4:39   ` Namhyung Kim
  2025-03-18 11:08     ` Sapkal, Swapnil
  2025-05-20 10:36   ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2025-03-15  4:39 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

On Tue, Mar 11, 2025 at 12:02:26PM +0000, Swapnil Sapkal wrote:
> `perf sched stats record` captures two sets of samples. For a workload
> profile, the first set is taken right before the workload starts and the
> second right after it finishes. For a system-wide profile, the first set
> is taken at the beginning of profiling and the second on receiving the
> SIGINT signal.
> 
> Add a `perf sched stats report` subcommand that reads both sets of
> samples, computes the diff and renders a final report. The final report
> prints scheduler stats at CPU granularity as well as at sched domain
> granularity.
> 
> Example usage:
> 
>   # perf sched stats record
>   # perf sched stats report
> 
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Tested-by: James Clark <james.clark@linaro.org>
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> ---
>  tools/lib/perf/include/perf/event.h         |  12 +-
>  tools/lib/perf/include/perf/schedstat-v15.h | 180 +++++--
>  tools/lib/perf/include/perf/schedstat-v16.h | 182 +++++--
>  tools/lib/perf/include/perf/schedstat-v17.h | 209 +++++---
>  tools/perf/builtin-sched.c                  | 504 +++++++++++++++++++-
>  tools/perf/util/event.c                     |   4 +-
>  tools/perf/util/synthetic-events.c          |   4 +-
>  7 files changed, 938 insertions(+), 157 deletions(-)
> 
> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
> index 0d1983ad9a41..5e2c56c9b038 100644
> --- a/tools/lib/perf/include/perf/event.h
> +++ b/tools/lib/perf/include/perf/event.h
> @@ -458,19 +458,19 @@ struct perf_record_compressed {
>  };
>  
>  struct perf_record_schedstat_cpu_v15 {
> -#define CPU_FIELD(_type, _name, _ver)		_type _name
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
>  #include "schedstat-v15.h"
>  #undef CPU_FIELD
>  };
>  
>  struct perf_record_schedstat_cpu_v16 {
> -#define CPU_FIELD(_type, _name, _ver)		_type _name
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
>  #include "schedstat-v16.h"
>  #undef CPU_FIELD
>  };
>  
>  struct perf_record_schedstat_cpu_v17 {
> -#define CPU_FIELD(_type, _name, _ver)		_type _name
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
>  #include "schedstat-v17.h"
>  #undef CPU_FIELD
>  };
> @@ -488,19 +488,19 @@ struct perf_record_schedstat_cpu {
>  };
>  
>  struct perf_record_schedstat_domain_v15 {
> -#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
>  #include "schedstat-v15.h"
>  #undef DOMAIN_FIELD
>  };
>  
>  struct perf_record_schedstat_domain_v16 {
> -#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
>  #include "schedstat-v16.h"
>  #undef DOMAIN_FIELD
>  };
>  
>  struct perf_record_schedstat_domain_v17 {
> -#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
>  #include "schedstat-v17.h"
>  #undef DOMAIN_FIELD
>  };
> diff --git a/tools/lib/perf/include/perf/schedstat-v15.h b/tools/lib/perf/include/perf/schedstat-v15.h
> index 43f8060c5337..011411ac0f7e 100644
> --- a/tools/lib/perf/include/perf/schedstat-v15.h
> +++ b/tools/lib/perf/include/perf/schedstat-v15.h
> @@ -1,52 +1,142 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  
>  #ifdef CPU_FIELD
> -CPU_FIELD(__u32, yld_count, v15);
> -CPU_FIELD(__u32, array_exp, v15);
> -CPU_FIELD(__u32, sched_count, v15);
> -CPU_FIELD(__u32, sched_goidle, v15);
> -CPU_FIELD(__u32, ttwu_count, v15);
> -CPU_FIELD(__u32, ttwu_local, v15);
> -CPU_FIELD(__u64, rq_cpu_time, v15);
> -CPU_FIELD(__u64, run_delay, v15);
> -CPU_FIELD(__u64, pcount, v15);
> +CPU_FIELD(__u32, yld_count, "sched_yield() count",
> +	  "%11u", false, yld_count, v15);
> +CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
> +	  "%11u", false, array_exp, v15);
> +CPU_FIELD(__u32, sched_count, "schedule() called",
> +	  "%11u", false, sched_count, v15);
> +CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
> +	  "%11u", true, sched_count, v15);
> +CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
> +	  "%11u", false, ttwu_count, v15);
> +CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
> +	  "%11u", true, ttwu_count, v15);
> +CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
> +	  "%11llu", false, rq_cpu_time, v15);
> +CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
> +	  "%11llu", true, rq_cpu_time, v15);
> +CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
> +	  "%11llu", false, pcount, v15);
>  #endif
>  
>  #ifdef DOMAIN_FIELD
> -DOMAIN_FIELD(__u32, idle_lb_count, v15);
> -DOMAIN_FIELD(__u32, idle_lb_balanced, v15);
> -DOMAIN_FIELD(__u32, idle_lb_failed, v15);
> -DOMAIN_FIELD(__u32, idle_lb_imbalance, v15);
> -DOMAIN_FIELD(__u32, idle_lb_gained, v15);
> -DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15);
> -DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15);
> -DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15);
> -DOMAIN_FIELD(__u32, busy_lb_count, v15);
> -DOMAIN_FIELD(__u32, busy_lb_balanced, v15);
> -DOMAIN_FIELD(__u32, busy_lb_failed, v15);
> -DOMAIN_FIELD(__u32, busy_lb_imbalance, v15);
> -DOMAIN_FIELD(__u32, busy_lb_gained, v15);
> -DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15);
> -DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15);
> -DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_count, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_balanced, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_failed, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_gained, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15);
> -DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15);
> -DOMAIN_FIELD(__u32, alb_count, v15);
> -DOMAIN_FIELD(__u32, alb_failed, v15);
> -DOMAIN_FIELD(__u32, alb_pushed, v15);
> -DOMAIN_FIELD(__u32, sbe_count, v15);
> -DOMAIN_FIELD(__u32, sbe_balanced, v15);
> -DOMAIN_FIELD(__u32, sbe_pushed, v15);
> -DOMAIN_FIELD(__u32, sbf_count, v15);
> -DOMAIN_FIELD(__u32, sbf_balanced, v15);
> -DOMAIN_FIELD(__u32, sbf_pushed, v15);
> -DOMAIN_FIELD(__u32, ttwu_wake_remote, v15);
> -DOMAIN_FIELD(__u32, ttwu_move_affine, v15);
> -DOMAIN_FIELD(__u32, ttwu_move_balance, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category idle> ");
>  #endif
> +DOMAIN_FIELD(__u32, idle_lb_count,
> +	     "load_balance() count on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_balanced,
> +	     "load_balance() found balanced on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_failed,
> +	     "load_balance() move task failed on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance,
> +	     "imbalance sum on cpu idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, idle_lb_gained,
> +	     "pull_task() count on cpu idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v15);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v15);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v15);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category busy> ");
> +#endif
> +DOMAIN_FIELD(__u32, busy_lb_count,
> +	     "load_balance() count on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_balanced,
> +	     "load_balance() found balanced on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_failed,
> +	     "load_balance() move task failed on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance,
> +	     "imbalance sum on cpu busy", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, busy_lb_gained,
> +	     "pull_task() count on cpu busy", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v15);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v15);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v15);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category newidle> ");
> +#endif
> +DOMAIN_FIELD(__u32, newidle_lb_count,
> +	     "load_balance() count on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_failed,
> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance,
> +	     "imbalance sum on cpu newly idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_gained,
> +	     "pull_task() count on cpu newly idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v15);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v15);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v15);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
> +#endif
> +DOMAIN_FIELD(__u32, alb_count,
> +	     "active_load_balance() count", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, alb_failed,
> +	     "active_load_balance() move task failed", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, alb_pushed,
> +	     "active_load_balance() successfully moved a task", "%11u", false, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbe_count,
> +	     "sbe_count is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbe_balanced,
> +	     "sbe_balanced is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbe_pushed,
> +	     "sbe_pushed is not used", "%11u", false, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbf_count,
> +	     "sbf_count is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbf_balanced,
> +	     "sbf_balanced is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbf_pushed,
> +	     "sbf_pushed is not used", "%11u", false, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Wakeup Info> ");
> +#endif
> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, ttwu_move_affine,
> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, ttwu_move_balance,
> +	     "try_to_wake_up() started passive balancing", "%11u", false, v15);
> +#endif /* DOMAIN_FIELD */
> diff --git a/tools/lib/perf/include/perf/schedstat-v16.h b/tools/lib/perf/include/perf/schedstat-v16.h
> index d6a4691b2fd5..5ba53bd7d61a 100644
> --- a/tools/lib/perf/include/perf/schedstat-v16.h
> +++ b/tools/lib/perf/include/perf/schedstat-v16.h
> @@ -1,52 +1,142 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  
>  #ifdef CPU_FIELD
> -CPU_FIELD(__u32, yld_count, v16);
> -CPU_FIELD(__u32, array_exp, v16);
> -CPU_FIELD(__u32, sched_count, v16);
> -CPU_FIELD(__u32, sched_goidle, v16);
> -CPU_FIELD(__u32, ttwu_count, v16);
> -CPU_FIELD(__u32, ttwu_local, v16);
> -CPU_FIELD(__u64, rq_cpu_time, v16);
> -CPU_FIELD(__u64, run_delay, v16);
> -CPU_FIELD(__u64, pcount, v16);
> -#endif
> +CPU_FIELD(__u32, yld_count, "sched_yield() count",
> +	  "%11u", false, yld_count, v16);
> +CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
> +	  "%11u", false, array_exp, v16);
> +CPU_FIELD(__u32, sched_count, "schedule() called",
> +	  "%11u", false, sched_count, v16);
> +CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
> +	  "%11u", true, sched_count, v16);
> +CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
> +	  "%11u", false, ttwu_count, v16);
> +CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
> +	  "%11u", true, ttwu_count, v16);
> +CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
> +	  "%11llu", false, rq_cpu_time, v16);
> +CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
> +	  "%11llu", true, rq_cpu_time, v16);
> +CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
> +	  "%11llu", false, pcount, v16);
> +#endif /* CPU_FIELD */
>  
>  #ifdef DOMAIN_FIELD
> -DOMAIN_FIELD(__u32, busy_lb_count, v16);
> -DOMAIN_FIELD(__u32, busy_lb_balanced, v16);
> -DOMAIN_FIELD(__u32, busy_lb_failed, v16);
> -DOMAIN_FIELD(__u32, busy_lb_imbalance, v16);
> -DOMAIN_FIELD(__u32, busy_lb_gained, v16);
> -DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16);
> -DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16);
> -DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16);
> -DOMAIN_FIELD(__u32, idle_lb_count, v16);
> -DOMAIN_FIELD(__u32, idle_lb_balanced, v16);
> -DOMAIN_FIELD(__u32, idle_lb_failed, v16);
> -DOMAIN_FIELD(__u32, idle_lb_imbalance, v16);
> -DOMAIN_FIELD(__u32, idle_lb_gained, v16);
> -DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16);
> -DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16);
> -DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_count, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_balanced, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_failed, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_gained, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16);
> -DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16);
> -DOMAIN_FIELD(__u32, alb_count, v16);
> -DOMAIN_FIELD(__u32, alb_failed, v16);
> -DOMAIN_FIELD(__u32, alb_pushed, v16);
> -DOMAIN_FIELD(__u32, sbe_count, v16);
> -DOMAIN_FIELD(__u32, sbe_balanced, v16);
> -DOMAIN_FIELD(__u32, sbe_pushed, v16);
> -DOMAIN_FIELD(__u32, sbf_count, v16);
> -DOMAIN_FIELD(__u32, sbf_balanced, v16);
> -DOMAIN_FIELD(__u32, sbf_pushed, v16);
> -DOMAIN_FIELD(__u32, ttwu_wake_remote, v16);
> -DOMAIN_FIELD(__u32, ttwu_move_affine, v16);
> -DOMAIN_FIELD(__u32, ttwu_move_balance, v16);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category busy> ");
> +#endif
> +DOMAIN_FIELD(__u32, busy_lb_count,
> +	     "load_balance() count on cpu busy", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, busy_lb_balanced,
> +	     "load_balance() found balanced on cpu busy", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, busy_lb_failed,
> +	     "load_balance() move task failed on cpu busy", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance,
> +	     "imbalance sum on cpu busy", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, busy_lb_gained,
> +	     "pull_task() count on cpu busy", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v16);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v16);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v16);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category idle> ");
> +#endif
> +DOMAIN_FIELD(__u32, idle_lb_count,
> +	     "load_balance() count on cpu idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, idle_lb_balanced,
> +	     "load_balance() found balanced on cpu idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, idle_lb_failed,
> +	     "load_balance() move task failed on cpu idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance,
> +	     "imbalance sum on cpu idle", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, idle_lb_gained,
> +	     "pull_task() count on cpu idle", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v16);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v16);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v16);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category newidle> ");
> +#endif
> +DOMAIN_FIELD(__u32, newidle_lb_count,
> +	     "load_balance() count on cpu newly idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_failed,
> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance,
> +	     "imbalance sum on cpu newly idle", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_gained,
> +	     "pull_task() count on cpu newly idle", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v16);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v16);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v16);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v16);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
> +#endif
> +DOMAIN_FIELD(__u32, alb_count,
> +	     "active_load_balance() count", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, alb_failed,
> +	     "active_load_balance() move task failed", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, alb_pushed,
> +	     "active_load_balance() successfully moved a task", "%11u", false, v16);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbe_count,
> +	     "sbe_count is not used", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, sbe_balanced,
> +	     "sbe_balanced is not used", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, sbe_pushed,
> +	     "sbe_pushed is not used", "%11u", false, v16);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbf_count,
> +	     "sbf_count is not used", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, sbf_balanced,
> +	     "sbf_balanced is not used", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, sbf_pushed,
> +	     "sbf_pushed is not used", "%11u", false, v16);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Wakeup Info> ");
>  #endif
> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, ttwu_move_affine,
> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v16);
> +DOMAIN_FIELD(__u32, ttwu_move_balance,
> +	     "try_to_wake_up() started passive balancing", "%11u", false, v16);
> +#endif /* DOMAIN_FIELD */
> diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
> index 851d4f1f4ecb..00009bd5f006 100644
> --- a/tools/lib/perf/include/perf/schedstat-v17.h
> +++ b/tools/lib/perf/include/perf/schedstat-v17.h
> @@ -1,61 +1,160 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  
>  #ifdef CPU_FIELD
> -CPU_FIELD(__u32, yld_count, v17);
> -CPU_FIELD(__u32, array_exp, v17);
> -CPU_FIELD(__u32, sched_count, v17);
> -CPU_FIELD(__u32, sched_goidle, v17);
> -CPU_FIELD(__u32, ttwu_count, v17);
> -CPU_FIELD(__u32, ttwu_local, v17);
> -CPU_FIELD(__u64, rq_cpu_time, v17);
> -CPU_FIELD(__u64, run_delay, v17);
> -CPU_FIELD(__u64, pcount, v17);
> -#endif
> +CPU_FIELD(__u32, yld_count, "sched_yield() count",
> +	  "%11u", false, yld_count, v17);
> +CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
> +	  "%11u", false, array_exp, v17);
> +CPU_FIELD(__u32, sched_count, "schedule() called",
> +	  "%11u", false, sched_count, v17);
> +CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
> +	  "%11u", true, sched_count, v17);
> +CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
> +	  "%11u", false, ttwu_count, v17);
> +CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
> +	  "%11u", true, ttwu_count, v17);
> +CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
> +	  "%11llu", false, rq_cpu_time, v17);
> +CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
> +	  "%11llu", true, rq_cpu_time, v17);
> +CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
> +	  "%11llu", false, pcount, v17);
> +#endif /* CPU_FIELD */
>  
>  #ifdef DOMAIN_FIELD
> -DOMAIN_FIELD(__u32, busy_lb_count, v17);
> -DOMAIN_FIELD(__u32, busy_lb_balanced, v17);
> -DOMAIN_FIELD(__u32, busy_lb_failed, v17);
> -DOMAIN_FIELD(__u32, busy_lb_imbalance_load, v17);
> -DOMAIN_FIELD(__u32, busy_lb_imbalance_util, v17);
> -DOMAIN_FIELD(__u32, busy_lb_imbalance_task, v17);
> -DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit, v17);
> -DOMAIN_FIELD(__u32, busy_lb_gained, v17);
> -DOMAIN_FIELD(__u32, busy_lb_hot_gained, v17);
> -DOMAIN_FIELD(__u32, busy_lb_nobusyq, v17);
> -DOMAIN_FIELD(__u32, busy_lb_nobusyg, v17);
> -DOMAIN_FIELD(__u32, idle_lb_count, v17);
> -DOMAIN_FIELD(__u32, idle_lb_balanced, v17);
> -DOMAIN_FIELD(__u32, idle_lb_failed, v17);
> -DOMAIN_FIELD(__u32, idle_lb_imbalance_load, v17);
> -DOMAIN_FIELD(__u32, idle_lb_imbalance_util, v17);
> -DOMAIN_FIELD(__u32, idle_lb_imbalance_task, v17);
> -DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit, v17);
> -DOMAIN_FIELD(__u32, idle_lb_gained, v17);
> -DOMAIN_FIELD(__u32, idle_lb_hot_gained, v17);
> -DOMAIN_FIELD(__u32, idle_lb_nobusyq, v17);
> -DOMAIN_FIELD(__u32, idle_lb_nobusyg, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_count, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_balanced, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_failed, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_load, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_util, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_task, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_gained, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v17);
> -DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v17);
> -DOMAIN_FIELD(__u32, alb_count, v17);
> -DOMAIN_FIELD(__u32, alb_failed, v17);
> -DOMAIN_FIELD(__u32, alb_pushed, v17);
> -DOMAIN_FIELD(__u32, sbe_count, v17);
> -DOMAIN_FIELD(__u32, sbe_balanced, v17);
> -DOMAIN_FIELD(__u32, sbe_pushed, v17);
> -DOMAIN_FIELD(__u32, sbf_count, v17);
> -DOMAIN_FIELD(__u32, sbf_balanced, v17);
> -DOMAIN_FIELD(__u32, sbf_pushed, v17);
> -DOMAIN_FIELD(__u32, ttwu_wake_remote, v17);
> -DOMAIN_FIELD(__u32, ttwu_move_affine, v17);
> -DOMAIN_FIELD(__u32, ttwu_move_balance, v17);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category busy> ");
> +#endif
> +DOMAIN_FIELD(__u32, busy_lb_count,
> +	     "load_balance() count on cpu busy", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, busy_lb_balanced,
> +	     "load_balance() found balanced on cpu busy", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, busy_lb_failed,
> +	     "load_balance() move task failed on cpu busy", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_load,
> +	     "imbalance in load on cpu busy", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_util,
> +	     "imbalance in utilization on cpu busy", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_task,
> +	     "imbalance in number of tasks on cpu busy", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit,
> +	     "imbalance in misfit tasks on cpu busy", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, busy_lb_gained,
> +	     "pull_task() count on cpu busy", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v17);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v17);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v17);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category idle> ");
> +#endif
> +DOMAIN_FIELD(__u32, idle_lb_count,
> +	     "load_balance() count on cpu idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, idle_lb_balanced,
> +	     "load_balance() found balanced on cpu idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, idle_lb_failed,
> +	     "load_balance() move task failed on cpu idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_load,
> +	     "imbalance in load on cpu idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_util,
> +	     "imbalance in utilization on cpu idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_task,
> +	     "imbalance in number of tasks on cpu idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit,
> +	     "imbalance in misfit tasks on cpu idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, idle_lb_gained,
> +	     "pull_task() count on cpu idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v17);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v17);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v17);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category newidle> ");
> +#endif
> +DOMAIN_FIELD(__u32, newidle_lb_count,
> +	     "load_balance() count on cpu newly idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_failed,
> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_load,
> +	     "imbalance in load on cpu newly idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_util,
> +	     "imbalance in utilization on cpu newly idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_task,
> +	     "imbalance in number of tasks on cpu newly idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit,
> +	     "imbalance in misfit tasks on cpu newly idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_gained,
> +	     "pull_task() count on cpu newly idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v17);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v17);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v17);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v17);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
> +#endif
> +DOMAIN_FIELD(__u32, alb_count,
> +	     "active_load_balance() count", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, alb_failed,
> +	     "active_load_balance() move task failed", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, alb_pushed,
> +	     "active_load_balance() successfully moved a task", "%11u", false, v17);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbe_count,
> +	     "sbe_count is not used", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, sbe_balanced,
> +	     "sbe_balanced is not used", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, sbe_pushed,
> +	     "sbe_pushed is not used", "%11u", false, v17);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbf_count,
> +	     "sbf_count is not used", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, sbf_balanced,
> +	     "sbf_balanced is not used", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, sbf_pushed,
> +	     "sbf_pushed is not used", "%11u", false, v17);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Wakeup Info> ");
>  #endif
> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, ttwu_move_affine,
> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v17);
> +DOMAIN_FIELD(__u32, ttwu_move_balance,
> +	     "try_to_wake_up() started passive balancing", "%11u", false, v17);
> +#endif /* DOMAIN_FIELD */

Probably better to add these descriptions in the previous commits that
introduce the fields.


> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 1c3b56013164..e2e7dbc4f0aa 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -3869,6 +3869,501 @@ static int perf_sched__schedstat_record(struct perf_sched *sched,
>  	return err;
>  }
>  
> +struct schedstat_domain {
> +	struct perf_record_schedstat_domain *domain_data;
> +	struct schedstat_domain *next;
> +};
> +
> +struct schedstat_cpu {
> +	struct perf_record_schedstat_cpu *cpu_data;
> +	struct schedstat_domain *domain_head;
> +	struct schedstat_cpu *next;
> +};
> +
> +struct schedstat_cpu *cpu_head = NULL, *cpu_tail = NULL, *cpu_second_pass = NULL;
> +struct schedstat_domain *domain_tail = NULL, *domain_second_pass = NULL;

No need to initialize these to NULL; file-scope variables are
zero-initialized anyway.  Also please add some comments about how these
structs and lists are used.


> +bool after_workload_flag;
> +
> +static void store_schedtstat_cpu_diff(struct schedstat_cpu *after_workload)
> +{
> +	struct perf_record_schedstat_cpu *before = cpu_second_pass->cpu_data;
> +	struct perf_record_schedstat_cpu *after = after_workload->cpu_data;
> +	__u16 version = after_workload->cpu_data->version;
> +
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
> +	(before->_ver._name = after->_ver._name - before->_ver._name)
> +
> +	if (version == 15) {
> +#include <perf/schedstat-v15.h>
> +	} else if (version == 16) {
> +#include <perf/schedstat-v16.h>
> +	} else if (version == 17) {
> +#include <perf/schedstat-v17.h>
> +	}
> +
> +#undef CPU_FIELD
> +}
> +
> +static void store_schedstat_domain_diff(struct schedstat_domain *after_workload)
> +{
> +	struct perf_record_schedstat_domain *before = domain_second_pass->domain_data;
> +	struct perf_record_schedstat_domain *after = after_workload->domain_data;
> +	__u16 version = after_workload->domain_data->version;
> +
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
> +	(before->_ver._name = after->_ver._name - before->_ver._name)
> +
> +	if (version == 15) {
> +#include <perf/schedstat-v15.h>
> +	} else if (version == 16) {
> +#include <perf/schedstat-v16.h>
> +	} else if (version == 17) {
> +#include <perf/schedstat-v17.h>
> +	}
> +#undef DOMAIN_FIELD
> +}
> +
> +static void print_separator(size_t pre_dash_cnt, const char *s, size_t post_dash_cnt)
> +{
> +	size_t i;
> +
> +	for (i = 0; i < pre_dash_cnt; ++i)
> +		printf("-");
> +
> +	printf("%s", s);
> +
> +	for (i = 0; i < post_dash_cnt; ++i)
> +		printf("-");
> +
> +	printf("\n");

This can be simplified:

	printf("%.*s%s%.*s\n", pre_dash_cnt, graph_dotted_line, s,
		post_dash_cnt, graph_dotted_line);

> +}
> +
> +static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
> +{
> +	printf("%-65s %12s %12s\n", "DESC", "COUNT", "PCT_CHANGE");
> +	print_separator(100, "", 0);

	printf("%.*s\n", 100, graph_dotted_line);

You can define a macro for the length (100) as it's used in other places
too.

> +
> +#define CALC_PCT(_x, _y)	((_y) ? ((double)(_x) / (_y)) * 100 : 0.0)
> +
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
> +	do {									\
> +		printf("%-65s: " _format, _desc, cs->_ver._name);		\
> +		if (_is_pct) {							\
> +			printf("  ( %8.2lf%% )",				\
> +			       CALC_PCT(cs->_ver._name, cs->_ver._pct_of));	\
> +		}								\
> +		printf("\n");							\
> +	} while (0)
> +
> +	if (cs->version == 15) {
> +#include <perf/schedstat-v15.h>
> +	} else if (cs->version == 16) {
> +#include <perf/schedstat-v16.h>
> +	} else if (cs->version == 17) {
> +#include <perf/schedstat-v17.h>
> +	}
> +
> +#undef CPU_FIELD
> +#undef CALC_PCT
> +}
> +
> +static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
> +				      __u64 jiffies)
> +{
> +	printf("%-65s %12s %14s\n", "DESC", "COUNT", "AVG_JIFFIES");
> +
> +#define DOMAIN_CATEGORY(_desc)							\
> +	do {									\
> +		size_t _len = strlen(_desc);					\
> +		size_t _pre_dash_cnt = (100 - _len) / 2;			\
> +		size_t _post_dash_cnt = 100 - _len - _pre_dash_cnt;		\
> +		print_separator(_pre_dash_cnt, _desc, _post_dash_cnt);		\

This could be useful in other places; can you please factor it out
into a function somewhere in util.c?


> +	} while (0)
> +
> +#define CALC_AVG(_x, _y)	((_y) ? (long double)(_x) / (_y) : 0.0)
> +
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
> +	do {									\
> +		printf("%-65s: " _format, _desc, ds->_ver._name);		\
> +		if (_is_jiffies) {						\
> +			printf("  $ %11.2Lf $",					\
> +			       CALC_AVG(jiffies, ds->_ver._name));		\
> +		}								\
> +		printf("\n");							\
> +	} while (0)
> +
> +#define DERIVED_CNT_FIELD(_desc, _format, _x, _y, _z, _ver)			\
> +	printf("*%-64s: " _format "\n", _desc,					\
> +	       (ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))
> +
> +#define DERIVED_AVG_FIELD(_desc, _format, _x, _y, _z, _w, _ver)			\
> +	printf("*%-64s: " _format "\n", _desc, CALC_AVG(ds->_ver._w,		\
> +	       ((ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))))
> +
> +	if (ds->version == 15) {
> +#include <perf/schedstat-v15.h>
> +	} else if (ds->version == 16) {
> +#include <perf/schedstat-v16.h>
> +	} else if (ds->version == 17) {
> +#include <perf/schedstat-v17.h>
> +	}
> +
> +#undef DERIVED_AVG_FIELD
> +#undef DERIVED_CNT_FIELD
> +#undef DOMAIN_FIELD
> +#undef CALC_AVG
> +#undef DOMAIN_CATEGORY
> +}
> +
> +static void print_domain_cpu_list(struct perf_record_schedstat_domain *ds)
> +{
> +	char bin[16][5] = {"0000", "0001", "0010", "0011",
> +			   "0100", "0101", "0110", "0111",
> +			   "1000", "1001", "1010", "1011",
> +			   "1100", "1101", "1110", "1111"};
> +	bool print_flag = false, low = true;
> +	int cpu = 0, start, end, idx;
> +
> +	idx = ((ds->nr_cpus + 7) >> 3) - 1;
> +
> +	printf("<");
> +	while (idx >= 0) {
> +		__u8 index;
> +
> +		if (low)
> +			index = ds->cpu_mask[idx] & 0xf;
> +		else
> +			index = (ds->cpu_mask[idx--] & 0xf0) >> 4;

Isn't ds->cpu_mask a bitmap?  Can we use bitmap_scnprintf() or
something?

> +
> +		for (int i = 3; i >= 0; i--) {
> +			if (!print_flag && bin[index][i] == '1') {
> +				start = cpu;
> +				print_flag = true;
> +			} else if (print_flag && bin[index][i] == '0') {
> +				end = cpu - 1;
> +				print_flag = false;
> +				if (start == end)
> +					printf("%d, ", start);
> +				else
> +					printf("%d-%d, ", start, end);
> +			}
> +			cpu++;
> +		}
> +
> +		low = !low;
> +	}
> +
> +	if (print_flag) {
> +		if (start == cpu-1)
> +			printf("%d, ", start);
> +		else
> +			printf("%d-%d, ", start, cpu-1);
> +	}
> +	printf("\b\b>\n");
> +}
> +
> +static void summarize_schedstat_cpu(struct schedstat_cpu *summary_cpu,
> +				    struct schedstat_cpu *cptr,
> +				    int cnt, bool is_last)
> +{
> +	struct perf_record_schedstat_cpu *summary_cs = summary_cpu->cpu_data,
> +					 *temp_cs = cptr->cpu_data;
> +
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
> +	do {									\
> +		summary_cs->_ver._name += temp_cs->_ver._name;			\
> +		if (is_last)							\
> +			summary_cs->_ver._name /= cnt;				\
> +	} while (0)
> +
> +	if (cptr->cpu_data->version == 15) {
> +#include <perf/schedstat-v15.h>
> +	} else if (cptr->cpu_data->version == 16) {
> +#include <perf/schedstat-v16.h>
> +	} else if (cptr->cpu_data->version == 17) {
> +#include <perf/schedstat-v17.h>
> +	}
> +#undef CPU_FIELD
> +}
> +
> +static void summarize_schedstat_domain(struct schedstat_domain *summary_domain,
> +				       struct schedstat_domain *dptr,
> +				       int cnt, bool is_last)
> +{
> +	struct perf_record_schedstat_domain *summary_ds = summary_domain->domain_data,
> +					    *temp_ds = dptr->domain_data;
> +
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
> +	do {									\
> +		summary_ds->_ver._name += temp_ds->_ver._name;			\
> +		if (is_last)							\
> +			summary_ds->_ver._name /= cnt;				\
> +	} while (0)
> +
> +	if (dptr->domain_data->version == 15) {
> +#include <perf/schedstat-v15.h>
> +	} else if (dptr->domain_data->version == 16) {
> +#include <perf/schedstat-v16.h>
> +	} else if (dptr->domain_data->version == 17) {
> +#include <perf/schedstat-v17.h>
> +	}
> +#undef DOMAIN_FIELD
> +}
> +
> +static void get_all_cpu_stats(struct schedstat_cpu **cptr)
> +{
> +	struct schedstat_domain *dptr = NULL, *tdptr = NULL, *dtail = NULL;
> +	struct schedstat_cpu *tcptr = *cptr, *summary_head = NULL;
> +	struct perf_record_schedstat_domain *ds = NULL;
> +	struct perf_record_schedstat_cpu *cs = NULL;
> +	bool is_last = false;
> +	int cnt = 0;
> +
> +	if (tcptr) {
> +		summary_head = zalloc(sizeof(*summary_head));
> +		summary_head->cpu_data = zalloc(sizeof(*cs));

No error handling.


> +		memcpy(summary_head->cpu_data, tcptr->cpu_data, sizeof(*cs));
> +		summary_head->next = NULL;
> +		summary_head->domain_head = NULL;
> +		dptr = tcptr->domain_head;
> +
> +		while (dptr) {
> +			size_t cpu_mask_size = (dptr->domain_data->nr_cpus + 7) >> 3;
> +
> +			tdptr = zalloc(sizeof(*tdptr));
> +			tdptr->domain_data = zalloc(sizeof(*ds) + cpu_mask_size);

Ditto.


> +			memcpy(tdptr->domain_data, dptr->domain_data, sizeof(*ds) + cpu_mask_size);
> +
> +			tdptr->next = NULL;
> +			if (!dtail) {
> +				summary_head->domain_head = tdptr;
> +				dtail = tdptr;
> +			} else {
> +				dtail->next = tdptr;
> +				dtail = dtail->next;
> +			}
> +			dptr = dptr->next;

Hmm, can we just use list_head?


> +		}
> +	}
> +
> +	tcptr = (*cptr)->next;
> +	while (tcptr) {
> +		if (!tcptr->next)
> +			is_last = true;
> +
> +		cnt++;
> +		summarize_schedstat_cpu(summary_head, tcptr, cnt, is_last);
> +		tdptr = summary_head->domain_head;
> +		dptr = tcptr->domain_head;
> +
> +		while (tdptr) {
> +			summarize_schedstat_domain(tdptr, dptr, cnt, is_last);
> +			tdptr = tdptr->next;
> +			dptr = dptr->next;
> +		}
> +		tcptr = tcptr->next;
> +	}
> +
> +	tcptr = *cptr;
> +	summary_head->next = tcptr;
> +	*cptr = summary_head;
> +}
> +
> +/* FIXME: The code fails (segfaults) when one or ore cpus are offline. */

Sounds scary.  Do you have any clue why?


> +static void show_schedstat_data(struct schedstat_cpu *cptr)
> +{
> +	struct perf_record_schedstat_domain *ds = NULL;
> +	struct perf_record_schedstat_cpu *cs = NULL;
> +	__u64 jiffies = cptr->cpu_data->timestamp;
> +	struct schedstat_domain *dptr = NULL;
> +	bool is_summary = true;
> +
> +	printf("Columns description\n");
> +	print_separator(100, "", 0);
> +	printf("DESC\t\t\t-> Description of the field\n");
> +	printf("COUNT\t\t\t-> Value of the field\n");
> +	printf("PCT_CHANGE\t\t-> Percent change with corresponding base value\n");
> +	printf("AVG_JIFFIES\t\t-> Avg time in jiffies between two consecutive occurrence of event\n");
> +
> +	print_separator(100, "", 0);
> +	printf("Time elapsed (in jiffies)                                        : %11llu\n",

Probably better to use printf("%-*s: %11llu\n", ...).


> +	       jiffies);
> +	print_separator(100, "", 0);
> +
> +	get_all_cpu_stats(&cptr);
> +
> +	while (cptr) {
> +		cs = cptr->cpu_data;
> +		printf("\n");
> +		print_separator(100, "", 0);
> +		if (is_summary)
> +			printf("CPU <ALL CPUS SUMMARY>\n");
> +		else
> +			printf("CPU %d\n", cs->cpu);
> +
> +		print_separator(100, "", 0);
> +		print_cpu_stats(cs);
> +		print_separator(100, "", 0);
> +
> +		dptr = cptr->domain_head;
> +
> +		while (dptr) {
> +			ds = dptr->domain_data;
> +			if (is_summary)
> +				if (ds->name[0])
> +					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %s\n", ds->name);
> +				else
> +					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %d\n", ds->domain);
> +			else {
> +				if (ds->name[0])
> +					printf("CPU %d, DOMAIN %s CPUS ", cs->cpu, ds->name);
> +				else
> +					printf("CPU %d, DOMAIN %d CPUS ", cs->cpu, ds->domain);
> +
> +				print_domain_cpu_list(ds);
> +			}
> +			print_separator(100, "", 0);
> +			print_domain_stats(ds, jiffies);
> +			print_separator(100, "", 0);
> +
> +			dptr = dptr->next;
> +		}
> +		is_summary = false;
> +		cptr = cptr->next;
> +	}
> +}
> +
> +static int perf_sched__process_schedstat(struct perf_session *session __maybe_unused,
> +					 union perf_event *event)
> +{
> +	struct perf_cpu this_cpu;
> +	static __u32 initial_cpu;
> +
> +	switch (event->header.type) {
> +	case PERF_RECORD_SCHEDSTAT_CPU:
> +		this_cpu.cpu = event->schedstat_cpu.cpu;
> +		break;
> +	case PERF_RECORD_SCHEDSTAT_DOMAIN:
> +		this_cpu.cpu = event->schedstat_domain.cpu;
> +		break;
> +	default:
> +		return 0;
> +	}
> +
> +	if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu))
> +		return 0;
> +
> +	if (event->header.type == PERF_RECORD_SCHEDSTAT_CPU) {
> +		struct schedstat_cpu *temp = zalloc(sizeof(struct schedstat_cpu));
> +
> +		temp->cpu_data = zalloc(sizeof(struct perf_record_schedstat_cpu));

No error checks.


> +		memcpy(temp->cpu_data, &event->schedstat_cpu,
> +		       sizeof(struct perf_record_schedstat_cpu));
> +		temp->next = NULL;
> +		temp->domain_head = NULL;
> +
> +		if (cpu_head && temp->cpu_data->cpu == initial_cpu)
> +			after_workload_flag = true;
> +
> +		if (!after_workload_flag) {
> +			if (!cpu_head) {
> +				initial_cpu = temp->cpu_data->cpu;
> +				cpu_head = temp;
> +			} else
> +				cpu_tail->next = temp;
> +
> +			cpu_tail = temp;
> +		} else {
> +			if (temp->cpu_data->cpu == initial_cpu) {
> +				cpu_second_pass = cpu_head;
> +				cpu_head->cpu_data->timestamp =
> +					temp->cpu_data->timestamp - cpu_second_pass->cpu_data->timestamp;
> +			} else {
> +				cpu_second_pass = cpu_second_pass->next;
> +			}
> +			domain_second_pass = cpu_second_pass->domain_head;
> +			store_schedtstat_cpu_diff(temp);

Is 'temp' used after this?


> +		}
> +	} else if (event->header.type == PERF_RECORD_SCHEDSTAT_DOMAIN) {
> +		size_t cpu_mask_size = (event->schedstat_domain.nr_cpus + 7) >> 3;
> +		struct schedstat_domain *temp = zalloc(sizeof(struct schedstat_domain));
> +
> +		temp->domain_data = zalloc(sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);

No error checks.


> +		memcpy(temp->domain_data, &event->schedstat_domain,
> +		       sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
> +		temp->next = NULL;
> +
> +		if (!after_workload_flag) {
> +			if (cpu_tail->domain_head == NULL) {
> +				cpu_tail->domain_head = temp;
> +				domain_tail = temp;
> +			} else {
> +				domain_tail->next = temp;
> +				domain_tail = temp;
> +			}
> +		} else {
> +			store_schedstat_domain_diff(temp);
> +			domain_second_pass = domain_second_pass->next;

Is 'temp' leaking?


> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void free_schedstat(struct schedstat_cpu *cptr)
> +{
> +	struct schedstat_domain *dptr = NULL, *tmp_dptr;
> +	struct schedstat_cpu *tmp_cptr;
> +
> +	while (cptr) {
> +		tmp_cptr = cptr;
> +		dptr = cptr->domain_head;
> +
> +		while (dptr) {
> +			tmp_dptr = dptr;
> +			dptr = dptr->next;
> +			free(tmp_dptr);
> +		}
> +		cptr = cptr->next;
> +		free(tmp_cptr);
> +	}
> +}
> +
> +static int perf_sched__schedstat_report(struct perf_sched *sched)
> +{
> +	struct perf_session *session;
> +	struct perf_data data = {
> +		.path  = input_name,
> +		.mode  = PERF_DATA_MODE_READ,
> +	};
> +	int err;
> +
> +	if (cpu_list) {
> +		user_requested_cpus = perf_cpu_map__new(cpu_list);
> +		if (!user_requested_cpus)
> +			return -EINVAL;
> +	}
> +
> +	sched->tool.schedstat_cpu = perf_sched__process_schedstat;
> +	sched->tool.schedstat_domain = perf_sched__process_schedstat;
> +
> +	session = perf_session__new(&data, &sched->tool);
> +	if (IS_ERR(session)) {
> +		pr_err("Perf session creation failed.\n");
> +		return PTR_ERR(session);
> +	}
> +
> +	err = perf_session__process_events(session);
> +
> +	perf_session__delete(session);

Quite unusual location to do this. :)  Probably better to call it after
finishing the actual logic, as you might need some session data later.


> +	if (!err) {
> +		setup_pager();
> +		show_schedstat_data(cpu_head);
> +		free_schedstat(cpu_head);
> +	}

	perf_cpu_map__put(user_requested_cpus);

> +	return err;
> +}
> +
>  static bool schedstat_events_exposed(void)
>  {
>  	/*
> @@ -4046,6 +4541,8 @@ int cmd_sched(int argc, const char **argv)
>  	OPT_PARENT(sched_options)
>  	};
>  	const struct option stats_options[] = {
> +	OPT_STRING('i', "input", &input_name, "file",
> +		   "`stats report` with input filename"),
>  	OPT_STRING('o', "output", &output_name, "file",
>  		   "`stats record` with output filename"),
>  	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
> @@ -4171,7 +4668,7 @@ int cmd_sched(int argc, const char **argv)
>  
>  		return perf_sched__timehist(&sched);
>  	} else if (!strcmp(argv[0], "stats")) {
> -		const char *const stats_subcommands[] = {"record", NULL};
> +		const char *const stats_subcommands[] = {"record", "report", NULL};
>  
>  		argc = parse_options_subcommand(argc, argv, stats_options,
>  						stats_subcommands,
> @@ -4183,6 +4680,11 @@ int cmd_sched(int argc, const char **argv)
>  				argc = parse_options(argc, argv, stats_options,
>  						     stats_usage, 0);
>  			return perf_sched__schedstat_record(&sched, argc, argv);
> +		} else if (argv[0] && !strcmp(argv[0], "report")) {
> +			if (argc)
> +				argc = parse_options(argc, argv, stats_options,
> +						     stats_usage, 0);
> +			return perf_sched__schedstat_report(&sched);
>  		}
>  		usage_with_options(stats_usage, stats_options);
>  	} else {
> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
> index d09c3c99ab48..4071bd95192d 100644
> --- a/tools/perf/util/event.c
> +++ b/tools/perf/util/event.c
> @@ -560,7 +560,7 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
>  
>  	size = fprintf(fp, "\ncpu%u ", cs->cpu);
>  
> -#define CPU_FIELD(_type, _name, _ver)						\
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
>  	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)cs->_ver._name)
>  
>  	if (version == 15) {
> @@ -641,7 +641,7 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
>  	size += fprintf(fp, "%s ", cpu_mask);
>  	free(cpu_mask);
>  
> -#define DOMAIN_FIELD(_type, _name, _ver)					\
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
>  	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)ds->_ver._name)
>  
>  	if (version == 15) {
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index fad0c472f297..495ed8433c0c 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -2538,7 +2538,7 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
>  	if (io__get_dec(io, (__u64 *)cpu) != ' ')
>  		goto out_cpu;
>  
> -#define CPU_FIELD(_type, _name, _ver)					\
> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
>  	do {								\
>  		__u64 _tmp;						\
>  		ch = io__get_dec(io, &_tmp);				\
> @@ -2662,7 +2662,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>  	free(d_name);
>  	free(cpu_mask);
>  
> -#define DOMAIN_FIELD(_type, _name, _ver)				\
> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
>  	do {								\
>  		__u64 _tmp;						\
>  		ch = io__get_dec(io, &_tmp);				\
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/8] perf sched stats: Add support for live mode
  2025-03-11 12:02 ` [PATCH v3 5/8] perf sched stats: Add support for live mode Swapnil Sapkal
@ 2025-03-15  4:46   ` Namhyung Kim
  2025-03-24  9:15     ` Sapkal, Swapnil
  0 siblings, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2025-03-15  4:46 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

On Tue, Mar 11, 2025 at 12:02:27PM +0000, Swapnil Sapkal wrote:
> The live mode works similarly to a plain `perf stat` command: it profiles
> the target and prints results on the terminal as soon as the target
> finishes.
> 
> Example usage:
> 
>   # perf sched stats -- sleep 10
> 
> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> Tested-by: James Clark <james.clark@linaro.org>
> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> ---
>  tools/perf/builtin-sched.c | 87 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 86 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index e2e7dbc4f0aa..9813e25b54b8 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -4364,6 +4364,91 @@ static int perf_sched__schedstat_report(struct perf_sched *sched)
>  	return err;
>  }
>  
> +static int process_synthesized_event_live(const struct perf_tool *tool __maybe_unused,
> +					  union perf_event *event,
> +					  struct perf_sample *sample __maybe_unused,
> +					  struct machine *machine __maybe_unused)
> +{
> +	return perf_sched__process_schedstat(NULL, event);
> +}
> +
> +static int perf_sched__schedstat_live(struct perf_sched *sched,
> +				      int argc, const char **argv)
> +{
> +	struct evlist *evlist;
> +	struct target *target;
> +	int reset = 0;
> +	int err = 0;
> +
> +	signal(SIGINT, sighandler);
> +	signal(SIGCHLD, sighandler);
> +	signal(SIGTERM, sighandler);
> +
> +	evlist = evlist__new();
> +	if (!evlist)
> +		return -ENOMEM;
> +
> +	/*
> +	 * `perf sched schedstat` does not support workload profiling (-p pid)
> +	 * since /proc/schedstat file contains cpu specific data only. Hence, a
> +	 * profile target is either set of cpus or systemwide, never a process.
> +	 * Note that, although `-- <workload>` is supported, profile data are
> +	 * still cpu/systemwide.
> +	 */
> +	target = zalloc(sizeof(struct target));

As I said, you can put it on the stack.


> +	if (cpu_list)
> +		target->cpu_list = cpu_list;
> +	else
> +		target->system_wide = true;
> +
> +	if (argc) {
> +		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
> +		if (err)
> +			goto out_target;
> +	}
> +
> +	if (cpu_list) {
> +		user_requested_cpus = perf_cpu_map__new(cpu_list);
> +		if (!user_requested_cpus)
> +			goto out_target;
> +	}

How about this instead?

	evlist__create_maps(evlist, target);

> +
> +	err = perf_event__synthesize_schedstat(&(sched->tool),
> +					       process_synthesized_event_live,
> +					       user_requested_cpus);
> +	if (err < 0)
> +		goto out_target;
> +
> +	err = enable_sched_schedstats(&reset);
> +	if (err < 0)
> +		goto out_target;
> +
> +	if (argc)
> +		evlist__start_workload(evlist);
> +
> +	/* wait for signal */
> +	pause();
> +
> +	if (reset) {
> +		err = disable_sched_schedstat();
> +		if (err < 0)
> +			goto out_target;
> +	}
> +
> +	err = perf_event__synthesize_schedstat(&(sched->tool),
> +					       process_synthesized_event_live,
> +					       user_requested_cpus);
> +	if (err)
> +		goto out_target;
> +
> +	setup_pager();
> +	show_schedstat_data(cpu_head);
> +	free_schedstat(cpu_head);
> +out_target:
> +	free(target);

	evlist__delete(evlist);

And, unless you switch to using evlist__create_maps(), also:

	perf_cpu_map__put(user_requested_cpus);

Thanks,
Namhyung


> +	return err;
> +}
> +
>  static bool schedstat_events_exposed(void)
>  {
>  	/*
> @@ -4686,7 +4771,7 @@ int cmd_sched(int argc, const char **argv)
>  						     stats_usage, 0);
>  			return perf_sched__schedstat_report(&sched);
>  		}
> -		usage_with_options(stats_usage, stats_options);
> +		return perf_sched__schedstat_live(&sched, argc, argv);
>  	} else {
>  		usage_with_options(sched_usage, sched_options);
>  	}
> -- 
> 2.43.0
> 

* Re: [PATCH v3 1/8] perf sched stats: Add record and rawdump support
  2025-03-15  2:24   ` Namhyung Kim
@ 2025-03-17 13:29     ` Sapkal, Swapnil
  0 siblings, 0 replies; 23+ messages in thread
From: Sapkal, Swapnil @ 2025-03-17 13:29 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

Hello Namhyung,

Thank you for reviewing the series.

On 3/15/2025 7:54 AM, Namhyung Kim wrote:
> Hello,
> 
> On Tue, Mar 11, 2025 at 12:02:23PM +0000, Swapnil Sapkal wrote:
>> Define new, perf tool only, sample types and their layouts. Add logic
>> to parse /proc/schedstat, convert it to perf sample format and save
>> samples to perf.data file with `perf sched stats record` command. Also
>> add logic to read perf.data file, interpret schedstat samples and
>> print rawdump of samples with `perf script -D`.
>>
>> Note that /proc/schedstat output is standardized with a version number.
>> The patch supports v15, but older or newer versions can be added easily.
>>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Tested-by: James Clark <james.clark@linaro.org>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> ---
>>   tools/lib/perf/Documentation/libperf.txt    |   2 +
>>   tools/lib/perf/Makefile                     |   2 +-
>>   tools/lib/perf/include/perf/event.h         |  42 ++++
>>   tools/lib/perf/include/perf/schedstat-v15.h |  52 +++++
>>   tools/perf/builtin-inject.c                 |   2 +
>>   tools/perf/builtin-sched.c                  | 226 +++++++++++++++++-
>>   tools/perf/util/event.c                     |  98 ++++++++
>>   tools/perf/util/event.h                     |   2 +
>>   tools/perf/util/session.c                   |  20 ++
>>   tools/perf/util/synthetic-events.c          | 239 ++++++++++++++++++++
>>   tools/perf/util/synthetic-events.h          |   3 +
>>   tools/perf/util/tool.c                      |  20 ++
>>   tools/perf/util/tool.h                      |   4 +-
>>   13 files changed, 709 insertions(+), 3 deletions(-)
>>   create mode 100644 tools/lib/perf/include/perf/schedstat-v15.h
>>
>> diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
>> index 59aabdd3cabf..3f295639903d 100644
>> --- a/tools/lib/perf/Documentation/libperf.txt
>> +++ b/tools/lib/perf/Documentation/libperf.txt
>> @@ -210,6 +210,8 @@ SYNOPSIS
>>     struct perf_record_time_conv;
>>     struct perf_record_header_feature;
>>     struct perf_record_compressed;
>> +  struct perf_record_schedstat_cpu;
>> +  struct perf_record_schedstat_domain;
>>   --
>>   
>>   DESCRIPTION
>> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
>> index e9a7ac2c062e..4b60804aa0b6 100644
>> --- a/tools/lib/perf/Makefile
>> +++ b/tools/lib/perf/Makefile
>> @@ -174,7 +174,7 @@ install_lib: libs
>>   		$(call do_install_mkdir,$(libdir_SQ)); \
>>   		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>>   
>> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
>> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h
>>   INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h
>>   
>>   INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
>> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
>> index 37bb7771d914..189106874063 100644
>> --- a/tools/lib/perf/include/perf/event.h
>> +++ b/tools/lib/perf/include/perf/event.h
>> @@ -457,6 +457,44 @@ struct perf_record_compressed {
>>   	char			 data[];
>>   };
>>   
>> +struct perf_record_schedstat_cpu_v15 {
>> +#define CPU_FIELD(_type, _name, _ver)		_type _name
>> +#include "schedstat-v15.h"
>> +#undef CPU_FIELD
>> +};
>> +
>> +struct perf_record_schedstat_cpu {
>> +	struct perf_event_header header;
>> +	__u64			 timestamp;
>> +	union {
>> +		struct perf_record_schedstat_cpu_v15 v15;
>> +	};
>> +	__u32			 cpu;
>> +	__u16			 version;
> 
> Why not put these before the union?  I think the union will have a variable
> size once you add different versions, and then it'd be hard to access the
> fields after it.  You may want to add padding explicitly.
> 

I put these fields after the union to avoid holes, but your point makes
sense: as we add different versions, the union size will change. Sure, I will
add explicit padding in the next version.

>> +};
>> +
>> +struct perf_record_schedstat_domain_v15 {
>> +#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
>> +#include "schedstat-v15.h"
>> +#undef DOMAIN_FIELD
>> +};
>> +
>> +#define DOMAIN_NAME_LEN		16
>> +
>> +struct perf_record_schedstat_domain {
>> +	struct perf_event_header header;
>> +	__u16			 version;
>> +	__u64			 timestamp;
>> +	__u32			 cpu;
>> +	__u16			 domain;
> 
> If this has information similar to schedstat_cpu, I think it's better to
> start with the same layout.  And having version before timestamp would add
> unnecessary padding.
> 

Yes, it has the same information. I will keep the layout the same.

> 
>> +	char			 name[DOMAIN_NAME_LEN];
>> +	union {
>> +		struct perf_record_schedstat_domain_v15 v15;
>> +	};
>> +	__u16			 nr_cpus;
>> +	__u8			 cpu_mask[];
> 
> Does cpu_mask represent the domain membership?  Maybe you can split that
> info into a separate record, or put it in a header feature like we have
> for topology information.
> 

I see what you are suggesting. I will think more on this and
come back.

> 
>> +};
>> +
>>   enum perf_user_event_type { /* above any possible kernel type */
>>   	PERF_RECORD_USER_TYPE_START		= 64,
>>   	PERF_RECORD_HEADER_ATTR			= 64,
>> @@ -478,6 +516,8 @@ enum perf_user_event_type { /* above any possible kernel type */
>>   	PERF_RECORD_HEADER_FEATURE		= 80,
>>   	PERF_RECORD_COMPRESSED			= 81,
>>   	PERF_RECORD_FINISHED_INIT		= 82,
>> +	PERF_RECORD_SCHEDSTAT_CPU		= 83,
>> +	PERF_RECORD_SCHEDSTAT_DOMAIN		= 84,
>>   	PERF_RECORD_HEADER_MAX
>>   };
>>   
>> @@ -518,6 +558,8 @@ union perf_event {
>>   	struct perf_record_time_conv		time_conv;
>>   	struct perf_record_header_feature	feat;
>>   	struct perf_record_compressed		pack;
>> +	struct perf_record_schedstat_cpu	schedstat_cpu;
>> +	struct perf_record_schedstat_domain	schedstat_domain;
>>   };
>>   
>>   #endif /* __LIBPERF_EVENT_H */
>> diff --git a/tools/lib/perf/include/perf/schedstat-v15.h b/tools/lib/perf/include/perf/schedstat-v15.h
>> new file mode 100644
>> index 000000000000..43f8060c5337
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/schedstat-v15.h
>> @@ -0,0 +1,52 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifdef CPU_FIELD
>> +CPU_FIELD(__u32, yld_count, v15);
>> +CPU_FIELD(__u32, array_exp, v15);
>> +CPU_FIELD(__u32, sched_count, v15);
>> +CPU_FIELD(__u32, sched_goidle, v15);
>> +CPU_FIELD(__u32, ttwu_count, v15);
>> +CPU_FIELD(__u32, ttwu_local, v15);
>> +CPU_FIELD(__u64, rq_cpu_time, v15);
>> +CPU_FIELD(__u64, run_delay, v15);
>> +CPU_FIELD(__u64, pcount, v15);
>> +#endif
>> +
>> +#ifdef DOMAIN_FIELD
>> +DOMAIN_FIELD(__u32, idle_lb_count, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_balanced, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_failed, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_gained, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_count, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_balanced, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_failed, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_gained, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_count, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_failed, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_gained, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15);
>> +DOMAIN_FIELD(__u32, alb_count, v15);
>> +DOMAIN_FIELD(__u32, alb_failed, v15);
>> +DOMAIN_FIELD(__u32, alb_pushed, v15);
>> +DOMAIN_FIELD(__u32, sbe_count, v15);
>> +DOMAIN_FIELD(__u32, sbe_balanced, v15);
>> +DOMAIN_FIELD(__u32, sbe_pushed, v15);
>> +DOMAIN_FIELD(__u32, sbf_count, v15);
>> +DOMAIN_FIELD(__u32, sbf_balanced, v15);
>> +DOMAIN_FIELD(__u32, sbf_pushed, v15);
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote, v15);
>> +DOMAIN_FIELD(__u32, ttwu_move_affine, v15);
>> +DOMAIN_FIELD(__u32, ttwu_move_balance, v15);
>> +#endif
>> diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
>> index 11e49cafa3af..af1add2abf72 100644
>> --- a/tools/perf/builtin-inject.c
>> +++ b/tools/perf/builtin-inject.c
>> @@ -2530,6 +2530,8 @@ int cmd_inject(int argc, const char **argv)
>>   	inject.tool.finished_init	= perf_event__repipe_op2_synth;
>>   	inject.tool.compressed		= perf_event__repipe_op4_synth;
>>   	inject.tool.auxtrace		= perf_event__repipe_auxtrace;
>> +	inject.tool.schedstat_cpu	= perf_event__repipe_op2_synth;
>> +	inject.tool.schedstat_domain	= perf_event__repipe_op2_synth;
>>   	inject.tool.dont_split_sample_group = true;
>>   	inject.session = __perf_session__new(&data, &inject.tool,
>>   					     /*trace_event_repipe=*/inject.output.is_pipe);
>> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
>> index 26ece6e9bfd1..1c3b56013164 100644
>> --- a/tools/perf/builtin-sched.c
>> +++ b/tools/perf/builtin-sched.c
>> @@ -28,6 +28,8 @@
>>   #include "util/debug.h"
>>   #include "util/event.h"
>>   #include "util/util.h"
>> +#include "util/synthetic-events.h"
>> +#include "util/target.h"
>>   
>>   #include <linux/kernel.h>
>>   #include <linux/log2.h>
>> @@ -55,6 +57,7 @@
>>   #define MAX_PRIO		140
>>   
>>   static const char *cpu_list;
>> +static struct perf_cpu_map *user_requested_cpus;
> 
> I guess this can be in evlist.
> 

Sure, will add it to the evlist.

> 
>>   static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
>>   
>>   struct sched_atom;
>> @@ -236,6 +239,9 @@ struct perf_sched {
>>   	volatile bool   thread_funcs_exit;
>>   	const char	*prio_str;
>>   	DECLARE_BITMAP(prio_bitmap, MAX_PRIO);
>> +
>> +	struct perf_session *session;
>> +	struct perf_data *data;
>>   };
>>   
>>   /* per thread run time data */
>> @@ -3670,6 +3676,199 @@ static void setup_sorting(struct perf_sched *sched, const struct option *options
>>   	sort_dimension__add("pid", &sched->cmp_pid);
>>   }
>>   
>> +static int process_synthesized_schedstat_event(const struct perf_tool *tool,
>> +					       union perf_event *event,
>> +					       struct perf_sample *sample __maybe_unused,
>> +					       struct machine *machine __maybe_unused)
>> +{
>> +	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
>> +
>> +	if (perf_data__write(sched->data, event, event->header.size) <= 0) {
>> +		pr_err("failed to write perf data, error: %m\n");
>> +		return -1;
>> +	}
>> +
>> +	sched->session->header.data_size += event->header.size;
>> +	return 0;
>> +}
>> +
>> +static void sighandler(int sig __maybe_unused)
>> +{
>> +}
>> +
>> +static int enable_sched_schedstats(int *reset)
>> +{
>> +	char path[PATH_MAX];
>> +	FILE *fp;
>> +	char ch;
>> +
>> +	snprintf(path, PATH_MAX, "%s/sys/kernel/sched_schedstats", procfs__mountpoint());
>> +	fp = fopen(path, "w+");
>> +	if (!fp) {
>> +		pr_err("Failed to open %s\n", path);
>> +		return -1;
>> +	}
>> +
>> +	ch = getc(fp);
>> +	if (ch == '0') {
>> +		*reset = 1;
>> +		rewind(fp);
>> +		putc('1', fp);
>> +		fclose(fp);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int disable_sched_schedstat(void)
>> +{
>> +	char path[PATH_MAX];
>> +	FILE *fp;
>> +
>> +	snprintf(path, PATH_MAX, "%s/sys/kernel/sched_schedstats", procfs__mountpoint());
>> +	fp = fopen(path, "w");
>> +	if (!fp) {
>> +		pr_err("Failed to open %s\n", path);
>> +		return -1;
>> +	}
>> +
>> +	putc('0', fp);
>> +	fclose(fp);
>> +	return 0;
>> +}
>> +
>> +/* perf.data or any other output file name used by stats subcommand (only). */
>> +const char *output_name;
>> +
>> +static int perf_sched__schedstat_record(struct perf_sched *sched,
>> +					int argc, const char **argv)
>> +{
>> +	struct perf_session *session;
>> +	struct evlist *evlist;
>> +	struct target *target;
>> +	int reset = 0;
>> +	int err = 0;
>> +	int fd;
>> +	struct perf_data data = {
>> +		.path  = output_name,
>> +		.mode  = PERF_DATA_MODE_WRITE,
>> +	};
>> +
>> +	signal(SIGINT, sighandler);
>> +	signal(SIGCHLD, sighandler);
>> +	signal(SIGTERM, sighandler);
>> +
>> +	evlist = evlist__new();
>> +	if (!evlist)
>> +		return -ENOMEM;
>> +
>> +	session = perf_session__new(&data, &sched->tool);
>> +	if (IS_ERR(session)) {
>> +		pr_err("Perf session creation failed.\n");
> 
> Also need evlist__delete().
> 

Sure, will add it in the next version.

> 
>> +		return PTR_ERR(session);
>> +	}
>> +
>> +	session->evlist = evlist;
>> +
>> +	sched->session = session;
>> +	sched->data = &data;
>> +
>> +	fd = perf_data__fd(&data);
>> +
>> +	/*
>> +	 * Capture all important metadata about the system. Although they are
>> +	 * not used by `perf sched stats` tool directly, they provide useful
>> +	 * information about profiled environment.
>> +	 */
>> +	perf_header__set_feat(&session->header, HEADER_HOSTNAME);
>> +	perf_header__set_feat(&session->header, HEADER_OSRELEASE);
>> +	perf_header__set_feat(&session->header, HEADER_VERSION);
>> +	perf_header__set_feat(&session->header, HEADER_ARCH);
>> +	perf_header__set_feat(&session->header, HEADER_NRCPUS);
>> +	perf_header__set_feat(&session->header, HEADER_CPUDESC);
>> +	perf_header__set_feat(&session->header, HEADER_CPUID);
>> +	perf_header__set_feat(&session->header, HEADER_TOTAL_MEM);
>> +	perf_header__set_feat(&session->header, HEADER_CMDLINE);
>> +	perf_header__set_feat(&session->header, HEADER_CPU_TOPOLOGY);
>> +	perf_header__set_feat(&session->header, HEADER_NUMA_TOPOLOGY);
>> +	perf_header__set_feat(&session->header, HEADER_CACHE);
>> +	perf_header__set_feat(&session->header, HEADER_MEM_TOPOLOGY);
>> +	perf_header__set_feat(&session->header, HEADER_CPU_PMU_CAPS);
>> +	perf_header__set_feat(&session->header, HEADER_HYBRID_TOPOLOGY);
>> +	perf_header__set_feat(&session->header, HEADER_PMU_CAPS);
> 
> Probably you don't need {CPU_,}PMU_CAPS.  Also I wonder if it's possible
> to add cpu-domain info here.
> 

I will skip {CPU_,}PMU_CAPS. I will think about adding cpu-domain info here.

>> +
>> +	err = perf_session__write_header(session, evlist, fd, false);
>> +	if (err < 0)
>> +		goto out;
>> +
>> +	/*
>> +	 * `perf sched stats` does not support workload profiling (-p pid)
>> +	 * since /proc/schedstat file contains cpu specific data only. Hence, a
>> +	 * profile target is either set of cpus or systemwide, never a process.
>> +	 * Note that, although `-- <workload>` is supported, profile data are
>> +	 * still cpu/systemwide.
>> +	 */
>> +	target = zalloc(sizeof(struct target));
> 
> It seems there's no need to allocate the target; just putting it on the
> stack would be fine.
> 

Sure, will change this.

> 
>> +	if (cpu_list)
>> +		target->cpu_list = cpu_list;
>> +	else
>> +		target->system_wide = true;
>> +
>> +	if (argc) {
>> +		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
>> +		if (err)
>> +			goto out_target;
>> +	}
>> +
>> +	if (cpu_list) {
>> +		user_requested_cpus = perf_cpu_map__new(cpu_list);
> 
> Where is this freed?
> 

Will fix this.

> 
>> +		if (!user_requested_cpus)
>> +			goto out_target;
>> +	}
>> +
>> +	err = perf_event__synthesize_schedstat(&(sched->tool),
>> +					       process_synthesized_schedstat_event,
>> +					       user_requested_cpus);
>> +	if (err < 0)
>> +		goto out_target;
>> +
>> +	err = enable_sched_schedstats(&reset);
>> +	if (err < 0)
>> +		goto out_target;
>> +
>> +	if (argc)
>> +		evlist__start_workload(evlist);
>> +
>> +	/* wait for signal */
>> +	pause();
>> +
>> +	if (reset) {
>> +		err = disable_sched_schedstat();
>> +		if (err < 0)
>> +			goto out_target;
>> +	}
>> +
>> +	err = perf_event__synthesize_schedstat(&(sched->tool),
>> +					       process_synthesized_schedstat_event,
>> +					       user_requested_cpus);
>> +	if (err < 0)
>> +		goto out_target;
>> +
>> +	err = perf_session__write_header(session, evlist, fd, true);
>> +
>> +out_target:
>> +	free(target);
>> +out:
>> +	if (!err)
>> +		fprintf(stderr, "[ perf sched stats: Wrote samples to %s ]\n", data.path);
>> +	else
>> +		fprintf(stderr, "[ perf sched stats: Failed !! ]\n");
>> +
>> +	close(fd);
>> +	perf_session__delete(session);
> 
> It seems session->evlist is deleted only when the data is in read mode.
> 

Ack.

>> +
>> +	return err;
>> +}
>> +
>>   static bool schedstat_events_exposed(void)
>>   {
>>   	/*
>> @@ -3846,6 +4045,12 @@ int cmd_sched(int argc, const char **argv)
>>   	OPT_BOOLEAN('P', "pre-migrations", &sched.pre_migrations, "Show pre-migration wait time"),
>>   	OPT_PARENT(sched_options)
>>   	};
>> +	const struct option stats_options[] = {
>> +	OPT_STRING('o', "output", &output_name, "file",
>> +		   "`stats record` with output filename"),
>> +	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
>> +	OPT_END()
>> +	};
>>   
>>   	const char * const latency_usage[] = {
>>   		"perf sched latency [<options>]",
>> @@ -3863,9 +4068,13 @@ int cmd_sched(int argc, const char **argv)
>>   		"perf sched timehist [<options>]",
>>   		NULL
>>   	};
>> +	const char *stats_usage[] = {
>> +		"perf sched stats {record} [<options>]",
>> +		NULL
>> +	};
>>   	const char *const sched_subcommands[] = { "record", "latency", "map",
>>   						  "replay", "script",
>> -						  "timehist", NULL };
>> +						  "timehist", "stats", NULL };
>>   	const char *sched_usage[] = {
>>   		NULL,
>>   		NULL
>> @@ -3961,6 +4170,21 @@ int cmd_sched(int argc, const char **argv)
>>   			return ret;
>>   
>>   		return perf_sched__timehist(&sched);
>> +	} else if (!strcmp(argv[0], "stats")) {
>> +		const char *const stats_subcommands[] = {"record", NULL};
>> +
>> +		argc = parse_options_subcommand(argc, argv, stats_options,
>> +						stats_subcommands,
>> +						stats_usage,
>> +						PARSE_OPT_STOP_AT_NON_OPTION);
>> +
>> +		if (argv[0] && !strcmp(argv[0], "record")) {
>> +			if (argc)
>> +				argc = parse_options(argc, argv, stats_options,
>> +						     stats_usage, 0);
>> +			return perf_sched__schedstat_record(&sched, argc, argv);
>> +		}
>> +		usage_with_options(stats_usage, stats_options);
>>   	} else {
>>   		usage_with_options(sched_usage, sched_options);
>>   	}
>> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
>> index aac96d5d1917..0f863d38abe2 100644
>> --- a/tools/perf/util/event.c
>> +++ b/tools/perf/util/event.c
>> @@ -77,6 +77,8 @@ static const char *perf_event__names[] = {
>>   	[PERF_RECORD_HEADER_FEATURE]		= "FEATURE",
>>   	[PERF_RECORD_COMPRESSED]		= "COMPRESSED",
>>   	[PERF_RECORD_FINISHED_INIT]		= "FINISHED_INIT",
>> +	[PERF_RECORD_SCHEDSTAT_CPU]		= "SCHEDSTAT_CPU",
>> +	[PERF_RECORD_SCHEDSTAT_DOMAIN]		= "SCHEDSTAT_DOMAIN",
>>   };
>>   
>>   const char *perf_event__name(unsigned int id)
>> @@ -550,6 +552,102 @@ size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *ma
>>   	return ret;
>>   }
>>   
>> +size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
>> +{
>> +	struct perf_record_schedstat_cpu *cs = &event->schedstat_cpu;
>> +	__u16 version = cs->version;
>> +	size_t size = 0;
>> +
>> +	size = fprintf(fp, "\ncpu%u ", cs->cpu);
>> +
>> +#define CPU_FIELD(_type, _name, _ver)						\
>> +	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)cs->_ver._name)
>> +
>> +	if (version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +		return size;
>> +	}
>> +#undef CPU_FIELD
>> +
>> +	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
>> +		       event->schedstat_cpu.version);
>> +}
>> +
>> +size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
>> +{
>> +	struct perf_record_schedstat_domain *ds = &event->schedstat_domain;
>> +	__u16 version = ds->version;
>> +	size_t cpu_mask_len_2;
>> +	size_t cpu_mask_len;
>> +	size_t size = 0;
>> +	char *cpu_mask;
>> +	int idx;
>> +	int i, j;
>> +	bool low;
>> +
>> +	if (ds->name[0])
>> +		size = fprintf(fp, "\ndomain%u:%s ", ds->domain, ds->name);
>> +	else
>> +		size = fprintf(fp, "\ndomain%u ", ds->domain);
>> +
>> +	cpu_mask_len = ((ds->nr_cpus + 3) >> 2);
>> +	cpu_mask_len_2 = cpu_mask_len + ((cpu_mask_len - 1) / 8);
>> +
>> +	cpu_mask = zalloc(cpu_mask_len_2 + 1);
>> +	if (!cpu_mask)
>> +		return fprintf(fp, "Cannot allocate memory for cpumask\n");
>> +
>> +	idx = ((ds->nr_cpus + 7) >> 3) - 1;
>> +
>> +	i = cpu_mask_len_2 - 1;
>> +
>> +	low = true;
>> +	j = 1;
>> +	while (i >= 0) {
>> +		__u8 m;
>> +
>> +		if (low)
>> +			m = ds->cpu_mask[idx] & 0xf;
>> +		else
>> +			m = (ds->cpu_mask[idx] & 0xf0) >> 4;
>> +
>> +		if (m >= 0 && m <= 9)
>> +			m += '0';
>> +		else if (m >= 0xa && m <= 0xf)
>> +			m = m + 'a' - 10;
>> +		else if (m >= 0xA && m <= 0xF)
>> +			m = m + 'A' - 10;
>> +
>> +		cpu_mask[i] = m;
>> +
>> +		if (j == 8 && i != 0) {
>> +			cpu_mask[i - 1] = ',';
>> +			j = 0;
>> +			i--;
>> +		}
>> +
>> +		if (!low)
>> +			idx--;
>> +		low = !low;
>> +		i--;
>> +		j++;
>> +	}
>> +	size += fprintf(fp, "%s ", cpu_mask);
>> +	free(cpu_mask);
>> +
>> +#define DOMAIN_FIELD(_type, _name, _ver)					\
>> +	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)ds->_ver._name)
>> +
>> +	if (version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +		return size;
>> +	}
>> +#undef DOMAIN_FIELD
>> +
>> +	return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
>> +		       event->schedstat_domain.version);
>> +}
>> +
>>   size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp)
>>   {
>>   	size_t ret = fprintf(fp, "PERF_RECORD_%s",
>> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
>> index 2744c54f404e..333f2405cd5a 100644
>> --- a/tools/perf/util/event.h
>> +++ b/tools/perf/util/event.h
>> @@ -361,6 +361,8 @@ size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp);
>>   size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
>>   size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
>>   size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *machine,FILE *fp);
>> +size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp);
>> +size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp);
>>   size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp);
>>   
>>   int kallsyms__get_function_start(const char *kallsyms_filename,
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index c06e3020a976..bcffee2b7239 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -692,6 +692,20 @@ static void perf_event__time_conv_swap(union perf_event *event,
>>   	}
>>   }
>>   
>> +static void
>> +perf_event__schedstat_cpu_swap(union perf_event *event __maybe_unused,
>> +			       bool sample_id_all __maybe_unused)
>> +{
>> +	/* FIXME */
>> +}
>> +
>> +static void
>> +perf_event__schedstat_domain_swap(union perf_event *event __maybe_unused,
>> +				  bool sample_id_all __maybe_unused)
>> +{
>> +	/* FIXME */
>> +}
>> +
>>   typedef void (*perf_event__swap_op)(union perf_event *event,
>>   				    bool sample_id_all);
>>   
>> @@ -730,6 +744,8 @@ static perf_event__swap_op perf_event__swap_ops[] = {
>>   	[PERF_RECORD_STAT_ROUND]	  = perf_event__stat_round_swap,
>>   	[PERF_RECORD_EVENT_UPDATE]	  = perf_event__event_update_swap,
>>   	[PERF_RECORD_TIME_CONV]		  = perf_event__time_conv_swap,
>> +	[PERF_RECORD_SCHEDSTAT_CPU]	  = perf_event__schedstat_cpu_swap,
>> +	[PERF_RECORD_SCHEDSTAT_DOMAIN]	  = perf_event__schedstat_domain_swap,
>>   	[PERF_RECORD_HEADER_MAX]	  = NULL,
>>   };
>>   
>> @@ -1455,6 +1471,10 @@ static s64 perf_session__process_user_event(struct perf_session *session,
>>   		return err;
>>   	case PERF_RECORD_FINISHED_INIT:
>>   		return tool->finished_init(session, event);
>> +	case PERF_RECORD_SCHEDSTAT_CPU:
>> +		return tool->schedstat_cpu(session, event);
>> +	case PERF_RECORD_SCHEDSTAT_DOMAIN:
>> +		return tool->schedstat_domain(session, event);
>>   	default:
>>   		return -EINVAL;
>>   	}
>> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
>> index 6923b0d5efed..f928f07bea15 100644
>> --- a/tools/perf/util/synthetic-events.c
>> +++ b/tools/perf/util/synthetic-events.c
>> @@ -2511,3 +2511,242 @@ int parse_synth_opt(char *synth)
>>   
>>   	return ret;
>>   }
>> +
>> +static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version,
>> +						    __u64 *cpu, __u64 timestamp)
>> +{
>> +	struct perf_record_schedstat_cpu *cs;
>> +	union perf_event *event;
>> +	size_t size;
>> +	char ch;
>> +
>> +	size = sizeof(struct perf_record_schedstat_cpu);
> 
> I think the kernel code prefers sizeof(*cs) instead.
> 

Sure, will change accordingly.

> 
>> +	size = PERF_ALIGN(size, sizeof(u64));
>> +	event = zalloc(size);
> 
> The size is static, do you really need a dynamic allocation?
> 

Will make the event static.

> Thanks,
> Namhyung
> 
--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/8] perf sched stats: Add schedstat v17 support
  2025-03-15  2:27   ` Namhyung Kim
@ 2025-03-17 13:32     ` Sapkal, Swapnil
  0 siblings, 0 replies; 23+ messages in thread
From: Sapkal, Swapnil @ 2025-03-17 13:32 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das

Hello Namhyung,

On 3/15/2025 7:57 AM, Namhyung Kim wrote:
> On Tue, Mar 11, 2025 at 12:02:25PM +0000, Swapnil Sapkal wrote:
>> /proc/schedstat file output is standardized with version number.
>> Add support to record and raw dump v17 version layout.
>>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> ---
>>   tools/lib/perf/Makefile                     |  2 +-
>>   tools/lib/perf/include/perf/event.h         | 14 +++++
>>   tools/lib/perf/include/perf/schedstat-v17.h | 61 +++++++++++++++++++++
>>   tools/perf/util/event.c                     |  6 ++
>>   tools/perf/util/synthetic-events.c          | 15 +++++
>>   5 files changed, 97 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/lib/perf/include/perf/schedstat-v17.h
>>
>> diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
>> index d0506a13a97f..30712ce8b6b1 100644
>> --- a/tools/lib/perf/Makefile
>> +++ b/tools/lib/perf/Makefile
>> @@ -174,7 +174,7 @@ install_lib: libs
>>   		$(call do_install_mkdir,$(libdir_SQ)); \
>>   		cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)
>>   
>> -HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h
>> +HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-v15.h schedstat-v16.h schedstat-v17.h
> 
> Please put them in a separate line like
> 
> HDRS += schedstat-v15.h schedstat-v16.h schedstat-v17.h
> 

Sure, will change accordingly.

> Thanks,
> Namhyung
> 

--
Thanks and Regards,
Swapnil




* Re: [PATCH v3 4/8] perf sched stats: Add support for report subcommand
  2025-03-15  4:39   ` Namhyung Kim
@ 2025-03-18 11:08     ` Sapkal, Swapnil
  0 siblings, 0 replies; 23+ messages in thread
From: Sapkal, Swapnil @ 2025-03-18 11:08 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

Hello Namhyung,

On 3/15/2025 10:09 AM, Namhyung Kim wrote:
> On Tue, Mar 11, 2025 at 12:02:26PM +0000, Swapnil Sapkal wrote:
>> `perf sched stats record` captures two sets of samples. For a workload
>> profile, the first set is taken right before the workload starts and the
>> second after the workload finishes. For the systemwide profile, the first
>> set is taken at the beginning of the profile and the second on receiving
>> SIGINT.
>>
>> Add a `perf sched stats report` subcommand that will read both sets
>> of samples, compute the diff and render a final report. The final report
>> prints scheduler stats at cpu granularity as well as at sched domain
>> granularity.
>>
>> Example usage:
>>
>>    # perf sched stats record
>>    # perf sched stats report
>>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Tested-by: James Clark <james.clark@linaro.org>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> ---
>>   tools/lib/perf/include/perf/event.h         |  12 +-
>>   tools/lib/perf/include/perf/schedstat-v15.h | 180 +++++--
>>   tools/lib/perf/include/perf/schedstat-v16.h | 182 +++++--
>>   tools/lib/perf/include/perf/schedstat-v17.h | 209 +++++---
>>   tools/perf/builtin-sched.c                  | 504 +++++++++++++++++++-
>>   tools/perf/util/event.c                     |   4 +-
>>   tools/perf/util/synthetic-events.c          |   4 +-
>>   7 files changed, 938 insertions(+), 157 deletions(-)
>>
>> diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
>> index 0d1983ad9a41..5e2c56c9b038 100644
>> --- a/tools/lib/perf/include/perf/event.h
>> +++ b/tools/lib/perf/include/perf/event.h
>> @@ -458,19 +458,19 @@ struct perf_record_compressed {
>>   };
>>   
>>   struct perf_record_schedstat_cpu_v15 {
>> -#define CPU_FIELD(_type, _name, _ver)		_type _name
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
>>   #include "schedstat-v15.h"
>>   #undef CPU_FIELD
>>   };
>>   
>>   struct perf_record_schedstat_cpu_v16 {
>> -#define CPU_FIELD(_type, _name, _ver)		_type _name
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
>>   #include "schedstat-v16.h"
>>   #undef CPU_FIELD
>>   };
>>   
>>   struct perf_record_schedstat_cpu_v17 {
>> -#define CPU_FIELD(_type, _name, _ver)		_type _name
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		_type _name
>>   #include "schedstat-v17.h"
>>   #undef CPU_FIELD
>>   };
>> @@ -488,19 +488,19 @@ struct perf_record_schedstat_cpu {
>>   };
>>   
>>   struct perf_record_schedstat_domain_v15 {
>> -#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
>>   #include "schedstat-v15.h"
>>   #undef DOMAIN_FIELD
>>   };
>>   
>>   struct perf_record_schedstat_domain_v16 {
>> -#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
>>   #include "schedstat-v16.h"
>>   #undef DOMAIN_FIELD
>>   };
>>   
>>   struct perf_record_schedstat_domain_v17 {
>> -#define DOMAIN_FIELD(_type, _name, _ver)	_type _name
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		_type _name
>>   #include "schedstat-v17.h"
>>   #undef DOMAIN_FIELD
>>   };
>> diff --git a/tools/lib/perf/include/perf/schedstat-v15.h b/tools/lib/perf/include/perf/schedstat-v15.h
>> index 43f8060c5337..011411ac0f7e 100644
>> --- a/tools/lib/perf/include/perf/schedstat-v15.h
>> +++ b/tools/lib/perf/include/perf/schedstat-v15.h
>> @@ -1,52 +1,142 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>   
>>   #ifdef CPU_FIELD
>> -CPU_FIELD(__u32, yld_count, v15);
>> -CPU_FIELD(__u32, array_exp, v15);
>> -CPU_FIELD(__u32, sched_count, v15);
>> -CPU_FIELD(__u32, sched_goidle, v15);
>> -CPU_FIELD(__u32, ttwu_count, v15);
>> -CPU_FIELD(__u32, ttwu_local, v15);
>> -CPU_FIELD(__u64, rq_cpu_time, v15);
>> -CPU_FIELD(__u64, run_delay, v15);
>> -CPU_FIELD(__u64, pcount, v15);
>> +CPU_FIELD(__u32, yld_count, "sched_yield() count",
>> +	  "%11u", false, yld_count, v15);
>> +CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
>> +	  "%11u", false, array_exp, v15);
>> +CPU_FIELD(__u32, sched_count, "schedule() called",
>> +	  "%11u", false, sched_count, v15);
>> +CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
>> +	  "%11u", true, sched_count, v15);
>> +CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
>> +	  "%11u", false, ttwu_count, v15);
>> +CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
>> +	  "%11u", true, ttwu_count, v15);
>> +CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
>> +	  "%11llu", false, rq_cpu_time, v15);
>> +CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
>> +	  "%11llu", true, rq_cpu_time, v15);
>> +CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
>> +	  "%11llu", false, pcount, v15);
>>   #endif
>>   
>>   #ifdef DOMAIN_FIELD
>> -DOMAIN_FIELD(__u32, idle_lb_count, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_balanced, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_failed, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_imbalance, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_gained, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_hot_gained, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_nobusyq, v15);
>> -DOMAIN_FIELD(__u32, idle_lb_nobusyg, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_count, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_balanced, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_failed, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_imbalance, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_gained, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_hot_gained, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_nobusyq, v15);
>> -DOMAIN_FIELD(__u32, busy_lb_nobusyg, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_count, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_balanced, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_failed, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_imbalance, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_gained, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v15);
>> -DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v15);
>> -DOMAIN_FIELD(__u32, alb_count, v15);
>> -DOMAIN_FIELD(__u32, alb_failed, v15);
>> -DOMAIN_FIELD(__u32, alb_pushed, v15);
>> -DOMAIN_FIELD(__u32, sbe_count, v15);
>> -DOMAIN_FIELD(__u32, sbe_balanced, v15);
>> -DOMAIN_FIELD(__u32, sbe_pushed, v15);
>> -DOMAIN_FIELD(__u32, sbf_count, v15);
>> -DOMAIN_FIELD(__u32, sbf_balanced, v15);
>> -DOMAIN_FIELD(__u32, sbf_pushed, v15);
>> -DOMAIN_FIELD(__u32, ttwu_wake_remote, v15);
>> -DOMAIN_FIELD(__u32, ttwu_move_affine, v15);
>> -DOMAIN_FIELD(__u32, ttwu_move_balance, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category idle> ");
>>   #endif
>> +DOMAIN_FIELD(__u32, idle_lb_count,
>> +	     "load_balance() count on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_balanced,
>> +	     "load_balance() found balanced on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_failed,
>> +	     "load_balance() move task failed on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance,
>> +	     "imbalance sum on cpu idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_gained,
>> +	     "pull_task() count on cpu idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v15);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v15);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v15);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category busy> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, busy_lb_count,
>> +	     "load_balance() count on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_balanced,
>> +	     "load_balance() found balanced on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_failed,
>> +	     "load_balance() move task failed on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance,
>> +	     "imbalance sum on cpu busy", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_gained,
>> +	     "pull_task() count on cpu busy", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v15);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v15);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v15);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category newidle> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, newidle_lb_count,
>> +	     "load_balance() count on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
>> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_failed,
>> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance,
>> +	     "imbalance sum on cpu newly idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_gained,
>> +	     "pull_task() count on cpu newly idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v15);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v15);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v15);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, alb_count,
>> +	     "active_load_balance() count", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, alb_failed,
>> +	     "active_load_balance() move task failed", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, alb_pushed,
>> +	     "active_load_balance() successfully moved a task", "%11u", false, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbe_count,
>> +	     "sbe_count is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbe_balanced,
>> +	     "sbe_balanced is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbe_pushed,
>> +	     "sbe_pushed is not used", "%11u", false, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbf_count,
>> +	     "sbf_count is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbf_balanced,
>> +	     "sbf_balanced is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbf_pushed,
>> +	     "sbf_pushed is not used", "%11u", false, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Wakeup Info> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
>> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, ttwu_move_affine,
>> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, ttwu_move_balance,
>> +	     "try_to_wake_up() started passive balancing", "%11u", false, v15);
>> +#endif /* DOMAIN_FIELD */
>> diff --git a/tools/lib/perf/include/perf/schedstat-v16.h b/tools/lib/perf/include/perf/schedstat-v16.h
>> index d6a4691b2fd5..5ba53bd7d61a 100644
>> --- a/tools/lib/perf/include/perf/schedstat-v16.h
>> +++ b/tools/lib/perf/include/perf/schedstat-v16.h
>> @@ -1,52 +1,142 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>   
>>   #ifdef CPU_FIELD
>> -CPU_FIELD(__u32, yld_count, v16);
>> -CPU_FIELD(__u32, array_exp, v16);
>> -CPU_FIELD(__u32, sched_count, v16);
>> -CPU_FIELD(__u32, sched_goidle, v16);
>> -CPU_FIELD(__u32, ttwu_count, v16);
>> -CPU_FIELD(__u32, ttwu_local, v16);
>> -CPU_FIELD(__u64, rq_cpu_time, v16);
>> -CPU_FIELD(__u64, run_delay, v16);
>> -CPU_FIELD(__u64, pcount, v16);
>> -#endif
>> +CPU_FIELD(__u32, yld_count, "sched_yield() count",
>> +	  "%11u", false, yld_count, v16);
>> +CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
>> +	  "%11u", false, array_exp, v16);
>> +CPU_FIELD(__u32, sched_count, "schedule() called",
>> +	  "%11u", false, sched_count, v16);
>> +CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
>> +	  "%11u", true, sched_count, v16);
>> +CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
>> +	  "%11u", false, ttwu_count, v16);
>> +CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
>> +	  "%11u", true, ttwu_count, v16);
>> +CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
>> +	  "%11llu", false, rq_cpu_time, v16);
>> +CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
>> +	  "%11llu", true, rq_cpu_time, v16);
>> +CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
>> +	  "%11llu", false, pcount, v16);
>> +#endif /* CPU_FIELD */
>>   
>>   #ifdef DOMAIN_FIELD
>> -DOMAIN_FIELD(__u32, busy_lb_count, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_balanced, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_failed, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_imbalance, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_gained, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_hot_gained, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_nobusyq, v16);
>> -DOMAIN_FIELD(__u32, busy_lb_nobusyg, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_count, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_balanced, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_failed, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_imbalance, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_gained, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_hot_gained, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_nobusyq, v16);
>> -DOMAIN_FIELD(__u32, idle_lb_nobusyg, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_count, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_balanced, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_failed, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_imbalance, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_gained, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v16);
>> -DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v16);
>> -DOMAIN_FIELD(__u32, alb_count, v16);
>> -DOMAIN_FIELD(__u32, alb_failed, v16);
>> -DOMAIN_FIELD(__u32, alb_pushed, v16);
>> -DOMAIN_FIELD(__u32, sbe_count, v16);
>> -DOMAIN_FIELD(__u32, sbe_balanced, v16);
>> -DOMAIN_FIELD(__u32, sbe_pushed, v16);
>> -DOMAIN_FIELD(__u32, sbf_count, v16);
>> -DOMAIN_FIELD(__u32, sbf_balanced, v16);
>> -DOMAIN_FIELD(__u32, sbf_pushed, v16);
>> -DOMAIN_FIELD(__u32, ttwu_wake_remote, v16);
>> -DOMAIN_FIELD(__u32, ttwu_move_affine, v16);
>> -DOMAIN_FIELD(__u32, ttwu_move_balance, v16);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category busy> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, busy_lb_count,
>> +	     "load_balance() count on cpu busy", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_balanced,
>> +	     "load_balance() found balanced on cpu busy", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_failed,
>> +	     "load_balance() move task failed on cpu busy", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance,
>> +	     "imbalance sum on cpu busy", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_gained,
>> +	     "pull_task() count on cpu busy", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v16);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v16);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v16);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category idle> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, idle_lb_count,
>> +	     "load_balance() count on cpu idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_balanced,
>> +	     "load_balance() found balanced on cpu idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_failed,
>> +	     "load_balance() move task failed on cpu idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance,
>> +	     "imbalance sum on cpu idle", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_gained,
>> +	     "pull_task() count on cpu idle", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v16);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v16);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v16);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category newidle> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, newidle_lb_count,
>> +	     "load_balance() count on cpu newly idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
>> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_failed,
>> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance,
>> +	     "imbalance sum on cpu newly idle", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_gained,
>> +	     "pull_task() count on cpu newly idle", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v16);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v16);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v16);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v16);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, alb_count,
>> +	     "active_load_balance() count", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, alb_failed,
>> +	     "active_load_balance() move task failed", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, alb_pushed,
>> +	     "active_load_balance() successfully moved a task", "%11u", false, v16);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbe_count,
>> +	     "sbe_count is not used", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, sbe_balanced,
>> +	     "sbe_balanced is not used", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, sbe_pushed,
>> +	     "sbe_pushed is not used", "%11u", false, v16);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbf_count,
>> +	     "sbf_count is not used", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, sbf_balanced,
>> +	     "sbf_balanced is not used", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, sbf_pushed,
>> +	     "sbf_pushed is not used", "%11u", false, v16);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Wakeup Info> ");
>>   #endif
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
>> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, ttwu_move_affine,
>> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v16);
>> +DOMAIN_FIELD(__u32, ttwu_move_balance,
>> +	     "try_to_wake_up() started passive balancing", "%11u", false, v16);
>> +#endif /* DOMAIN_FIELD */
>> diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
>> index 851d4f1f4ecb..00009bd5f006 100644
>> --- a/tools/lib/perf/include/perf/schedstat-v17.h
>> +++ b/tools/lib/perf/include/perf/schedstat-v17.h
>> @@ -1,61 +1,160 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>   
>>   #ifdef CPU_FIELD
>> -CPU_FIELD(__u32, yld_count, v17);
>> -CPU_FIELD(__u32, array_exp, v17);
>> -CPU_FIELD(__u32, sched_count, v17);
>> -CPU_FIELD(__u32, sched_goidle, v17);
>> -CPU_FIELD(__u32, ttwu_count, v17);
>> -CPU_FIELD(__u32, ttwu_local, v17);
>> -CPU_FIELD(__u64, rq_cpu_time, v17);
>> -CPU_FIELD(__u64, run_delay, v17);
>> -CPU_FIELD(__u64, pcount, v17);
>> -#endif
>> +CPU_FIELD(__u32, yld_count, "sched_yield() count",
>> +	  "%11u", false, yld_count, v17);
>> +CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
>> +	  "%11u", false, array_exp, v17);
>> +CPU_FIELD(__u32, sched_count, "schedule() called",
>> +	  "%11u", false, sched_count, v17);
>> +CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
>> +	  "%11u", true, sched_count, v17);
>> +CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
>> +	  "%11u", false, ttwu_count, v17);
>> +CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
>> +	  "%11u", true, ttwu_count, v17);
>> +CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
>> +	  "%11llu", false, rq_cpu_time, v17);
>> +CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
>> +	  "%11llu", true, rq_cpu_time, v17);
>> +CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
>> +	  "%11llu", false, pcount, v17);
>> +#endif /* CPU_FIELD */
>>   
>>   #ifdef DOMAIN_FIELD
>> -DOMAIN_FIELD(__u32, busy_lb_count, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_balanced, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_failed, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_imbalance_load, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_imbalance_util, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_imbalance_task, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_gained, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_hot_gained, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_nobusyq, v17);
>> -DOMAIN_FIELD(__u32, busy_lb_nobusyg, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_count, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_balanced, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_failed, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_imbalance_load, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_imbalance_util, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_imbalance_task, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_gained, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_hot_gained, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_nobusyq, v17);
>> -DOMAIN_FIELD(__u32, idle_lb_nobusyg, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_count, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_balanced, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_failed, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_load, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_util, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_task, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_gained, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_hot_gained, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_nobusyq, v17);
>> -DOMAIN_FIELD(__u32, newidle_lb_nobusyg, v17);
>> -DOMAIN_FIELD(__u32, alb_count, v17);
>> -DOMAIN_FIELD(__u32, alb_failed, v17);
>> -DOMAIN_FIELD(__u32, alb_pushed, v17);
>> -DOMAIN_FIELD(__u32, sbe_count, v17);
>> -DOMAIN_FIELD(__u32, sbe_balanced, v17);
>> -DOMAIN_FIELD(__u32, sbe_pushed, v17);
>> -DOMAIN_FIELD(__u32, sbf_count, v17);
>> -DOMAIN_FIELD(__u32, sbf_balanced, v17);
>> -DOMAIN_FIELD(__u32, sbf_pushed, v17);
>> -DOMAIN_FIELD(__u32, ttwu_wake_remote, v17);
>> -DOMAIN_FIELD(__u32, ttwu_move_affine, v17);
>> -DOMAIN_FIELD(__u32, ttwu_move_balance, v17);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category busy> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, busy_lb_count,
>> +	     "load_balance() count on cpu busy", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_balanced,
>> +	     "load_balance() found balanced on cpu busy", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_failed,
>> +	     "load_balance() move task failed on cpu busy", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance_load,
>> +	     "imbalance in load on cpu busy", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance_util,
>> +	     "imbalance in utilization on cpu busy", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance_task,
>> +	     "imbalance in number of tasks on cpu busy", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance_misfit,
>> +	     "imbalance in misfit tasks on cpu busy", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_gained,
>> +	     "pull_task() count on cpu busy", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v17);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v17);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v17);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category idle> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, idle_lb_count,
>> +	     "load_balance() count on cpu idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_balanced,
>> +	     "load_balance() found balanced on cpu idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_failed,
>> +	     "load_balance() move task failed on cpu idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance_load,
>> +	     "imbalance in load on cpu idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance_util,
>> +	     "imbalance in utilization on cpu idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance_task,
>> +	     "imbalance in number of tasks on cpu idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance_misfit,
>> +	     "imbalance in misfit tasks on cpu idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_gained,
>> +	     "pull_task() count on cpu idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v17);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v17);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v17);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category newidle> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, newidle_lb_count,
>> +	     "load_balance() count on cpu newly idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
>> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_failed,
>> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_load,
>> +	     "imbalance in load on cpu newly idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_util,
>> +	     "imbalance in utilization on cpu newly idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_task,
>> +	     "imbalance in number of tasks on cpu newly idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance_misfit,
>> +	     "imbalance in misfit tasks on cpu newly idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_gained,
>> +	     "pull_task() count on cpu newly idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v17);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v17);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v17);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v17);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, alb_count,
>> +	     "active_load_balance() count", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, alb_failed,
>> +	     "active_load_balance() move task failed", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, alb_pushed,
>> +	     "active_load_balance() successfully moved a task", "%11u", false, v17);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbe_count,
>> +	     "sbe_count is not used", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, sbe_balanced,
>> +	     "sbe_balanced is not used", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, sbe_pushed,
>> +	     "sbe_pushed is not used", "%11u", false, v17);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbf_count,
>> +	     "sbf_count is not used", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, sbf_balanced,
>> +	     "sbf_balanced is not used", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, sbf_pushed,
>> +	     "sbf_pushed is not used", "%11u", false, v17);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Wakeup Info> ");
>>   #endif
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
>> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, ttwu_move_affine,
>> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v17);
>> +DOMAIN_FIELD(__u32, ttwu_move_balance,
>> +	     "try_to_wake_up() started passive balancing", "%11u", false, v17);
>> +#endif /* DOMAIN_FIELD */
> 
> Probably better to put in the previous commits.
> 

Sure, I can do it. The reason I did it this way is that these new
field values are unused in the previous patches. But I don't have a
strong opinion, so I can change it as well.

> 
>> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
>> index 1c3b56013164..e2e7dbc4f0aa 100644
>> --- a/tools/perf/builtin-sched.c
>> +++ b/tools/perf/builtin-sched.c
>> @@ -3869,6 +3869,501 @@ static int perf_sched__schedstat_record(struct perf_sched *sched,
>>   	return err;
>>   }
>>   
>> +struct schedstat_domain {
>> +	struct perf_record_schedstat_domain *domain_data;
>> +	struct schedstat_domain *next;
>> +};
>> +
>> +struct schedstat_cpu {
>> +	struct perf_record_schedstat_cpu *cpu_data;
>> +	struct schedstat_domain *domain_head;
>> +	struct schedstat_cpu *next;
>> +};
>> +
>> +struct schedstat_cpu *cpu_head = NULL, *cpu_tail = NULL, *cpu_second_pass = NULL;
>> +struct schedstat_domain *domain_tail = NULL, *domain_second_pass = NULL;
> 
> No need to reset to NULL.  Also please add some comments how those
> structs and lists are used.
> 

Ack.

> 
>> +bool after_workload_flag;
>> +
>> +static void store_schedtstat_cpu_diff(struct schedstat_cpu *after_workload)
>> +{
>> +	struct perf_record_schedstat_cpu *before = cpu_second_pass->cpu_data;
>> +	struct perf_record_schedstat_cpu *after = after_workload->cpu_data;
>> +	__u16 version = after_workload->cpu_data->version;
>> +
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
>> +	(before->_ver._name = after->_ver._name - before->_ver._name)
>> +
>> +	if (version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +	} else if (version == 16) {
>> +#include <perf/schedstat-v16.h>
>> +	} else if (version == 17) {
>> +#include <perf/schedstat-v17.h>
>> +	}
>> +
>> +#undef CPU_FIELD
>> +}
>> +
>> +static void store_schedstat_domain_diff(struct schedstat_domain *after_workload)
>> +{
>> +	struct perf_record_schedstat_domain *before = domain_second_pass->domain_data;
>> +	struct perf_record_schedstat_domain *after = after_workload->domain_data;
>> +	__u16 version = after_workload->domain_data->version;
>> +
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
>> +	(before->_ver._name = after->_ver._name - before->_ver._name)
>> +
>> +	if (version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +	} else if (version == 16) {
>> +#include <perf/schedstat-v16.h>
>> +	} else if (version == 17) {
>> +#include <perf/schedstat-v17.h>
>> +	}
>> +#undef DOMAIN_FIELD
>> +}
>> +
>> +static void print_separator(size_t pre_dash_cnt, const char *s, size_t post_dash_cnt)
>> +{
>> +	size_t i;
>> +
>> +	for (i = 0; i < pre_dash_cnt; ++i)
>> +		printf("-");
>> +
>> +	printf("%s", s);
>> +
>> +	for (i = 0; i < post_dash_cnt; ++i)
>> +		printf("-");
>> +
>> +	printf("\n");
> 
> This can be simplified:
> 
> 	printf("%.*s%s%.*s\n", pre_dash_cnt, graph_dotted_line, s,
> 		post_dash_cnt, graph_dotted_line);
> 

This is better. Will change it.
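For reference, the suggested one-liner can be sketched as a standalone
program. `graph_dotted_line` is a perf-internal string of dashes;
`dashes` below is a stand-in for it, and `format_separator()` is a
hypothetical buffer-based variant added only to make the behavior easy
to check:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Stand-in for perf's graph_dotted_line (a long run of dashes). */
static const char dashes[] =
	"--------------------------------------------------"
	"--------------------------------------------------";

/*
 * One printf with "%.*s" field-precision slices replaces the two
 * character-printing loops: each "%.*s" emits at most 'pre'/'post'
 * characters of the dash string.
 */
static void print_separator(int pre, const char *s, int post)
{
	printf("%.*s%s%.*s\n", pre, dashes, s, post, dashes);
}

/* Same formatting into a buffer (hypothetical helper, for testing). */
static int format_separator(char *buf, size_t sz, int pre,
			    const char *s, int post)
{
	return snprintf(buf, sz, "%.*s%s%.*s", pre, dashes, s, post, dashes);
}
```

Note that `"%.*s"` takes the precision as an `int` argument, so the
dash counts no longer need a loop at all.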

>> +}
>> +
>> +static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
>> +{
>> +	printf("%-65s %12s %12s\n", "DESC", "COUNT", "PCT_CHANGE");
>> +	print_separator(100, "", 0);
> 
> 	printf("%.*s\n", 100, graph_dotted_line);
> 
> You can define a macro for the length (100) as it's used in other places
> too.
> 

Will do this.

>> +
>> +#define CALC_PCT(_x, _y)	((_y) ? ((double)(_x) / (_y)) * 100 : 0.0)
>> +
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
>> +	do {									\
>> +		printf("%-65s: " _format, _desc, cs->_ver._name);		\
>> +		if (_is_pct) {							\
>> +			printf("  ( %8.2lf%% )",				\
>> +			       CALC_PCT(cs->_ver._name, cs->_ver._pct_of));	\
>> +		}								\
>> +		printf("\n");							\
>> +	} while (0)
>> +
>> +	if (cs->version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +	} else if (cs->version == 16) {
>> +#include <perf/schedstat-v16.h>
>> +	} else if (cs->version == 17) {
>> +#include <perf/schedstat-v17.h>
>> +	}
>> +
>> +#undef CPU_FIELD
>> +#undef CALC_PCT
>> +}
>> +
>> +static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
>> +				      __u64 jiffies)
>> +{
>> +	printf("%-65s %12s %14s\n", "DESC", "COUNT", "AVG_JIFFIES");
>> +
>> +#define DOMAIN_CATEGORY(_desc)							\
>> +	do {									\
>> +		size_t _len = strlen(_desc);					\
>> +		size_t _pre_dash_cnt = (100 - _len) / 2;			\
>> +		size_t _post_dash_cnt = 100 - _len - _pre_dash_cnt;		\
>> +		print_separator(_pre_dash_cnt, _desc, _post_dash_cnt);		\
> 
> This can be useful in other places, can you please factor it out as a
> function somewhere in util.c?
> 

Will do this.

> 
>> +	} while (0)
>> +
>> +#define CALC_AVG(_x, _y)	((_y) ? (long double)(_x) / (_y) : 0.0)
>> +
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
>> +	do {									\
>> +		printf("%-65s: " _format, _desc, ds->_ver._name);		\
>> +		if (_is_jiffies) {						\
>> +			printf("  $ %11.2Lf $",					\
>> +			       CALC_AVG(jiffies, ds->_ver._name));		\
>> +		}								\
>> +		printf("\n");							\
>> +	} while (0)
>> +
>> +#define DERIVED_CNT_FIELD(_desc, _format, _x, _y, _z, _ver)			\
>> +	printf("*%-64s: " _format "\n", _desc,					\
>> +	       (ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))
>> +
>> +#define DERIVED_AVG_FIELD(_desc, _format, _x, _y, _z, _w, _ver)			\
>> +	printf("*%-64s: " _format "\n", _desc, CALC_AVG(ds->_ver._w,		\
>> +	       ((ds->_ver._x) - (ds->_ver._y) - (ds->_ver._z))))
>> +
>> +	if (ds->version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +	} else if (ds->version == 16) {
>> +#include <perf/schedstat-v16.h>
>> +	} else if (ds->version == 17) {
>> +#include <perf/schedstat-v17.h>
>> +	}
>> +
>> +#undef DERIVED_AVG_FIELD
>> +#undef DERIVED_CNT_FIELD
>> +#undef DOMAIN_FIELD
>> +#undef CALC_AVG
>> +#undef DOMAIN_CATEGORY
>> +}
>> +
>> +static void print_domain_cpu_list(struct perf_record_schedstat_domain *ds)
>> +{
>> +	char bin[16][5] = {"0000", "0001", "0010", "0011",
>> +			   "0100", "0101", "0110", "0111",
>> +			   "1000", "1001", "1010", "1011",
>> +			   "1100", "1101", "1110", "1111"};
>> +	bool print_flag = false, low = true;
>> +	int cpu = 0, start, end, idx;
>> +
>> +	idx = ((ds->nr_cpus + 7) >> 3) - 1;
>> +
>> +	printf("<");
>> +	while (idx >= 0) {
>> +		__u8 index;
>> +
>> +		if (low)
>> +			index = ds->cpu_mask[idx] & 0xf;
>> +		else
>> +			index = (ds->cpu_mask[idx--] & 0xf0) >> 4;
> 
> Isn't ds->cpu_mask a bitmap?  Can we use bitmap_scnprintf() or
> something?
>

Yes, ds->cpu_mask is a bitmap. I will use bitmap_scnprintf().
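The range-style output such a bitmap helper produces can be sketched
standalone. `cpumask_ranges()` below is a hypothetical illustration,
not perf's real helper: it assumes CPU 0 lives in bit 0 of byte 0,
whereas the patch stores the mask in reversed byte order, and the
actual tools/ bitmap API may differ:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/*
 * Format a byte-array cpu mask as "0-3,5"-style ranges: scan one bit
 * past the end so a run that reaches the last CPU is still flushed.
 */
static int cpumask_ranges(const uint8_t *mask, int nr_cpus,
			  char *buf, int sz)
{
	int len = 0, start = -1;

	for (int cpu = 0; cpu <= nr_cpus; cpu++) {
		int set = cpu < nr_cpus &&
			  ((mask[cpu >> 3] >> (cpu & 7)) & 1);

		if (set && start < 0) {
			start = cpu;	/* open a new run */
		} else if (!set && start >= 0) {
			int end = cpu - 1;

			if (len)
				len += snprintf(buf + len, sz - len, ",");
			if (start == end)
				len += snprintf(buf + len, sz - len,
						"%d", start);
			else
				len += snprintf(buf + len, sz - len,
						"%d-%d", start, end);
			start = -1;	/* close the run */
		}
	}
	return len;
}
```

Compared with the nibble-table version in the patch, this also avoids
the trailing `", "` that had to be erased with `\b\b`.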
  
>> +
>> +		for (int i = 3; i >= 0; i--) {
>> +			if (!print_flag && bin[index][i] == '1') {
>> +				start = cpu;
>> +				print_flag = true;
>> +			} else if (print_flag && bin[index][i] == '0') {
>> +				end = cpu - 1;
>> +				print_flag = false;
>> +				if (start == end)
>> +					printf("%d, ", start);
>> +				else
>> +					printf("%d-%d, ", start, end);
>> +			}
>> +			cpu++;
>> +		}
>> +
>> +		low = !low;
>> +	}
>> +
>> +	if (print_flag) {
>> +		if (start == cpu-1)
>> +			printf("%d, ", start);
>> +		else
>> +			printf("%d-%d, ", start, cpu-1);
>> +	}
>> +	printf("\b\b>\n");
>> +}
>> +
>> +static void summarize_schedstat_cpu(struct schedstat_cpu *summary_cpu,
>> +				    struct schedstat_cpu *cptr,
>> +				    int cnt, bool is_last)
>> +{
>> +	struct perf_record_schedstat_cpu *summary_cs = summary_cpu->cpu_data,
>> +					 *temp_cs = cptr->cpu_data;
>> +
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
>> +	do {									\
>> +		summary_cs->_ver._name += temp_cs->_ver._name;			\
>> +		if (is_last)							\
>> +			summary_cs->_ver._name /= cnt;				\
>> +	} while (0)
>> +
>> +	if (cptr->cpu_data->version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +	} else if (cptr->cpu_data->version == 16) {
>> +#include <perf/schedstat-v16.h>
>> +	} else if (cptr->cpu_data->version == 17) {
>> +#include <perf/schedstat-v17.h>
>> +	}
>> +#undef CPU_FIELD
>> +}
>> +
>> +static void summarize_schedstat_domain(struct schedstat_domain *summary_domain,
>> +				       struct schedstat_domain *dptr,
>> +				       int cnt, bool is_last)
>> +{
>> +	struct perf_record_schedstat_domain *summary_ds = summary_domain->domain_data,
>> +					    *temp_ds = dptr->domain_data;
>> +
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
>> +	do {									\
>> +		summary_ds->_ver._name += temp_ds->_ver._name;			\
>> +		if (is_last)							\
>> +			summary_ds->_ver._name /= cnt;				\
>> +	} while (0)
>> +
>> +	if (dptr->domain_data->version == 15) {
>> +#include <perf/schedstat-v15.h>
>> +	} else if (dptr->domain_data->version == 16) {
>> +#include <perf/schedstat-v16.h>
>> +	} else if (dptr->domain_data->version == 17) {
>> +#include <perf/schedstat-v17.h>
>> +	}
>> +#undef DOMAIN_FIELD
>> +}
>> +
>> +static void get_all_cpu_stats(struct schedstat_cpu **cptr)
>> +{
>> +	struct schedstat_domain *dptr = NULL, *tdptr = NULL, *dtail = NULL;
>> +	struct schedstat_cpu *tcptr = *cptr, *summary_head = NULL;
>> +	struct perf_record_schedstat_domain *ds = NULL;
>> +	struct perf_record_schedstat_cpu *cs = NULL;
>> +	bool is_last = false;
>> +	int cnt = 0;
>> +
>> +	if (tcptr) {
>> +		summary_head = zalloc(sizeof(*summary_head));
>> +		summary_head->cpu_data = zalloc(sizeof(*cs));
> 
> No error handlings.
> 

Will handle this in next version.

> 
>> +		memcpy(summary_head->cpu_data, tcptr->cpu_data, sizeof(*cs));
>> +		summary_head->next = NULL;
>> +		summary_head->domain_head = NULL;
>> +		dptr = tcptr->domain_head;
>> +
>> +		while (dptr) {
>> +			size_t cpu_mask_size = (dptr->domain_data->nr_cpus + 7) >> 3;
>> +
>> +			tdptr = zalloc(sizeof(*tdptr));
>> +			tdptr->domain_data = zalloc(sizeof(*ds) + cpu_mask_size);
> 
> Ditto.
> 

Ack.

> 
>> +			memcpy(tdptr->domain_data, dptr->domain_data, sizeof(*ds) + cpu_mask_size);
>> +
>> +			tdptr->next = NULL;
>> +			if (!dtail) {
>> +				summary_head->domain_head = tdptr;
>> +				dtail = tdptr;
>> +			} else {
>> +				dtail->next = tdptr;
>> +				dtail = dtail->next;
>> +			}
>> +			dptr = dptr->next;
> 
> Hmm.. can we just use list_head?
> 

I will switch to list_head.
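A minimal sketch of what the list_head conversion could look like. The
list primitives below are a stripped-down stand-in for
tools/include/linux/list.h (perf code would include that header rather
than redefine them), and the payload fields are placeholders for the
`perf_record_*` pointers:

```c
#include <stddef.h>
#include <assert.h>

/* Stand-ins for the kernel's intrusive list API. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = list_entry(pos->member.next, __typeof__(*pos), member))

/* The hand-rolled ->next chains become embedded list nodes: */
struct schedstat_domain {
	int stat;			/* placeholder for *domain_data */
	struct list_head node;		/* replaces ->next */
};

struct schedstat_cpu {
	int cpu;			/* placeholder for *cpu_data */
	struct list_head domains;	/* replaces ->domain_head */
	struct list_head node;		/* replaces ->next */
};
```

With embedded nodes, the separate `*_tail` globals disappear:
`list_add_tail()` keeps the tail implicitly, and the paired
first/second-pass walks become two `list_for_each_entry()` loops.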

> 
>> +		}
>> +	}
>> +
>> +	tcptr = (*cptr)->next;
>> +	while (tcptr) {
>> +		if (!tcptr->next)
>> +			is_last = true;
>> +
>> +		cnt++;
>> +		summarize_schedstat_cpu(summary_head, tcptr, cnt, is_last);
>> +		tdptr = summary_head->domain_head;
>> +		dptr = tcptr->domain_head;
>> +
>> +		while (tdptr) {
>> +			summarize_schedstat_domain(tdptr, dptr, cnt, is_last);
>> +			tdptr = tdptr->next;
>> +			dptr = dptr->next;
>> +		}
>> +		tcptr = tcptr->next;
>> +	}
>> +
>> +	tcptr = *cptr;
>> +	summary_head->next = tcptr;
>> +	*cptr = summary_head;
>> +}
>> +
>> +/* FIXME: The code fails (segfaults) when one or more cpus are offline. */
> 
> Sounds scary..  Do you have any clue?
> 

It is a stale comment. I have handled CPUs going online/offline
properly, so I will remove this comment.

> 
>> +static void show_schedstat_data(struct schedstat_cpu *cptr)
>> +{
>> +	struct perf_record_schedstat_domain *ds = NULL;
>> +	struct perf_record_schedstat_cpu *cs = NULL;
>> +	__u64 jiffies = cptr->cpu_data->timestamp;
>> +	struct schedstat_domain *dptr = NULL;
>> +	bool is_summary = true;
>> +
>> +	printf("Columns description\n");
>> +	print_separator(100, "", 0);
>> +	printf("DESC\t\t\t-> Description of the field\n");
>> +	printf("COUNT\t\t\t-> Value of the field\n");
>> +	printf("PCT_CHANGE\t\t-> Percent change with corresponding base value\n");
>> +	printf("AVG_JIFFIES\t\t-> Avg time in jiffies between two consecutive occurrence of event\n");
>> +
>> +	print_separator(100, "", 0);
>> +	printf("Time elapsed (in jiffies)                                        : %11llu\n",
> 
> Probably better to use printf("%-*s: %11llu\n", ...).
> 

Ack.
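The suggested `%-*s` form keeps the column width in one place instead
of baking it into each format string. A standalone sketch, where
`DESC_WIDTH` and `fmt_count_line()` are hypothetical names:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

#define DESC_WIDTH 65	/* one definition for the description column */

/*
 * "%-*s" left-justifies the string into a field whose width is taken
 * from the next int argument, so every line shares DESC_WIDTH.
 */
static int fmt_count_line(char *buf, size_t sz, const char *desc,
			  unsigned long long val)
{
	return snprintf(buf, sz, "%-*s: %11llu", DESC_WIDTH, desc, val);
}
```

Each formatted line is then exactly `DESC_WIDTH + 2 + 11` characters
wide, regardless of how long the description is.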

> 
>> +	       jiffies);
>> +	print_separator(100, "", 0);
>> +
>> +	get_all_cpu_stats(&cptr);
>> +
>> +	while (cptr) {
>> +		cs = cptr->cpu_data;
>> +		printf("\n");
>> +		print_separator(100, "", 0);
>> +		if (is_summary)
>> +			printf("CPU <ALL CPUS SUMMARY>\n");
>> +		else
>> +			printf("CPU %d\n", cs->cpu);
>> +
>> +		print_separator(100, "", 0);
>> +		print_cpu_stats(cs);
>> +		print_separator(100, "", 0);
>> +
>> +		dptr = cptr->domain_head;
>> +
>> +		while (dptr) {
>> +			ds = dptr->domain_data;
>> +			if (is_summary)
>> +				if (ds->name[0])
>> +					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %s\n", ds->name);
>> +				else
>> +					printf("CPU <ALL CPUS SUMMARY>, DOMAIN %d\n", ds->domain);
>> +			else {
>> +				if (ds->name[0])
>> +					printf("CPU %d, DOMAIN %s CPUS ", cs->cpu, ds->name);
>> +				else
>> +					printf("CPU %d, DOMAIN %d CPUS ", cs->cpu, ds->domain);
>> +
>> +				print_domain_cpu_list(ds);
>> +			}
>> +			print_separator(100, "", 0);
>> +			print_domain_stats(ds, jiffies);
>> +			print_separator(100, "", 0);
>> +
>> +			dptr = dptr->next;
>> +		}
>> +		is_summary = false;
>> +		cptr = cptr->next;
>> +	}
>> +}
>> +
>> +static int perf_sched__process_schedstat(struct perf_session *session __maybe_unused,
>> +					 union perf_event *event)
>> +{
>> +	struct perf_cpu this_cpu;
>> +	static __u32 initial_cpu;
>> +
>> +	switch (event->header.type) {
>> +	case PERF_RECORD_SCHEDSTAT_CPU:
>> +		this_cpu.cpu = event->schedstat_cpu.cpu;
>> +		break;
>> +	case PERF_RECORD_SCHEDSTAT_DOMAIN:
>> +		this_cpu.cpu = event->schedstat_domain.cpu;
>> +		break;
>> +	default:
>> +		return 0;
>> +	}
>> +
>> +	if (user_requested_cpus && !perf_cpu_map__has(user_requested_cpus, this_cpu))
>> +		return 0;
>> +
>> +	if (event->header.type == PERF_RECORD_SCHEDSTAT_CPU) {
>> +		struct schedstat_cpu *temp = zalloc(sizeof(struct schedstat_cpu));
>> +
>> +		temp->cpu_data = zalloc(sizeof(struct perf_record_schedstat_cpu));
> 
> No error checks.
> 

Will fix this.

> 
>> +		memcpy(temp->cpu_data, &event->schedstat_cpu,
>> +		       sizeof(struct perf_record_schedstat_cpu));
>> +		temp->next = NULL;
>> +		temp->domain_head = NULL;
>> +
>> +		if (cpu_head && temp->cpu_data->cpu == initial_cpu)
>> +			after_workload_flag = true;
>> +
>> +		if (!after_workload_flag) {
>> +			if (!cpu_head) {
>> +				initial_cpu = temp->cpu_data->cpu;
>> +				cpu_head = temp;
>> +			} else
>> +				cpu_tail->next = temp;
>> +
>> +			cpu_tail = temp;
>> +		} else {
>> +			if (temp->cpu_data->cpu == initial_cpu) {
>> +				cpu_second_pass = cpu_head;
>> +				cpu_head->cpu_data->timestamp =
>> +					temp->cpu_data->timestamp - cpu_second_pass->cpu_data->timestamp;
>> +			} else {
>> +				cpu_second_pass = cpu_second_pass->next;
>> +			}
>> +			domain_second_pass = cpu_second_pass->domain_head;
>> +			store_schedtstat_cpu_diff(temp);
> 
> Is 'temp' used after this?
> 

I will free temp as it is not used later.

> 
>> +		}
>> +	} else if (event->header.type == PERF_RECORD_SCHEDSTAT_DOMAIN) {
>> +		size_t cpu_mask_size = (event->schedstat_domain.nr_cpus + 7) >> 3;
>> +		struct schedstat_domain *temp = zalloc(sizeof(struct schedstat_domain));
>> +
>> +		temp->domain_data = zalloc(sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
> 
> No error checks.
> 

Will handle this.

> 
>> +		memcpy(temp->domain_data, &event->schedstat_domain,
>> +		       sizeof(struct perf_record_schedstat_domain) + cpu_mask_size);
>> +		temp->next = NULL;
>> +
>> +		if (!after_workload_flag) {
>> +			if (cpu_tail->domain_head == NULL) {
>> +				cpu_tail->domain_head = temp;
>> +				domain_tail = temp;
>> +			} else {
>> +				domain_tail->next = temp;
>> +				domain_tail = temp;
>> +			}
>> +		} else {
>> +			store_schedstat_domain_diff(temp);
>> +			domain_second_pass = domain_second_pass->next;
> 
> Is 'temp' leaking?
> 

Yes, I will fix it.

> 
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void free_schedstat(struct schedstat_cpu *cptr)
>> +{
>> +	struct schedstat_domain *dptr = NULL, *tmp_dptr;
>> +	struct schedstat_cpu *tmp_cptr;
>> +
>> +	while (cptr) {
>> +		tmp_cptr = cptr;
>> +		dptr = cptr->domain_head;
>> +
>> +		while (dptr) {
>> +			tmp_dptr = dptr;
>> +			dptr = dptr->next;
>> +			free(tmp_dptr);
>> +		}
>> +		cptr = cptr->next;
>> +		free(tmp_cptr);
>> +	}
>> +}
>> +
>> +static int perf_sched__schedstat_report(struct perf_sched *sched)
>> +{
>> +	struct perf_session *session;
>> +	struct perf_data data = {
>> +		.path  = input_name,
>> +		.mode  = PERF_DATA_MODE_READ,
>> +	};
>> +	int err;
>> +
>> +	if (cpu_list) {
>> +		user_requested_cpus = perf_cpu_map__new(cpu_list);
>> +		if (!user_requested_cpus)
>> +			return -EINVAL;
>> +	}
>> +
>> +	sched->tool.schedstat_cpu = perf_sched__process_schedstat;
>> +	sched->tool.schedstat_domain = perf_sched__process_schedstat;
>> +
>> +	session = perf_session__new(&data, &sched->tool);
>> +	if (IS_ERR(session)) {
>> +		pr_err("Perf session creation failed.\n");
>> +		return PTR_ERR(session);
>> +	}
>> +
>> +	err = perf_session__process_events(session);
>> +
>> +	perf_session__delete(session);
> 
> Quite unusual location to do this. :)  Probably better to call it after
> finishing the actual logic as you might need some session data later.
> 

Will fix this.

> 
>> +	if (!err) {
>> +		setup_pager();
>> +		show_schedstat_data(cpu_head);
>> +		free_schedstat(cpu_head);
>> +	}
> 
> 	perf_cpu_map__put(user_requested_cpus);

Ack.

> 
>> +	return err;
>> +}
>> +
>>   static bool schedstat_events_exposed(void)
>>   {
>>   	/*
>> @@ -4046,6 +4541,8 @@ int cmd_sched(int argc, const char **argv)
>>   	OPT_PARENT(sched_options)
>>   	};
>>   	const struct option stats_options[] = {
>> +	OPT_STRING('i', "input", &input_name, "file",
>> +		   "`stats report` with input filename"),
>>   	OPT_STRING('o', "output", &output_name, "file",
>>   		   "`stats record` with output filename"),
>>   	OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
>> @@ -4171,7 +4668,7 @@ int cmd_sched(int argc, const char **argv)
>>   
>>   		return perf_sched__timehist(&sched);
>>   	} else if (!strcmp(argv[0], "stats")) {
>> -		const char *const stats_subcommands[] = {"record", NULL};
>> +		const char *const stats_subcommands[] = {"record", "report", NULL};
>>   
>>   		argc = parse_options_subcommand(argc, argv, stats_options,
>>   						stats_subcommands,
>> @@ -4183,6 +4680,11 @@ int cmd_sched(int argc, const char **argv)
>>   				argc = parse_options(argc, argv, stats_options,
>>   						     stats_usage, 0);
>>   			return perf_sched__schedstat_record(&sched, argc, argv);
>> +		} else if (argv[0] && !strcmp(argv[0], "report")) {
>> +			if (argc)
>> +				argc = parse_options(argc, argv, stats_options,
>> +						     stats_usage, 0);
>> +			return perf_sched__schedstat_report(&sched);
>>   		}
>>   		usage_with_options(stats_usage, stats_options);
>>   	} else {
>> diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
>> index d09c3c99ab48..4071bd95192d 100644
>> --- a/tools/perf/util/event.c
>> +++ b/tools/perf/util/event.c
>> @@ -560,7 +560,7 @@ size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
>>   
>>   	size = fprintf(fp, "\ncpu%u ", cs->cpu);
>>   
>> -#define CPU_FIELD(_type, _name, _ver)						\
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)		\
>>   	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)cs->_ver._name)
>>   
>>   	if (version == 15) {
>> @@ -641,7 +641,7 @@ size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
>>   	size += fprintf(fp, "%s ", cpu_mask);
>>   	free(cpu_mask);
>>   
>> -#define DOMAIN_FIELD(_type, _name, _ver)					\
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)		\
>>   	size += fprintf(fp, "%" PRIu64 " ", (unsigned long)ds->_ver._name)
>>   
>>   	if (version == 15) {
>> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
>> index fad0c472f297..495ed8433c0c 100644
>> --- a/tools/perf/util/synthetic-events.c
>> +++ b/tools/perf/util/synthetic-events.c
>> @@ -2538,7 +2538,7 @@ static union perf_event *__synthesize_schedstat_cpu(struct io *io, __u16 version
>>   	if (io__get_dec(io, (__u64 *)cpu) != ' ')
>>   		goto out_cpu;
>>   
>> -#define CPU_FIELD(_type, _name, _ver)					\
>> +#define CPU_FIELD(_type, _name, _desc, _format, _is_pct, _pct_of, _ver)	\
>>   	do {								\
>>   		__u64 _tmp;						\
>>   		ch = io__get_dec(io, &_tmp);				\
>> @@ -2662,7 +2662,7 @@ static union perf_event *__synthesize_schedstat_domain(struct io *io, __u16 vers
>>   	free(d_name);
>>   	free(cpu_mask);
>>   
>> -#define DOMAIN_FIELD(_type, _name, _ver)				\
>> +#define DOMAIN_FIELD(_type, _name, _desc, _format, _is_jiffies, _ver)	\
>>   	do {								\
>>   		__u64 _tmp;						\
>>   		ch = io__get_dec(io, &_tmp);				\
>> -- 
>> 2.43.0
>>
--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/8] perf sched stats: Add support for live mode
  2025-03-15  4:46   ` Namhyung Kim
@ 2025-03-24  9:15     ` Sapkal, Swapnil
  0 siblings, 0 replies; 23+ messages in thread
From: Sapkal, Swapnil @ 2025-03-24  9:15 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: peterz, mingo, acme, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

Hi Namhyung,

Sorry for the delay in response.

On 3/15/2025 10:16 AM, Namhyung Kim wrote:
> On Tue, Mar 11, 2025 at 12:02:27PM +0000, Swapnil Sapkal wrote:
>> The live mode works similar to simple `perf stat` command, by profiling
>> the target and printing results on the terminal as soon as the target
>> finishes.
>>
>> Example usage:
>>
>>    # perf sched stats -- sleep 10
>>
>> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> Tested-by: James Clark <james.clark@linaro.org>
>> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>> ---
>>   tools/perf/builtin-sched.c | 87 +++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 86 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
>> index e2e7dbc4f0aa..9813e25b54b8 100644
>> --- a/tools/perf/builtin-sched.c
>> +++ b/tools/perf/builtin-sched.c
>> @@ -4364,6 +4364,91 @@ static int perf_sched__schedstat_report(struct perf_sched *sched)
>>   	return err;
>>   }
>>   
>> +static int process_synthesized_event_live(const struct perf_tool *tool __maybe_unused,
>> +					  union perf_event *event,
>> +					  struct perf_sample *sample __maybe_unused,
>> +					  struct machine *machine __maybe_unused)
>> +{
>> +	return perf_sched__process_schedstat(NULL, event);
>> +}
>> +
>> +static int perf_sched__schedstat_live(struct perf_sched *sched,
>> +				      int argc, const char **argv)
>> +{
>> +	struct evlist *evlist;
>> +	struct target *target;
>> +	int reset = 0;
>> +	int err = 0;
>> +
>> +	signal(SIGINT, sighandler);
>> +	signal(SIGCHLD, sighandler);
>> +	signal(SIGTERM, sighandler);
>> +
>> +	evlist = evlist__new();
>> +	if (!evlist)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * `perf sched schedstat` does not support workload profiling (-p pid)
>> +	 * since /proc/schedstat file contains cpu specific data only. Hence, a
>> +	 * profile target is either set of cpus or systemwide, never a process.
>> +	 * Note that, although `-- <workload>` is supported, profile data are
>> +	 * still cpu/systemwide.
>> +	 */
>> +	target = zalloc(sizeof(struct target));
> 
> As I said, you can put it on stack.
> 

Sure.

> 
>> +	if (cpu_list)
>> +		target->cpu_list = cpu_list;
>> +	else
>> +		target->system_wide = true;
>> +
>> +	if (argc) {
>> +		err = evlist__prepare_workload(evlist, target, argv, false, NULL);
>> +		if (err)
>> +			goto out_target;
>> +	}
>> +
>> +	if (cpu_list) {
>> +		user_requested_cpus = perf_cpu_map__new(cpu_list);
>> +		if (!user_requested_cpus)
>> +			goto out_target;
>> +	}
> 
> How about this instead?
> 
> 	evlist__create_maps(evlist, target);
> 

Sure, will use evlist__create_maps(evlist, target).

>> +
>> +	err = perf_event__synthesize_schedstat(&(sched->tool),
>> +					       process_synthesized_event_live,
>> +					       user_requested_cpus);
>> +	if (err < 0)
>> +		goto out_target;
>> +
>> +	err = enable_sched_schedstats(&reset);
>> +	if (err < 0)
>> +		goto out_target;
>> +
>> +	if (argc)
>> +		evlist__start_workload(evlist);
>> +
>> +	/* wait for signal */
>> +	pause();
>> +
>> +	if (reset) {
>> +		err = disable_sched_schedstat();
>> +		if (err < 0)
>> +			goto out_target;
>> +	}
>> +
>> +	err = perf_event__synthesize_schedstat(&(sched->tool),
>> +					       process_synthesized_event_live,
>> +					       user_requested_cpus);
>> +	if (err)
>> +		goto out_target;
>> +
>> +	setup_pager();
>> +	show_schedstat_data(cpu_head);
>> +	free_schedstat(cpu_head);
>> +out_target:
>> +	free(target);
> 
> 	evlist__delete(evlist);
> 
> and unless you use evlist__create_maps().
> 

Ack.

> 	perf_cpu_map__put(user_requested_cpus);
> 
> Thanks,
> Namhyung
> 
> 
>> +	return err;
>> +}
>> +
>>   static bool schedstat_events_exposed(void)
>>   {
>>   	/*
>> @@ -4686,7 +4771,7 @@ int cmd_sched(int argc, const char **argv)
>>   						     stats_usage, 0);
>>   			return perf_sched__schedstat_report(&sched);
>>   		}
>> -		usage_with_options(stats_usage, stats_options);
>> +		return perf_sched__schedstat_live(&sched, argc, argv);
>>   	} else {
>>   		usage_with_options(sched_usage, sched_options);
>>   	}
>> -- 
>> 2.43.0
>>
--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] perf sched: Introduce stats tool
  2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
                   ` (7 preceding siblings ...)
  2025-03-11 12:02 ` [PATCH v3 8/8] perf sched stats: Add details in man page Swapnil Sapkal
@ 2025-04-10  9:41 ` Chen, Yu C
  2025-04-10 10:29   ` Sapkal, Swapnil
  8 siblings, 1 reply; 23+ messages in thread
From: Chen, Yu C @ 2025-04-10  9:41 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: ravi.bangoria, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, peterz, acme, james.clark, namhyung, irogers, mingo

Hi Swapnil,

On 3/11/2025 8:02 PM, Swapnil Sapkal wrote:
> MOTIVATION
> ----------
> 
> Existing `perf sched` is quite exhaustive and provides a lot of insight
> into scheduler behavior, but it quickly becomes impractical to use for
> long-running or scheduler-intensive workloads. For example, `perf sched
> record` has ~7.77% overhead on hackbench (with 25 groups each running
> 700K loops on a 2-socket, 128-core, 256-thread 3rd Generation EPYC
> server), and it generates a huge 56G perf.data file which perf takes
> ~137 mins to prepare and write to disk [1].
> 
> Unlike `perf sched record`, which hooks onto a set of scheduler
> tracepoints and generates a sample on each tracepoint hit, `perf sched
> stats record` takes a snapshot of the /proc/schedstat file before and
> after the workload, i.e. there is almost zero interference with the
> workload run. It also takes very little time to parse /proc/schedstat,
> convert it into perf samples and save those samples into a perf.data
> file. The resulting perf.data file is much smaller, so overall `perf
> sched stats record` is much more lightweight compared to `perf sched record`.
> 
> We, internally at AMD, have been using this (a variant of it, known as
> "sched-scoreboard" [2]) and have found it very useful for analysing the
> impact of scheduler code changes [3][4]. Prateek used v2 [5] of this
> patch series to report the analysis [6][7].
> 
> Please note that this is not a replacement for perf sched record/report.
> The intended users of the new tool are scheduler developers, not regular
> users.
> 
> USAGE
> -----
> 
>    # perf sched stats record
>    # perf sched stats report
>    # perf sched stats diff
> 

May I know the status of this patch set? I tested it on a 96-core
system and it works as expected in general.

One nit question:
Are perf.data and perf.data.old the default files for comparison if no
files are provided to `perf sched stats diff`?


thanks,
Chenyu



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] perf sched: Introduce stats tool
  2025-04-10  9:41 ` [PATCH v3 0/8] perf sched: Introduce stats tool Chen, Yu C
@ 2025-04-10 10:29   ` Sapkal, Swapnil
  0 siblings, 0 replies; 23+ messages in thread
From: Sapkal, Swapnil @ 2025-04-10 10:29 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: ravi.bangoria, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, peterz, acme, james.clark, namhyung, irogers, mingo

Hi Chenyu,

On 4/10/2025 3:11 PM, Chen, Yu C wrote:
> Hi Swapnil,
> 
> On 3/11/2025 8:02 PM, Swapnil Sapkal wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides a lot of insight
>> into scheduler behavior, but it quickly becomes impractical to use for
>> long-running or scheduler-intensive workloads. For example, `perf sched
>> record` has ~7.77% overhead on hackbench (with 25 groups each running
>> 700K loops on a 2-socket, 128-core, 256-thread 3rd Generation EPYC
>> server), and it generates a huge 56G perf.data file which perf takes
>> ~137 mins to prepare and write to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto a set of scheduler
>> tracepoints and generates a sample on each tracepoint hit, `perf sched
>> stats record` takes a snapshot of the /proc/schedstat file before and
>> after the workload, i.e. there is almost zero interference with the
>> workload run. It also takes very little time to parse /proc/schedstat,
>> convert it into perf samples and save those samples into a perf.data
>> file. The resulting perf.data file is much smaller, so overall `perf
>> sched stats record` is much more lightweight compared to `perf sched record`.
>>
>> We, internally at AMD, have been using this (a variant of it, known as
>> "sched-scoreboard" [2]) and have found it very useful for analysing the
>> impact of scheduler code changes [3][4]. Prateek used v2 [5] of this
>> patch series to report the analysis [6][7].
>>
>> Please note that this is not a replacement for perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
>>
>> USAGE
>> -----
>>
>>    # perf sched stats record
>>    # perf sched stats report
>>    # perf sched stats diff
>>
> 
> May I know the status of this patch set? I tested it on a 96-core system and it works as expected in general.

Thank you for testing the patch set. I am working on v4 based on Namhyung's suggestions.

> 
> One nit question:
> Are perf.data and perf.data.old the default files
> for comparison if no files are provided to
> perf sched stats diff?
> 

Yes, if no files are provided to `perf sched stats diff`, it will use perf.data and perf.data.old as the defaults.

> 
> thanks,
> Chenyu
> 
> 
--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/8] perf sched stats: Add support for report subcommand
  2025-03-11 12:02 ` [PATCH v3 4/8] perf sched stats: Add support for report subcommand Swapnil Sapkal
  2025-03-15  4:39   ` Namhyung Kim
@ 2025-05-20 10:36   ` Peter Zijlstra
  2025-05-21  5:32     ` Sapkal, Swapnil
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-05-20 10:36 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: mingo, acme, namhyung, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

On Tue, Mar 11, 2025 at 12:02:26PM +0000, Swapnil Sapkal wrote:
> `perf sched stats record` captures two sets of samples. For a workload
> profile, the first set is taken right before the workload starts and the
> second set after the workload finishes. For a systemwide profile, the
> first set is taken at the beginning of the profile and the second set on
> receiving a SIGINT signal.
> 
> Add a `perf sched stats report` subcommand that reads both sets of
> samples, computes the diff and renders a final report. The final report
> prints scheduler stats at cpu granularity as well as at sched domain
> granularity.
> 
> Example usage:
> 
>   # perf sched stats record
>   # perf sched stats report
> 

> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category idle> ");
>  #endif
> +DOMAIN_FIELD(__u32, idle_lb_count,
> +	     "load_balance() count on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_balanced,
> +	     "load_balance() found balanced on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_failed,
> +	     "load_balance() move task failed on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_imbalance,
> +	     "imbalance sum on cpu idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, idle_lb_gained,
> +	     "pull_task() count on cpu idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v15);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v15);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v15);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category busy> ");
> +#endif
> +DOMAIN_FIELD(__u32, busy_lb_count,
> +	     "load_balance() count on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_balanced,
> +	     "load_balance() found balanced on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_failed,
> +	     "load_balance() move task failed on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_imbalance,
> +	     "imbalance sum on cpu busy", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, busy_lb_gained,
> +	     "pull_task() count on cpu busy", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v15);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v15);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v15);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category newidle> ");
> +#endif
> +DOMAIN_FIELD(__u32, newidle_lb_count,
> +	     "load_balance() count on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_failed,
> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_imbalance,
> +	     "imbalance sum on cpu newly idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_gained,
> +	     "pull_task() count on cpu newly idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v15);
> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v15);
> +#ifdef DERIVED_CNT_FIELD
> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v15);
> +#endif
> +#ifdef DERIVED_AVG_FIELD
> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v15);
> +#endif
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
> +#endif
> +DOMAIN_FIELD(__u32, alb_count,
> +	     "active_load_balance() count", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, alb_failed,
> +	     "active_load_balance() move task failed", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, alb_pushed,
> +	     "active_load_balance() successfully moved a task", "%11u", false, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbe_count,
> +	     "sbe_count is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbe_balanced,
> +	     "sbe_balanced is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbe_pushed,
> +	     "sbe_pushed is not used", "%11u", false, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
> +#endif
> +DOMAIN_FIELD(__u32, sbf_count,
> +	     "sbf_count is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbf_balanced,
> +	     "sbf_balanced is not used", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, sbf_pushed,
> +	     "sbf_pushed is not used", "%11u", false, v15);
> +#ifdef DOMAIN_CATEGORY
> +DOMAIN_CATEGORY(" <Wakeup Info> ");
> +#endif
> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, ttwu_move_affine,
> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v15);
> +DOMAIN_FIELD(__u32, ttwu_move_balance,
> +	     "try_to_wake_up() started passive balancing", "%11u", false, v15);
> +#endif /* DOMAIN_FIELD */

So I have one request for a future version of this. Could we please add
a knob to print the output using the field name instead of the fancy
pants description?

It is *MUCH* easier to grep the field name in the code than to try and
figure out wth this description is on about :-)

That is, ttwu_wake_remote is infinitely better than "try_to_wake_up()
awoke a task that last ran on a diff cpu" and so on.

I realize I might be weird, but it should be simple enough to add and it
makes my life easier :-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/8] perf sched stats: Add support for report subcommand
  2025-05-20 10:36   ` Peter Zijlstra
@ 2025-05-21  5:32     ` Sapkal, Swapnil
  0 siblings, 0 replies; 23+ messages in thread
From: Sapkal, Swapnil @ 2025-05-21  5:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, namhyung, irogers, james.clark, ravi.bangoria,
	yu.c.chen, mark.rutland, alexander.shishkin, jolsa, rostedt,
	vincent.guittot, adrian.hunter, kan.liang, gautham.shenoy,
	kprateek.nayak, juri.lelli, yangjihong, void, tj, sshegde,
	linux-kernel, linux-perf-users, santosh.shukla, ananth.narayan,
	sandipan.das, James Clark

Hi Peter,

On 5/20/2025 4:06 PM, Peter Zijlstra wrote:
> On Tue, Mar 11, 2025 at 12:02:26PM +0000, Swapnil Sapkal wrote:
>> `perf sched stats record` captures two sets of samples. For a workload
>> profile, the first set is taken right before the workload starts and the
>> second set after the workload finishes. For a systemwide profile, the
>> first set is taken at the beginning of the profile and the second set on
>> receiving a SIGINT signal.
>>
>> Add a `perf sched stats report` subcommand that reads both sets of
>> samples, computes the diff and renders a final report. The final report
>> prints scheduler stats at cpu granularity as well as at sched domain
>> granularity.
>>
>> Example usage:
>>
>>    # perf sched stats record
>>    # perf sched stats report
>>
> 
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category idle> ");
>>   #endif
>> +DOMAIN_FIELD(__u32, idle_lb_count,
>> +	     "load_balance() count on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_balanced,
>> +	     "load_balance() found balanced on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_failed,
>> +	     "load_balance() move task failed on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_imbalance,
>> +	     "imbalance sum on cpu idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_gained,
>> +	     "pull_task() count on cpu idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, idle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v15);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v15);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
>> +		  idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained, v15);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category busy> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, busy_lb_count,
>> +	     "load_balance() count on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_balanced,
>> +	     "load_balance() found balanced on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_failed,
>> +	     "load_balance() move task failed on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_imbalance,
>> +	     "imbalance sum on cpu busy", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_gained,
>> +	     "pull_task() count on cpu busy", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu busy", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, busy_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v15);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v15);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
>> +		  busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained, v15);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category newidle> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, newidle_lb_count,
>> +	     "load_balance() count on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_balanced,
>> +	     "load_balance() found balanced on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_failed,
>> +	     "load_balance() move task failed on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_imbalance,
>> +	     "imbalance sum on cpu newly idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_gained,
>> +	     "pull_task() count on cpu newly idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
>> +	     "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
>> +	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v15);
>> +DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
>> +	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v15);
>> +#ifdef DERIVED_CNT_FIELD
>> +DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v15);
>> +#endif
>> +#ifdef DERIVED_AVG_FIELD
>> +DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
>> +		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained, v15);
>> +#endif
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category active_load_balance()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, alb_count,
>> +	     "active_load_balance() count", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, alb_failed,
>> +	     "active_load_balance() move task failed", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, alb_pushed,
>> +	     "active_load_balance() successfully moved a task", "%11u", false, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_exec()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbe_count,
>> +	     "sbe_count is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbe_balanced,
>> +	     "sbe_balanced is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbe_pushed,
>> +	     "sbe_pushed is not used", "%11u", false, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Category sched_balance_fork()> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, sbf_count,
>> +	     "sbf_count is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbf_balanced,
>> +	     "sbf_balanced is not used", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, sbf_pushed,
>> +	     "sbf_pushed is not used", "%11u", false, v15);
>> +#ifdef DOMAIN_CATEGORY
>> +DOMAIN_CATEGORY(" <Wakeup Info> ");
>> +#endif
>> +DOMAIN_FIELD(__u32, ttwu_wake_remote,
>> +	     "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, ttwu_move_affine,
>> +	     "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false, v15);
>> +DOMAIN_FIELD(__u32, ttwu_move_balance,
>> +	     "try_to_wake_up() started passive balancing", "%11u", false, v15);
>> +#endif /* DOMAIN_FIELD */
> 
> So I have one request for a future version of this. Could we please add
> a knob to print the output using the field name instead of the fancy
> pants description?
> 

Sure, I will add a knob to print the field name.

> It is *MUCH* easier to grep the field name in the code than to try and
> figure out wth this description is on about :-)
> 
> That is, ttwu_wake_remote is infinitely better than "try_to_wake_up()
> awoke a task that last ran on a diff cpu" and so on.
> 

I agree.

> I realize I might be weird, but it should be simple enough to add and it
> makes my life easier :-)

Thank you for the suggestion. It is simple enough to add.

--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-05-21  5:33 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-11 12:02 [PATCH v3 0/8] perf sched: Introduce stats tool Swapnil Sapkal
2025-03-11 12:02 ` [PATCH v3 1/8] perf sched stats: Add record and rawdump support Swapnil Sapkal
2025-03-11 13:10   ` Markus Elfring
2025-03-11 16:19   ` Markus Elfring
2025-03-15  2:24   ` Namhyung Kim
2025-03-17 13:29     ` Sapkal, Swapnil
2025-03-11 12:02 ` [PATCH v3 2/8] perf sched stats: Add schedstat v16 support Swapnil Sapkal
2025-03-11 12:02 ` [PATCH v3 3/8] perf sched stats: Add schedstat v17 support Swapnil Sapkal
2025-03-15  2:27   ` Namhyung Kim
2025-03-17 13:32     ` Sapkal, Swapnil
2025-03-11 12:02 ` [PATCH v3 4/8] perf sched stats: Add support for report subcommand Swapnil Sapkal
2025-03-15  4:39   ` Namhyung Kim
2025-03-18 11:08     ` Sapkal, Swapnil
2025-05-20 10:36   ` Peter Zijlstra
2025-05-21  5:32     ` Sapkal, Swapnil
2025-03-11 12:02 ` [PATCH v3 5/8] perf sched stats: Add support for live mode Swapnil Sapkal
2025-03-15  4:46   ` Namhyung Kim
2025-03-24  9:15     ` Sapkal, Swapnil
2025-03-11 12:02 ` [PATCH v3 6/8] perf sched stats: Add support for diff subcommand Swapnil Sapkal
2025-03-11 12:02 ` [PATCH v3 7/8] perf sched stats: Add basic perf sched stats test Swapnil Sapkal
2025-03-11 12:02 ` [PATCH v3 8/8] perf sched stats: Add details in man page Swapnil Sapkal
2025-04-10  9:41 ` [PATCH v3 0/8] perf sched: Introduce stats tool Chen, Yu C
2025-04-10 10:29   ` Sapkal, Swapnil

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).