[PATCH] tools/perf: Add wall-clock and parallelism profiling

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] tools/perf: Add wall-clock and parallelism profiling
@ 2025-01-08  8:24 Dmitry Vyukov
  2025-01-08  8:34 ` Dmitry Vyukov
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-08  8:24 UTC (permalink / raw)
  To: namhyung, irogers; +Cc: Dmitry Vyukov, linux-perf-users, linux-kernel

There are two notions of time: wall-clock time and CPU time.
For a single-threaded program, or a program running on a single-core
machine, these notions are the same. However, for a multi-threaded/
multi-process program running on a multi-core machine, these notions are
significantly different. Each second of wall-clock time we have
number-of-cores seconds of CPU time.

Currently perf only allows to profile CPU time. Perf (and all other
existing profilers to be best of my knowledge) does not allow to profile
wall-clock time.

Optimizing CPU overhead is useful to improve 'throughput', while
optimizing wall-clock overhead is useful to improve 'latency'.
These profiles are complementary and are not interchangeable.
Examples of where wall-clock profile is needed:
 - optimzing build latency
 - optimizing server request latency
 - optimizing ML training/inference latency
 - optimizing running time of any command line program

CPU profile is useless for these use cases at best (if a user understands
the difference), or misleading at worst (if a user tries to use a wrong
profile for a job).

This patch adds wall-clock and parallelization profiling.
See the added documentation and flags descriptions for details.

Brief outline of the implementation:
 - add context switch collection during record
 - calculate number of threads running on CPUs (parallelism level)
   during report
 - divide each sample weight by the parallelism level
This effectively models that we were taking 1 sample per unit of
wall-clock time.

The feature is added on an equal footing with the existing CPU profiling
rather than a separate mode enabled with special flags. The reasoning is
that users may not understand the problem and the meaning of numbers they
are seeing in the first place, so won't even realize that they may need
to be looking for some different profiling mode. When they are presented
with 2 sets of different numbers, they should start asking questions.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 .../callchain-overhead-calculation.txt        |  5 +-
 .../cpu-and-wallclock-overheads.txt           | 85 +++++++++++++++++
 tools/perf/Documentation/perf-report.txt      | 50 ++++++----
 tools/perf/Documentation/tips.txt             |  3 +
 tools/perf/builtin-record.c                   |  6 ++
 tools/perf/builtin-report.c                   | 30 ++++++
 tools/perf/ui/browsers/hists.c                | 27 ++++--
 tools/perf/ui/hist.c                          | 61 +++++++++++--
 tools/perf/util/addr_location.c               |  1 +
 tools/perf/util/addr_location.h               |  6 +-
 tools/perf/util/event.c                       |  6 ++
 tools/perf/util/events_stats.h                |  2 +
 tools/perf/util/hist.c                        | 91 ++++++++++++++++---
 tools/perf/util/hist.h                        | 25 ++++-
 tools/perf/util/machine.c                     |  7 ++
 tools/perf/util/machine.h                     |  6 ++
 tools/perf/util/session.c                     | 12 +++
 tools/perf/util/session.h                     |  1 +
 tools/perf/util/sort.c                        | 59 +++++++++++-
 tools/perf/util/sort.h                        |  1 +
 tools/perf/util/symbol.c                      | 34 +++++++
 tools/perf/util/symbol_conf.h                 |  7 +-
 22 files changed, 467 insertions(+), 58 deletions(-)
 create mode 100644 tools/perf/Documentation/cpu-and-wallclock-overheads.txt

diff --git a/tools/perf/Documentation/callchain-overhead-calculation.txt b/tools/perf/Documentation/callchain-overhead-calculation.txt
index 1a757927195ed..e0202bf5bd1a0 100644
--- a/tools/perf/Documentation/callchain-overhead-calculation.txt
+++ b/tools/perf/Documentation/callchain-overhead-calculation.txt
@@ -1,7 +1,8 @@
 Overhead calculation
 --------------------
-The overhead can be shown in two columns as 'Children' and 'Self' when
-perf collects callchains.  The 'self' overhead is simply calculated by
+The CPU overhead can be shown in two columns as 'Children' and 'Self'
+when perf collects callchains (and corresponding 'Wall' columns for
+wall-clock overhead).  The 'self' overhead is simply calculated by
 adding all period values of the entry - usually a function (symbol).
 This is the value that perf shows traditionally and sum of all the
 'self' overhead values should be 100%.
diff --git a/tools/perf/Documentation/cpu-and-wallclock-overheads.txt b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
new file mode 100644
index 0000000000000..4f739ff4de437
--- /dev/null
+++ b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
@@ -0,0 +1,85 @@
+CPU and wall-clock overheads
+----------------------------
+There are two notions of time: wall-clock time and CPU time.
+For a single-threaded program, or a program running on a single-core machine,
+these notions are the same. However, for a multi-threaded/multi-process program
+running on a multi-core machine, these notions are significantly different.
+Each second of wall-clock time we have number-of-cores seconds of CPU time.
+Perf can measure overhead for both of these times (shown in 'overhead' and
+'wallclock' columns for CPU and wall-clock correspondingly).
+
+Optimizing CPU overhead is useful to improve 'throughput', while optimizing
+wall-clock overhead is useful to improve 'latency'. It's important to understand
+which one is useful in a concrete situation at hand. For example, the former
+may be useful to improve max throughput of a CI build server that runs on 100%
+CPU utilization, while the latter may be useful to improve user-perceived
+latency of a single interactive program build.
+These overheads may be significantly different in some cases. For example,
+consider a program that executes function 'foo' for 9 seconds with 1 thread,
+and then executes function 'bar' for 1 second with 128 threads (consumes
+128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
+While wall-clock overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize
+running time of the program looking at the (wrong in this case) CPU overhead,
+we would concentrate on the function 'bar', but it can yield only 10% running
+time improvement at best.
+
+By default, perf shows both overheads (for parallel programs), but sorts by
+CPU overhead:
+
+-----------------------------------
+Overhead  Wallclock  Command
+  93.88%     25.79%  cc1
+   1.90%     39.87%  gzip
+   0.99%     10.16%  dpkg-deb
+   0.57%      1.00%  as
+   0.40%      0.46%  sh
+-----------------------------------
+
+To sort by wall-clock overhead, use '--sort=wallclock' flag:
+
+-----------------------------------
+Wallclock  Overhead  Command
+   39.87%     1.90%  gzip
+   25.79%    93.88%  cc1
+   10.16%     0.99%  dpkg-deb
+    4.17%     0.29%  git
+    2.81%     0.11%  objtool
+-----------------------------------
+
+To get insight into the difference between the overheads, you may check
+parallelization histogram with '--sort=wallclock,parallelism,comm,symbol --hierarchy'
+flags. It shows fraction of (wall-clock) time the workload utilizes different
+numbers of cores ('parallelism' column). For example, in the following case
+the workload utilizes only 1 core most of the time, but also has some
+highly-parallel phases, which explains significant difference between
+CPU and wall-clock overheads:
+
+-----------------------------------
+  Wallclock  Overhead     Parallelism / Command / Symbol
++    56.98%     2.29%     1
++    16.94%     1.36%     2
++     4.00%    20.13%     125
++     3.66%    18.25%     124
++     3.48%    17.66%     126
++     3.26%     0.39%     3
++     2.61%    12.93%     123
+-----------------------------------
+
+By expanding corresponding lines, you may see what programs/functions run
+at the given parallelism level:
+
+-----------------------------------
+  Wallclock  Overhead     Parallelism / Command / Symbol
+-    56.98%     2.29%     1
+        32.80%     1.32%     gzip
+         4.46%     0.18%     cc1
+         2.81%     0.11%     objtool
+         2.43%     0.10%     dpkg-source
+         2.22%     0.09%     ld
+         2.10%     0.08%     dpkg-genchanges
+-----------------------------------
+
+To see the normal function-level profile for particular parallelism levels
+(number of threads actively running on CPUs), you may use '--parallelism'
+filter. For example, to see the profile only for low parallelism phases
+of a workload use '--sort=wallclock,comm,symbol --parallelism=1-2' flags.
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 87f8645194062..e79ff412af0a5 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -44,7 +44,7 @@ OPTIONS
 --comms=::
 	Only consider symbols in these comms. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and wallclock columns.  See --percentage for more info.
 --pid=::
         Only show events for given process ID (comma separated list).
 
@@ -54,12 +54,12 @@ OPTIONS
 --dsos=::
 	Only consider symbols in these dsos. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and wallclock columns.  See --percentage for more info.
 -S::
 --symbols=::
 	Only consider these symbols. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and wallclock columns.  See --percentage for more info.
 
 --symbol-filter=::
 	Only show symbols that match (partially) with this filter.
@@ -68,6 +68,16 @@ OPTIONS
 --hide-unresolved::
         Only display entries resolved to a symbol.
 
+--parallelism::
+        Only consider these parallelism levels. Parallelism level is the number
+        of threads that actively run on CPUs at the time of sample. The flag
+        accepts single number, comma-separated list, and ranges (for example:
+        "1", "7,8", "1,64-128"). This is useful in understanding what a program
+        is doing during sequential/low-parallelism phases as compared to
+        high-parallelism phases. This option will affect the percentage of
+        the overhead and wallclock columns. See --percentage for more info.
+        Also see the `CPU and wall-clock overheads' section for more details.
+
 -s::
 --sort=::
 	Sort histogram entries by given key(s) - multiple keys can be specified
@@ -87,6 +97,7 @@ OPTIONS
 	entries are displayed as "[other]".
 	- cpu: cpu number the task ran at the time of sample
 	- socket: processor socket number the task ran at the time of sample
+	- parallelism: number of running threads at the time of sample
 	- srcline: filename and line number executed at the time of sample.  The
 	DWARF debugging info must be provided.
 	- srcfile: file name of the source file of the samples. Requires dwarf
@@ -97,12 +108,14 @@ OPTIONS
 	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
 	- cgroup: cgroup pathname in the cgroupfs.
 	- transaction: Transaction abort flags.
-	- overhead: Overhead percentage of sample
-	- overhead_sys: Overhead percentage of sample running in system mode
-	- overhead_us: Overhead percentage of sample running in user mode
-	- overhead_guest_sys: Overhead percentage of sample running in system mode
+	- overhead: CPU overhead percentage of sample
+	- wallclock: Wall-clock (latency) overhead percentage of sample.
+	  See the `CPU and wall-clock overheads' section for more details.
+	- overhead_sys: CPU overhead percentage of sample running in system mode
+	- overhead_us: CPU overhead percentage of sample running in user mode
+	- overhead_guest_sys: CPU overhead percentage of sample running in system mode
 	on guest machine
-	- overhead_guest_us: Overhead percentage of sample running in user mode on
+	- overhead_guest_us: CPU overhead percentage of sample running in user mode on
 	guest machine
 	- sample: Number of sample
 	- period: Raw number of event count of sample
@@ -125,8 +138,9 @@ OPTIONS
 	- weight2: Average value of event specific weight (2nd field of weight_struct).
 	- weight3: Average value of event specific weight (3rd field of weight_struct).
 
-	By default, comm, dso and symbol keys are used.
-	(i.e. --sort comm,dso,symbol)
+	By default, overhead, comm, dso and symbol keys are used.
+	(i.e. --sort overhead,comm,dso,symbol). If only wallclock is specified,
+	(i.e. --sort wallclock), it expands to wallclock,comm,dso,symbol.
 
 	If --branch-stack option is used, following sort keys are also
 	available:
@@ -201,9 +215,9 @@ OPTIONS
 --fields=::
 	Specify output field - multiple keys can be specified in CSV format.
 	Following fields are available:
-	overhead, overhead_sys, overhead_us, overhead_children, sample, period,
-	weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.  The
-	last 3 names are alias for the corresponding weights.  When the weight
+	overhead, wallclock, overhead_sys, overhead_us, overhead_children, sample,
+	period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
+	The last 3 names are alias for the corresponding weights.  When the weight
 	fields are used, they will show the average value of the weight.
 
 	Also it can contain any sort key(s).
@@ -289,7 +303,7 @@ OPTIONS
 	Accumulate callchain of children to parent entry so that then can
 	show up in the output.  The output will have a new "Children" column
 	and will be sorted on the data.  It requires callchains are recorded.
-	See the `overhead calculation' section for more details. Enabled by
+	See the `Overhead calculation' section for more details. Enabled by
 	default, disable with --no-children.
 
 --max-stack::
@@ -442,9 +456,9 @@ OPTIONS
 	--call-graph option for details.
 
 --percentage::
-	Determine how to display the overhead percentage of filtered entries.
-	Filters can be applied by --comms, --dsos and/or --symbols options and
-	Zoom operations on the TUI (thread, dso, etc).
+	Determine how to display the CPU and wall-clock overhead percentage
+	of filtered entries. Filters can be applied by --comms, --dsos, --symbols
+	and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc).
 
 	"relative" means it's relative to filtered entries only so that the
 	sum of shown entries will be always 100%.  "absolute" means it retains
@@ -627,6 +641,8 @@ include::itrace.txt[]
 --skip-empty::
 	Do not print 0 results in the --stat output.
 
+include::cpu-and-wallclock-overheads.txt[]
+
 include::callchain-overhead-calculation.txt[]
 
 SEE ALSO
diff --git a/tools/perf/Documentation/tips.txt b/tools/perf/Documentation/tips.txt
index 67b326ba00407..50916c58b81c5 100644
--- a/tools/perf/Documentation/tips.txt
+++ b/tools/perf/Documentation/tips.txt
@@ -62,3 +62,6 @@ To show context switches in perf report sample context add --switch-events to pe
 To show time in nanoseconds in record/report add --ns
 To compare hot regions in two workloads use perf record -b -o file ... ; perf diff --stream file1 file2
 To compare scalability of two workload samples use perf diff -c ratio file1 file2
+For latency analysis, try: perf report --sort wallclock
+For parallelism histogram, try: perf report --hierarchy --sort wallclock,parallelism,comm,symbol
+To analyze particular parallelism levels, try: perf report --sort=wallclock,symbol --parallelism=32-64
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f832524729211..225e3997f03cb 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3342,6 +3342,12 @@ static struct record record = {
 		.ctl_fd              = -1,
 		.ctl_fd_ack          = -1,
 		.synth               = PERF_SYNTH_ALL,
+		/*
+		 * This is enabled by default because it's required to show
+		 * wallclock overhead and parallelism profile, and does not
+		 * add lots of overhead on top of samples data.
+		 */
+		.record_switch_events	= true,
 	},
 };
 
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 42d7dfdf07d9b..d86937174a1e1 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -112,6 +112,8 @@ struct report {
 	u64			nr_entries;
 	u64			queue_size;
 	u64			total_cycles;
+	u64			total_samples;
+	u64			singlethreaded_samples;
 	int			socket_filter;
 	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 	struct branch_type_stat	brtype_stat;
@@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
 				     &rep->total_cycles, evsel);
 	}
 
+	rep->total_samples++;
+	if (al.parallelism == 1)
+		rep->singlethreaded_samples++;
+
 	ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
 	if (ret < 0)
 		pr_debug("problem adding hist entry, skipping event\n");
@@ -1081,6 +1087,10 @@ static int __cmd_report(struct report *rep)
 		return ret;
 	}
 
+	/* Don't show Wallclock column for non-parallel profiles. */
+	if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99)
+		perf_hpp__cancel_wallclock();
+
 	evlist__check_mem_load_aux(session->evlist);
 
 	if (rep->stats_mode)
@@ -1392,6 +1402,8 @@ int cmd_report(int argc, const char **argv)
 		     symbol__config_symfs),
 	OPT_STRING('C', "cpu", &report.cpu_list, "cpu",
 		   "list of cpus to profile"),
+	OPT_STRING(0, "parallelism", &symbol_conf.parallelism_list_str, "parallelism",
+		   "only consider these parallelism levels (cpu set format)"),
 	OPT_BOOLEAN('I', "show-info", &report.show_full_info,
 		    "Display extended information about perf.data file"),
 	OPT_BOOLEAN(0, "source", &annotate_opts.annotate_src,
@@ -1570,6 +1582,7 @@ int cmd_report(int argc, const char **argv)
 	report.tool.cgroup		 = perf_event__process_cgroup;
 	report.tool.exit		 = perf_event__process_exit;
 	report.tool.fork		 = perf_event__process_fork;
+	report.tool.context_switch	 = perf_event__process_switch;
 	report.tool.lost		 = perf_event__process_lost;
 	report.tool.read		 = process_read_event;
 	report.tool.attr		 = process_attr;
@@ -1721,6 +1734,23 @@ int cmd_report(int argc, const char **argv)
 		symbol_conf.annotate_data_sample = true;
 	}
 
+	if (report.disable_order || !perf_session__has_switch_events(session)) {
+		if (symbol_conf.parallelism_list_str ||
+			(sort_order && (strstr(sort_order, "wallclock") ||
+				strstr(sort_order, "parallelism"))) ||
+			(field_order && (strstr(field_order, "wallclock") ||
+				strstr(field_order, "parallelism")))) {
+			if (report.disable_order)
+				ui__error("Use of wallclock profile or parallelism"
+					" is incompatible with --disable-order.\n");
+			else
+				ui__error("Use of wallclock profile or parallelism"
+					" requires --switch-events during record.\n");
+			return -1;
+		}
+		symbol_conf.disable_wallclock = true;
+	}
+
 	if (last_key != K_SWITCH_INPUT_DATA) {
 		if (sort_order && strstr(sort_order, "ipc")) {
 			parse_options_usage(report_usage, options, "s", 1);
diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index 49ba82bf33918..abcc4a11849f4 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -1226,7 +1226,7 @@ int __hpp__slsmg_color_printf(struct perf_hpp *hpp, const char *fmt, ...)
 	return ret;
 }
 
-#define __HPP_COLOR_PERCENT_FN(_type, _field)				\
+#define __HPP_COLOR_PERCENT_FN(_type, _field, _fmttype)			\
 static u64 __hpp_get_##_field(struct hist_entry *he)			\
 {									\
 	return he->stat._field;						\
@@ -1238,10 +1238,10 @@ hist_browser__hpp_color_##_type(struct perf_hpp_fmt *fmt,		\
 				struct hist_entry *he)			\
 {									\
 	return hpp__fmt(fmt, hpp, he, __hpp_get_##_field, " %*.2f%%",	\
-			__hpp__slsmg_color_printf, true);		\
+			__hpp__slsmg_color_printf, _fmttype);		\
 }
 
-#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field)			\
+#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field, _fmttype)		\
 static u64 __hpp_get_acc_##_field(struct hist_entry *he)		\
 {									\
 	return he->stat_acc->_field;					\
@@ -1262,15 +1262,18 @@ hist_browser__hpp_color_##_type(struct perf_hpp_fmt *fmt,		\
 		return ret;						\
 	}								\
 	return hpp__fmt(fmt, hpp, he, __hpp_get_acc_##_field,		\
-			" %*.2f%%", __hpp__slsmg_color_printf, true);	\
+			" %*.2f%%", __hpp__slsmg_color_printf,		\
+			_fmttype);					\
 }
 
-__HPP_COLOR_PERCENT_FN(overhead, period)
-__HPP_COLOR_PERCENT_FN(overhead_sys, period_sys)
-__HPP_COLOR_PERCENT_FN(overhead_us, period_us)
-__HPP_COLOR_PERCENT_FN(overhead_guest_sys, period_guest_sys)
-__HPP_COLOR_PERCENT_FN(overhead_guest_us, period_guest_us)
-__HPP_COLOR_ACC_PERCENT_FN(overhead_acc, period)
+__HPP_COLOR_PERCENT_FN(overhead, period, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(wallclock, wallclock, PERF_HPP_FMT_TYPE__WALLCLOCK)
+__HPP_COLOR_PERCENT_FN(overhead_sys, period_sys, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_us, period_us, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_guest_sys, period_guest_sys, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_guest_us, period_guest_us, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_ACC_PERCENT_FN(overhead_acc, period, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_ACC_PERCENT_FN(wallclock_acc, wallclock, PERF_HPP_FMT_TYPE__WALLCLOCK)
 
 #undef __HPP_COLOR_PERCENT_FN
 #undef __HPP_COLOR_ACC_PERCENT_FN
@@ -1279,6 +1282,8 @@ void hist_browser__init_hpp(void)
 {
 	perf_hpp__format[PERF_HPP__OVERHEAD].color =
 				hist_browser__hpp_color_overhead;
+	perf_hpp__format[PERF_HPP__WALLCLOCK].color =
+				hist_browser__hpp_color_wallclock;
 	perf_hpp__format[PERF_HPP__OVERHEAD_SYS].color =
 				hist_browser__hpp_color_overhead_sys;
 	perf_hpp__format[PERF_HPP__OVERHEAD_US].color =
@@ -1289,6 +1294,8 @@ void hist_browser__init_hpp(void)
 				hist_browser__hpp_color_overhead_guest_us;
 	perf_hpp__format[PERF_HPP__OVERHEAD_ACC].color =
 				hist_browser__hpp_color_overhead_acc;
+	perf_hpp__format[PERF_HPP__WALLCLOCK_ACC].color =
+				hist_browser__hpp_color_wallclock_acc;
 
 	res_sample_init();
 }
diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
index 34fda1d5eccb4..c85ad841fb4ed 100644
--- a/tools/perf/ui/hist.c
+++ b/tools/perf/ui/hist.c
@@ -27,9 +27,10 @@ static int __hpp__fmt_print(struct perf_hpp *hpp, struct hists *hists, u64 val,
 			    int nr_samples, const char *fmt, int len,
 			    hpp_snprint_fn print_fn, enum perf_hpp_fmt_type fmtype)
 {
-	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT) {
+	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT || fmtype == PERF_HPP_FMT_TYPE__WALLCLOCK) {
 		double percent = 0.0;
-		u64 total = hists__total_period(hists);
+		u64 total = fmtype == PERF_HPP_FMT_TYPE__PERCENT ? hists__total_period(hists) :
+			hists__total_wallclock(hists);
 
 		if (total)
 			percent = 100.0 * val / total;
@@ -128,7 +129,7 @@ int hpp__fmt(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
 				  print_fn, fmtype);
 	}
 
-	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT)
+	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT || fmtype == PERF_HPP_FMT_TYPE__WALLCLOCK)
 		len -= 2; /* 2 for a space and a % sign */
 	else
 		len -= 1;
@@ -472,11 +473,13 @@ __HPP_ENTRY_AVERAGE_FN(_type, _field)					\
 __HPP_SORT_AVERAGE_FN(_type, _field)
 
 HPP_PERCENT_FNS(overhead, period)
+HPP_PERCENT_FNS(wallclock, wallclock)
 HPP_PERCENT_FNS(overhead_sys, period_sys)
 HPP_PERCENT_FNS(overhead_us, period_us)
 HPP_PERCENT_FNS(overhead_guest_sys, period_guest_sys)
 HPP_PERCENT_FNS(overhead_guest_us, period_guest_us)
 HPP_PERCENT_ACC_FNS(overhead_acc, period)
+HPP_PERCENT_ACC_FNS(wallclock_acc, wallclock)
 
 HPP_RAW_FNS(samples, nr_events)
 HPP_RAW_FNS(period, period)
@@ -548,11 +551,13 @@ static bool hpp__equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
 
 struct perf_hpp_fmt perf_hpp__format[] = {
 	HPP__COLOR_PRINT_FNS("Overhead", overhead, OVERHEAD),
+	HPP__COLOR_PRINT_FNS("Wallclock", wallclock, WALLCLOCK),
 	HPP__COLOR_PRINT_FNS("sys", overhead_sys, OVERHEAD_SYS),
 	HPP__COLOR_PRINT_FNS("usr", overhead_us, OVERHEAD_US),
 	HPP__COLOR_PRINT_FNS("guest sys", overhead_guest_sys, OVERHEAD_GUEST_SYS),
 	HPP__COLOR_PRINT_FNS("guest usr", overhead_guest_us, OVERHEAD_GUEST_US),
 	HPP__COLOR_ACC_PRINT_FNS("Children", overhead_acc, OVERHEAD_ACC),
+	HPP__COLOR_ACC_PRINT_FNS("Wall", wallclock_acc, WALLCLOCK_ACC),
 	HPP__PRINT_FNS("Samples", samples, SAMPLES),
 	HPP__PRINT_FNS("Period", period, PERIOD),
 	HPP__PRINT_FNS("Weight1", weight1, WEIGHT1),
@@ -601,9 +606,15 @@ static void fmt_free(struct perf_hpp_fmt *fmt)
 		fmt->free(fmt);
 }
 
+static bool fmt_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
+{
+	return a->equal && a->equal(a, b);
+}
+
 void perf_hpp__init(void)
 {
 	int i;
+	bool prefer_wallclock;
 
 	for (i = 0; i < PERF_HPP__MAX_INDEX; i++) {
 		struct perf_hpp_fmt *fmt = &perf_hpp__format[i];
@@ -621,12 +632,30 @@ void perf_hpp__init(void)
 	if (is_strict_order(field_order))
 		return;
 
+	/*
+	 * There are numerous ways how a user can setup a custom view with
+	 * --sort/fields flags, but we provide --sort=wallclock as a simple
+	 * preset for latency analysis and show wallclock before CPU overhead.
+	 */
+	prefer_wallclock = !symbol_conf.disable_wallclock && sort_order &&
+		!strncmp(sort_order, "wallclock", sizeof("wallclock")-1);
+
 	if (symbol_conf.cumulate_callchain) {
+		/* Use idempotent addition to avoid more complex logic. */
+		if (prefer_wallclock)
+			hpp_dimension__add_output(PERF_HPP__WALLCLOCK_ACC);
 		hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
+		if (!symbol_conf.disable_wallclock)
+			hpp_dimension__add_output(PERF_HPP__WALLCLOCK_ACC);
 		perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
+		perf_hpp__format[PERF_HPP__WALLCLOCK].name = "Wall";
 	}
 
+	if (prefer_wallclock)
+		hpp_dimension__add_output(PERF_HPP__WALLCLOCK);
 	hpp_dimension__add_output(PERF_HPP__OVERHEAD);
+	if (!symbol_conf.disable_wallclock)
+		hpp_dimension__add_output(PERF_HPP__WALLCLOCK);
 
 	if (symbol_conf.show_cpu_utilization) {
 		hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
@@ -671,28 +700,43 @@ static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
 
 void perf_hpp__cancel_cumulate(void)
 {
-	struct perf_hpp_fmt *fmt, *acc, *ovh, *tmp;
+	struct perf_hpp_fmt *fmt, *acc, *ovh, *wall, *acc_wall, *tmp;
 
 	if (is_strict_order(field_order))
 		return;
 
 	ovh = &perf_hpp__format[PERF_HPP__OVERHEAD];
 	acc = &perf_hpp__format[PERF_HPP__OVERHEAD_ACC];
+	wall = &perf_hpp__format[PERF_HPP__WALLCLOCK];
+	acc_wall = &perf_hpp__format[PERF_HPP__WALLCLOCK_ACC];
 
 	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
-		if (acc->equal(acc, fmt)) {
+		if (fmt_equal(acc, fmt) || fmt_equal(acc_wall, fmt)) {
 			perf_hpp__column_unregister(fmt);
 			continue;
 		}
 
-		if (ovh->equal(ovh, fmt))
+		if (fmt_equal(ovh, fmt))
 			fmt->name = "Overhead";
+		if (fmt_equal(wall, fmt))
+			fmt->name = "Wallclock";
 	}
 }
 
-static bool fmt_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
+void perf_hpp__cancel_wallclock(void)
 {
-	return a->equal && a->equal(a, b);
+	struct perf_hpp_fmt *fmt, *wall, *acc, *tmp;
+
+	if (is_strict_order(field_order))
+		return;
+
+	wall = &perf_hpp__format[PERF_HPP__WALLCLOCK];
+	acc = &perf_hpp__format[PERF_HPP__WALLCLOCK_ACC];
+
+	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
+		if (fmt_equal(wall, fmt) || fmt_equal(acc, fmt))
+			perf_hpp__column_unregister(fmt);
+	}
 }
 
 void perf_hpp__setup_output_field(struct perf_hpp_list *list)
@@ -819,6 +863,7 @@ void perf_hpp__reset_width(struct perf_hpp_fmt *fmt, struct hists *hists)
 
 	switch (fmt->idx) {
 	case PERF_HPP__OVERHEAD:
+	case PERF_HPP__WALLCLOCK:
 	case PERF_HPP__OVERHEAD_SYS:
 	case PERF_HPP__OVERHEAD_US:
 	case PERF_HPP__OVERHEAD_ACC:
diff --git a/tools/perf/util/addr_location.c b/tools/perf/util/addr_location.c
index 51825ef8c0ab7..007a2f5df9a6a 100644
--- a/tools/perf/util/addr_location.c
+++ b/tools/perf/util/addr_location.c
@@ -17,6 +17,7 @@ void addr_location__init(struct addr_location *al)
 	al->cpumode = 0;
 	al->cpu = 0;
 	al->socket = 0;
+	al->parallelism = 1;
 }
 
 /*
diff --git a/tools/perf/util/addr_location.h b/tools/perf/util/addr_location.h
index d8ac0428dff23..d8af6f467ae9d 100644
--- a/tools/perf/util/addr_location.h
+++ b/tools/perf/util/addr_location.h
@@ -17,10 +17,14 @@ struct addr_location {
 	const char    *srcline;
 	u64	      addr;
 	char	      level;
-	u8	      filtered;
+	u16	      filtered;
 	u8	      cpumode;
 	s32	      cpu;
 	s32	      socket;
+	/* Same as machine.parallelism but within [1, nr_cpus]. */
+	int	      parallelism;
+	/* See he_stat.wallclock. */
+	u64	      wallclock;
 };
 
 void addr_location__init(struct addr_location *al);
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d19170..1c50af20b3a4c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -767,6 +767,12 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
 			al->socket = env->cpu[al->cpu].socket_id;
 	}
 
+	/* Account for possible out-of-order switch events. */
+	al->parallelism = min(max(machine->parallelism, 1), machine__nr_cpus_avail(machine));
+	if (test_bit(al->parallelism, symbol_conf.parallelism_filter))
+		al->filtered |= (1 << HIST_FILTER__PARALLELISM);
+	al->wallclock = sample->period * 1000 / al->parallelism;
+
 	if (al->map) {
 		if (symbol_conf.dso_list &&
 		    (!dso || !(strlist__has_entry(symbol_conf.dso_list,
diff --git a/tools/perf/util/events_stats.h b/tools/perf/util/events_stats.h
index eabd7913c3092..c32c99c16ad39 100644
--- a/tools/perf/util/events_stats.h
+++ b/tools/perf/util/events_stats.h
@@ -57,6 +57,8 @@ struct events_stats {
 struct hists_stats {
 	u64 total_period;
 	u64 total_non_filtered_period;
+	u64 total_wallclock;
+	u64 total_non_filtered_wallclock;
 	u32 nr_samples;
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_samples;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index b99d89b1a152d..fe690e9aece45 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -43,6 +43,8 @@ static bool hists__filter_entry_by_symbol(struct hists *hists,
 					  struct hist_entry *he);
 static bool hists__filter_entry_by_socket(struct hists *hists,
 					  struct hist_entry *he);
+static bool hists__filter_entry_by_parallelism(struct hists *hists,
+					       struct hist_entry *he);
 
 u16 hists__col_len(struct hists *hists, enum hist_column col)
 {
@@ -207,6 +209,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 
 	hists__new_col_len(hists, HISTC_CGROUP, 6);
 	hists__new_col_len(hists, HISTC_CGROUP_ID, 20);
+	hists__new_col_len(hists, HISTC_PARALLELISM, 11);
 	hists__new_col_len(hists, HISTC_CPU, 3);
 	hists__new_col_len(hists, HISTC_SOCKET, 6);
 	hists__new_col_len(hists, HISTC_MEM_LOCKED, 6);
@@ -302,9 +305,10 @@ static long hist_time(unsigned long htime)
 	return htime;
 }
 
-static void he_stat__add_period(struct he_stat *he_stat, u64 period)
+static void he_stat__add_period(struct he_stat *he_stat, u64 period, u64 wallclock)
 {
 	he_stat->period		+= period;
+	he_stat->wallclock	+= wallclock;
 	he_stat->nr_events	+= 1;
 }
 
@@ -319,6 +323,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
 	dest->weight2		+= src->weight2;
 	dest->weight3		+= src->weight3;
 	dest->nr_events		+= src->nr_events;
+	dest->wallclock		+= src->wallclock;
 }
 
 static void he_stat__decay(struct he_stat *he_stat)
@@ -328,6 +333,7 @@ static void he_stat__decay(struct he_stat *he_stat)
 	he_stat->weight1 = (he_stat->weight1 * 7) / 8;
 	he_stat->weight2 = (he_stat->weight2 * 7) / 8;
 	he_stat->weight3 = (he_stat->weight3 * 7) / 8;
+	he_stat->wallclock = (he_stat->wallclock * 7) / 8;
 }
 
 static void hists__delete_entry(struct hists *hists, struct hist_entry *he);
@@ -335,7 +341,7 @@ static void hists__delete_entry(struct hists *hists, struct hist_entry *he);
 static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
 {
 	u64 prev_period = he->stat.period;
-	u64 diff;
+	u64 prev_wallclock = he->stat.wallclock;
 
 	if (prev_period == 0)
 		return true;
@@ -345,12 +351,16 @@ static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
 		he_stat__decay(he->stat_acc);
 	decay_callchain(he->callchain);
 
-	diff = prev_period - he->stat.period;
-
 	if (!he->depth) {
-		hists->stats.total_period -= diff;
-		if (!he->filtered)
-			hists->stats.total_non_filtered_period -= diff;
+		u64 period_diff = prev_period - he->stat.period;
+		u64 wallclock_diff = prev_wallclock - he->stat.wallclock;
+
+		hists->stats.total_period -= period_diff;
+		hists->stats.total_wallclock -= wallclock_diff;
+		if (!he->filtered) {
+			hists->stats.total_non_filtered_period -= period_diff;
+			hists->stats.total_non_filtered_wallclock -= wallclock_diff;
+		}
 	}
 
 	if (!he->leaf) {
@@ -365,7 +375,7 @@ static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
 		}
 	}
 
-	return he->stat.period == 0;
+	return he->stat.period == 0 && he->stat.wallclock == 0;
 }
 
 static void hists__delete_entry(struct hists *hists, struct hist_entry *he)
@@ -584,21 +594,24 @@ static struct hist_entry *hist_entry__new(struct hist_entry *template,
 	return he;
 }
 
-static u8 symbol__parent_filter(const struct symbol *parent)
+static u16 symbol__parent_filter(const struct symbol *parent)
 {
 	if (symbol_conf.exclude_other && parent == NULL)
 		return 1 << HIST_FILTER__PARENT;
 	return 0;
 }
 
-static void hist_entry__add_callchain_period(struct hist_entry *he, u64 period)
+static void hist_entry__add_callchain_period(struct hist_entry *he, u64 period, u64 wallclock)
 {
 	if (!hist_entry__has_callchains(he) || !symbol_conf.use_callchain)
 		return;
 
 	he->hists->callchain_period += period;
-	if (!he->filtered)
+	he->hists->callchain_wallclock += wallclock;
+	if (!he->filtered) {
 		he->hists->callchain_non_filtered_period += period;
+		he->hists->callchain_non_filtered_wallclock += wallclock;
+	}
 }
 
 static struct hist_entry *hists__findnew_entry(struct hists *hists,
@@ -611,6 +624,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 	struct hist_entry *he;
 	int64_t cmp;
 	u64 period = entry->stat.period;
+	u64 wallclock = entry->stat.wallclock;
 	bool leftmost = true;
 
 	p = &hists->entries_in->rb_root.rb_node;
@@ -629,10 +643,10 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 		if (!cmp) {
 			if (sample_self) {
 				he_stat__add_stat(&he->stat, &entry->stat);
-				hist_entry__add_callchain_period(he, period);
+				hist_entry__add_callchain_period(he, period, wallclock);
 			}
 			if (symbol_conf.cumulate_callchain)
-				he_stat__add_period(he->stat_acc, period);
+				he_stat__add_period(he->stat_acc, period, wallclock);
 
 			block_info__delete(entry->block_info);
 
@@ -669,7 +683,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 		return NULL;
 
 	if (sample_self)
-		hist_entry__add_callchain_period(he, period);
+		hist_entry__add_callchain_period(he, period, wallclock);
 	hists->nr_entries++;
 
 	rb_link_node(&he->rb_node_in, parent, p);
@@ -741,12 +755,14 @@ __hists__add_entry(struct hists *hists,
 		.ip	 = al->addr,
 		.level	 = al->level,
 		.code_page_size = sample->code_page_size,
+		.parallelism	= al->parallelism,
 		.stat = {
 			.nr_events = 1,
 			.period	= sample->period,
 			.weight1 = sample->weight,
 			.weight2 = sample->ins_lat,
 			.weight3 = sample->p_stage_cyc,
+			.wallclock = al->wallclock,
 		},
 		.parent = sym_parent,
 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
@@ -1456,6 +1472,10 @@ static void hist_entry__check_and_remove_filter(struct hist_entry *he,
 		if (symbol_conf.sym_list == NULL)
 			return;
 		break;
+	case HIST_FILTER__PARALLELISM:
+		if (__bitmap_weight(symbol_conf.parallelism_filter, MAX_NR_CPUS + 1) == 0)
+			return;
+		break;
 	case HIST_FILTER__PARENT:
 	case HIST_FILTER__GUEST:
 	case HIST_FILTER__HOST:
@@ -1514,6 +1534,9 @@ static void hist_entry__apply_hierarchy_filters(struct hist_entry *he)
 	hist_entry__check_and_remove_filter(he, HIST_FILTER__SYMBOL,
 					    perf_hpp__is_sym_entry);
 
+	hist_entry__check_and_remove_filter(he, HIST_FILTER__PARALLELISM,
+					    perf_hpp__is_parallelism_entry);
+
 	hists__apply_filters(he->hists, he);
 }
 
@@ -1710,6 +1733,7 @@ static void hists__apply_filters(struct hists *hists, struct hist_entry *he)
 	hists__filter_entry_by_thread(hists, he);
 	hists__filter_entry_by_symbol(hists, he);
 	hists__filter_entry_by_socket(hists, he);
+	hists__filter_entry_by_parallelism(hists, he);
 }
 
 int hists__collapse_resort(struct hists *hists, struct ui_progress *prog)
@@ -1757,12 +1781,14 @@ static void hists__reset_filter_stats(struct hists *hists)
 {
 	hists->nr_non_filtered_entries = 0;
 	hists->stats.total_non_filtered_period = 0;
+	hists->stats.total_non_filtered_wallclock = 0;
 }
 
 void hists__reset_stats(struct hists *hists)
 {
 	hists->nr_entries = 0;
 	hists->stats.total_period = 0;
+	hists->stats.total_wallclock = 0;
 
 	hists__reset_filter_stats(hists);
 }
@@ -1771,6 +1797,7 @@ static void hists__inc_filter_stats(struct hists *hists, struct hist_entry *h)
 {
 	hists->nr_non_filtered_entries++;
 	hists->stats.total_non_filtered_period += h->stat.period;
+	hists->stats.total_non_filtered_wallclock += h->stat.wallclock;
 }
 
 void hists__inc_stats(struct hists *hists, struct hist_entry *h)
@@ -1780,6 +1807,7 @@ void hists__inc_stats(struct hists *hists, struct hist_entry *h)
 
 	hists->nr_entries++;
 	hists->stats.total_period += h->stat.period;
+	hists->stats.total_wallclock += h->stat.wallclock;
 }
 
 static void hierarchy_recalc_total_periods(struct hists *hists)
@@ -1791,6 +1819,8 @@ static void hierarchy_recalc_total_periods(struct hists *hists)
 
 	hists->stats.total_period = 0;
 	hists->stats.total_non_filtered_period = 0;
+	hists->stats.total_wallclock = 0;
+	hists->stats.total_non_filtered_wallclock = 0;
 
 	/*
 	 * recalculate total period using top-level entries only
@@ -1802,8 +1832,11 @@ static void hierarchy_recalc_total_periods(struct hists *hists)
 		node = rb_next(node);
 
 		hists->stats.total_period += he->stat.period;
-		if (!he->filtered)
+		hists->stats.total_wallclock += he->stat.wallclock;
+		if (!he->filtered) {
 			hists->stats.total_non_filtered_period += he->stat.period;
+			hists->stats.total_non_filtered_wallclock += he->stat.wallclock;
+		}
 	}
 }
 
@@ -2196,6 +2229,17 @@ static bool hists__filter_entry_by_socket(struct hists *hists,
 	return false;
 }
 
+static bool hists__filter_entry_by_parallelism(struct hists *hists,
+					       struct hist_entry *he)
+{
+	if (test_bit(he->parallelism, hists->parallelism_filter)) {
+		he->filtered |= (1 << HIST_FILTER__PARALLELISM);
+		return true;
+	}
+
+	return false;
+}
+
 typedef bool (*filter_fn_t)(struct hists *hists, struct hist_entry *he);
 
 static void hists__filter_by_type(struct hists *hists, int type, filter_fn_t filter)
@@ -2365,6 +2409,16 @@ void hists__filter_by_socket(struct hists *hists)
 				      hists__filter_entry_by_socket);
 }
 
+void hists__filter_by_parallelism(struct hists *hists)
+{
+	if (symbol_conf.report_hierarchy)
+		hists__filter_hierarchy(hists, HIST_FILTER__PARALLELISM,
+					hists->parallelism_filter);
+	else
+		hists__filter_by_type(hists, HIST_FILTER__PARALLELISM,
+				      hists__filter_entry_by_parallelism);
+}
+
 void events_stats__inc(struct events_stats *stats, u32 type)
 {
 	++stats->nr_events[0];
@@ -2760,6 +2814,12 @@ u64 hists__total_period(struct hists *hists)
 		hists->stats.total_period;
 }
 
+u64 hists__total_wallclock(struct hists *hists)
+{
+	return symbol_conf.filter_relative ? hists->stats.total_non_filtered_wallclock :
+		hists->stats.total_wallclock;
+}
+
 int __hists__scnprintf_title(struct hists *hists, char *bf, size_t size, bool show_freq)
 {
 	char unit;
@@ -2871,6 +2931,7 @@ int __hists__init(struct hists *hists, struct perf_hpp_list *hpp_list)
 	hists->entries = RB_ROOT_CACHED;
 	mutex_init(&hists->lock);
 	hists->socket_filter = -1;
+	hists->parallelism_filter = symbol_conf.parallelism_filter;
 	hists->hpp_list = hpp_list;
 	INIT_LIST_HEAD(&hists->hpp_formats);
 	return 0;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 6cff03eb043b7..db9c05b810bd7 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -31,6 +31,7 @@ enum hist_filter {
 	HIST_FILTER__HOST,
 	HIST_FILTER__SOCKET,
 	HIST_FILTER__C2C,
+	HIST_FILTER__PARALLELISM,
 };
 
 enum hist_column {
@@ -42,6 +43,7 @@ enum hist_column {
 	HISTC_CGROUP_ID,
 	HISTC_CGROUP,
 	HISTC_PARENT,
+	HISTC_PARALLELISM,
 	HISTC_CPU,
 	HISTC_SOCKET,
 	HISTC_SRCLINE,
@@ -105,10 +107,13 @@ struct hists {
 	u64			nr_non_filtered_entries;
 	u64			callchain_period;
 	u64			callchain_non_filtered_period;
+	u64			callchain_wallclock;
+	u64			callchain_non_filtered_wallclock;
 	struct thread		*thread_filter;
 	const struct dso	*dso_filter;
 	const char		*uid_filter_str;
 	const char		*symbol_filter_str;
+	unsigned long		*parallelism_filter;
 	struct mutex		lock;
 	struct hists_stats	stats;
 	u64			event_stream;
@@ -165,6 +170,13 @@ struct res_sample {
 
 struct he_stat {
 	u64			period;
+	/*
+	 * Overhead re-scaled from CPU time to wallclock time (period divided
+	 * by the parallelism at the time of the sample). It's multipled by
+	 * some const to avoid precision loss or dealing with floats.
+	 * It does not matter otherwise since we only print it as percents.
+	 */
+	u64			wallclock;
 	u64			period_sys;
 	u64			period_us;
 	u64			period_guest_sys;
@@ -228,6 +240,7 @@ struct hist_entry {
 	u64			transaction;
 	s32			socket;
 	s32			cpu;
+	int			parallelism;
 	u64			code_page_size;
 	u64			weight;
 	u64			ins_lat;
@@ -242,7 +255,7 @@ struct hist_entry {
 	bool			leaf;
 
 	char			level;
-	u8			filtered;
+	u16			filtered;
 
 	u16			callchain_size;
 	union {
@@ -368,6 +381,7 @@ void hists__output_recalc_col_len(struct hists *hists, int max_rows);
 struct hist_entry *hists__get_entry(struct hists *hists, int idx);
 
 u64 hists__total_period(struct hists *hists);
+u64 hists__total_wallclock(struct hists *hists);
 void hists__reset_stats(struct hists *hists);
 void hists__inc_stats(struct hists *hists, struct hist_entry *h);
 void hists__inc_nr_events(struct hists *hists);
@@ -384,11 +398,13 @@ void hists__filter_by_dso(struct hists *hists);
 void hists__filter_by_thread(struct hists *hists);
 void hists__filter_by_symbol(struct hists *hists);
 void hists__filter_by_socket(struct hists *hists);
+void hists__filter_by_parallelism(struct hists *hists);
 
 static inline bool hists__has_filter(struct hists *hists)
 {
 	return hists->thread_filter || hists->dso_filter ||
-		hists->symbol_filter_str || (hists->socket_filter > -1);
+		hists->symbol_filter_str || (hists->socket_filter > -1) ||
+		hists->parallelism_filter;
 }
 
 u16 hists__col_len(struct hists *hists, enum hist_column col);
@@ -547,11 +563,13 @@ extern struct perf_hpp_fmt perf_hpp__format[];
 enum {
 	/* Matches perf_hpp__format array. */
 	PERF_HPP__OVERHEAD,
+	PERF_HPP__WALLCLOCK,
 	PERF_HPP__OVERHEAD_SYS,
 	PERF_HPP__OVERHEAD_US,
 	PERF_HPP__OVERHEAD_GUEST_SYS,
 	PERF_HPP__OVERHEAD_GUEST_US,
 	PERF_HPP__OVERHEAD_ACC,
+	PERF_HPP__WALLCLOCK_ACC,
 	PERF_HPP__SAMPLES,
 	PERF_HPP__PERIOD,
 	PERF_HPP__WEIGHT1,
@@ -563,6 +581,7 @@ enum {
 
 void perf_hpp__init(void);
 void perf_hpp__cancel_cumulate(void);
+void perf_hpp__cancel_wallclock(void);
 void perf_hpp__setup_output_field(struct perf_hpp_list *list);
 void perf_hpp__reset_output_field(struct perf_hpp_list *list);
 void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
@@ -580,6 +599,7 @@ bool perf_hpp__is_thread_entry(struct perf_hpp_fmt *fmt);
 bool perf_hpp__is_comm_entry(struct perf_hpp_fmt *fmt);
 bool perf_hpp__is_dso_entry(struct perf_hpp_fmt *fmt);
 bool perf_hpp__is_sym_entry(struct perf_hpp_fmt *fmt);
+bool perf_hpp__is_parallelism_entry(struct perf_hpp_fmt *fmt);
 
 struct perf_hpp_fmt *perf_hpp_fmt__dup(struct perf_hpp_fmt *fmt);
 
@@ -606,6 +626,7 @@ void hists__reset_column_width(struct hists *hists);
 enum perf_hpp_fmt_type {
 	PERF_HPP_FMT_TYPE__RAW,
 	PERF_HPP_FMT_TYPE__PERCENT,
+	PERF_HPP_FMT_TYPE__WALLCLOCK,
 	PERF_HPP_FMT_TYPE__AVERAGE,
 };
 
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 27d5345d2b307..956581bfeb5b1 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -94,6 +94,8 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
 	machine->comm_exec = false;
 	machine->kernel_start = 0;
 	machine->vmlinux_map = NULL;
+	/* There is no initial context switch in, so we start at 1. */
+	machine->parallelism = 1;
 
 	machine->root_dir = strdup(root_dir);
 	if (machine->root_dir == NULL)
@@ -677,8 +679,11 @@ int machine__process_aux_output_hw_id_event(struct machine *machine __maybe_unus
 int machine__process_switch_event(struct machine *machine __maybe_unused,
 				  union perf_event *event)
 {
+	bool out = event->header.misc & PERF_RECORD_MISC_SWITCH_OUT;
+
 	if (dump_trace)
 		perf_event__fprintf_switch(event, stdout);
+	machine->parallelism += out ? -1 : 1;
 	return 0;
 }
 
@@ -1898,6 +1903,8 @@ int machine__process_exit_event(struct machine *machine, union perf_event *event
 	if (dump_trace)
 		perf_event__fprintf_task(event, stdout);
 
+	/* There is no context switch out before exit, so we decrement here. */
+	machine->parallelism--;
 	if (thread != NULL) {
 		if (symbol_conf.keep_exited_threads)
 			thread__set_exited(thread, /*exited=*/true);
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 2e5a4cb342d82..a5f28652873ce 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -50,6 +50,12 @@ struct machine {
 		u64	  text_start;
 		u64	  text_end;
 	} sched, lock, traceiter, trace;
+	/*
+	 * The current parallelism level (number of threads that run on CPUs).
+	 * This value can be less than 1, or larger than the total number
+	 * of CPUs, if events are poorly ordered.
+	 */
+	int		  parallelism;
 	pid_t		  *current_tid;
 	size_t		  current_tid_sz;
 	union { /* Tool specific area */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 507e6cba95458..19db4bb315aa5 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2402,6 +2402,18 @@ bool perf_session__has_traces(struct perf_session *session, const char *msg)
 	return false;
 }
 
+bool perf_session__has_switch_events(struct perf_session *session)
+{
+	struct evsel *evsel;
+
+	evlist__for_each_entry(session->evlist, evsel) {
+		if (evsel->core.attr.context_switch)
+			return true;
+	}
+
+	return false;
+}
+
 int map__set_kallsyms_ref_reloc_sym(struct map *map, const char *symbol_name, u64 addr)
 {
 	char *bracket;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index bcf1bcf06959b..db1c120a9e67f 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -141,6 +141,7 @@ int perf_session__resolve_callchain(struct perf_session *session,
 				    struct symbol **parent);
 
 bool perf_session__has_traces(struct perf_session *session, const char *msg);
+bool perf_session__has_switch_events(struct perf_session *session);
 
 void perf_event__attr_swap(struct perf_event_attr *attr);
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 9dd60c7869a28..01e8c54d93464 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -892,6 +892,38 @@ struct sort_entry sort_cpu = {
 	.se_width_idx	= HISTC_CPU,
 };
 
+/* --sort parallelism */
+
+static int64_t
+sort__parallelism_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return right->parallelism - left->parallelism;
+}
+
+static int hist_entry__parallelism_filter(struct hist_entry *he, int type, const void *arg)
+{
+	const unsigned long *parallelism_filter = arg;
+
+	if (type != HIST_FILTER__PARALLELISM)
+		return -1;
+
+	return test_bit(he->parallelism, parallelism_filter);
+}
+
+static int hist_entry__parallelism_snprintf(struct hist_entry *he, char *bf,
+				    size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%*d", width, he->parallelism);
+}
+
+struct sort_entry sort_parallelism = {
+	.se_header      = "Parallelism",
+	.se_cmp	        = sort__parallelism_cmp,
+	.se_filter	= hist_entry__parallelism_filter,
+	.se_snprintf    = hist_entry__parallelism_snprintf,
+	.se_width_idx	= HISTC_PARALLELISM,
+};
+
 /* --sort cgroup_id */
 
 static int64_t _sort__cgroup_dev_cmp(u64 left_dev, u64 right_dev)
@@ -2532,6 +2564,7 @@ static struct sort_dimension common_sort_dimensions[] = {
 	DIM(SORT_ANNOTATE_DATA_TYPE_OFFSET, "typeoff", sort_type_offset),
 	DIM(SORT_SYM_OFFSET, "symoff", sort_sym_offset),
 	DIM(SORT_ANNOTATE_DATA_TYPE_CACHELINE, "typecln", sort_type_cacheline),
+	DIM(SORT_PARALLELISM, "parallelism", sort_parallelism),
 };
 
 #undef DIM
@@ -2593,11 +2626,13 @@ struct hpp_dimension {
 
 static struct hpp_dimension hpp_sort_dimensions[] = {
 	DIM(PERF_HPP__OVERHEAD, "overhead"),
+	DIM(PERF_HPP__WALLCLOCK, "wallclock"),
 	DIM(PERF_HPP__OVERHEAD_SYS, "overhead_sys"),
 	DIM(PERF_HPP__OVERHEAD_US, "overhead_us"),
 	DIM(PERF_HPP__OVERHEAD_GUEST_SYS, "overhead_guest_sys"),
 	DIM(PERF_HPP__OVERHEAD_GUEST_US, "overhead_guest_us"),
 	DIM(PERF_HPP__OVERHEAD_ACC, "overhead_children"),
+	DIM(PERF_HPP__WALLCLOCK_ACC, "wallclock_children"),
 	DIM(PERF_HPP__SAMPLES, "sample"),
 	DIM(PERF_HPP__PERIOD, "period"),
 	DIM(PERF_HPP__WEIGHT1, "weight1"),
@@ -2733,6 +2768,7 @@ MK_SORT_ENTRY_CHK(thread)
 MK_SORT_ENTRY_CHK(comm)
 MK_SORT_ENTRY_CHK(dso)
 MK_SORT_ENTRY_CHK(sym)
+MK_SORT_ENTRY_CHK(parallelism)
 
 
 static bool __sort__hpp_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
@@ -3715,6 +3751,21 @@ static int setup_sort_order(struct evlist *evlist)
 {
 	char *new_sort_order;
 
+	/*
+	 * Append the default sort order to "wallclock".
+	 * Just "wallclock" is not useful on its own, and there is no easy way
+	 * to express it otherwise.
+	 */
+	if (sort_order && !strcmp(sort_order, "wallclock")) {
+		if (asprintf(&new_sort_order, "%s,%s",
+			     sort_order, get_default_sort_order(evlist)) < 0) {
+			pr_err("Not enough memory to set up --sort");
+			return -ENOMEM;
+		}
+		sort_order = new_sort_order;
+		return 0;
+	}
+
 	/*
 	 * Append '+'-prefixed sort order to the default sort
 	 * order string.
@@ -3764,10 +3815,14 @@ static char *setup_overhead(char *keys)
 	if (sort__mode == SORT_MODE__DIFF)
 		return keys;
 
-	keys = prefix_if_not_in("overhead", keys);
+	if (!strstr(keys, "wallclock"))
+		keys = prefix_if_not_in("overhead", keys);
 
-	if (symbol_conf.cumulate_callchain)
+	if (symbol_conf.cumulate_callchain) {
+		if (!symbol_conf.disable_wallclock)
+			keys = prefix_if_not_in("wallclock_children", keys);
 		keys = prefix_if_not_in("overhead_children", keys);
+	}
 
 	return keys;
 }
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index a8572574e1686..11fb15f914093 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -72,6 +72,7 @@ enum sort_type {
 	SORT_ANNOTATE_DATA_TYPE_OFFSET,
 	SORT_SYM_OFFSET,
 	SORT_ANNOTATE_DATA_TYPE_CACHELINE,
+	SORT_PARALLELISM,
 
 	/* branch stack specific sort keys */
 	__SORT_BRANCH_STACK,
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 0037f11639195..38500609ea6ed 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -18,6 +18,7 @@
 #include "annotate.h"
 #include "build-id.h"
 #include "cap.h"
+#include "cpumap.h"
 #include "dso.h"
 #include "util.h" // lsdir()
 #include "debug.h"
@@ -2464,6 +2465,36 @@ int symbol__annotation_init(void)
 	return 0;
 }
 
+static int setup_parallelism_bitmap(void)
+{
+	struct perf_cpu_map *map;
+	struct perf_cpu cpu;
+	int i, err = -1;
+
+	if (symbol_conf.parallelism_list_str == NULL)
+		return 0;
+
+	map = perf_cpu_map__new(symbol_conf.parallelism_list_str);
+	if (map == NULL) {
+		pr_err("failed to parse parallelism filter list\n");
+		return -1;
+	}
+
+	bitmap_fill(symbol_conf.parallelism_filter, MAX_NR_CPUS + 1);
+	perf_cpu_map__for_each_cpu(cpu, i, map) {
+		if (cpu.cpu <= 0 || cpu.cpu > MAX_NR_CPUS) {
+			pr_err("Requested parallelism level %d is invalid.\n", cpu.cpu);
+			goto out_delete_map;
+		}
+		__clear_bit(cpu.cpu, symbol_conf.parallelism_filter);
+	}
+
+	err = 0;
+out_delete_map:
+	perf_cpu_map__put(map);
+	return err;
+}
+
 int symbol__init(struct perf_env *env)
 {
 	const char *symfs;
@@ -2483,6 +2514,9 @@ int symbol__init(struct perf_env *env)
 		return -1;
 	}
 
+	if (setup_parallelism_bitmap())
+		return -1;
+
 	if (setup_list(&symbol_conf.dso_list,
 		       symbol_conf.dso_list_str, "dso") < 0)
 		return -1;
diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
index a9c51acc722fe..7f6d6a1c773f1 100644
--- a/tools/perf/util/symbol_conf.h
+++ b/tools/perf/util/symbol_conf.h
@@ -3,6 +3,8 @@
 #define __PERF_SYMBOL_CONF 1
 
 #include <stdbool.h>
+#include <linux/bitmap.h>
+#include "perf.h"
 
 struct strlist;
 struct intlist;
@@ -47,7 +49,8 @@ struct symbol_conf {
 			keep_exited_threads,
 			annotate_data_member,
 			annotate_data_sample,
-			skip_empty;
+			skip_empty,
+			disable_wallclock;
 	const char	*vmlinux_name,
 			*kallsyms_name,
 			*source_prefix,
@@ -62,6 +65,7 @@ struct symbol_conf {
 			*pid_list_str,
 			*tid_list_str,
 			*sym_list_str,
+			*parallelism_list_str,
 			*col_width_list_str,
 			*bt_stop_list_str;
 	const char		*addr2line_path;
@@ -82,6 +86,7 @@ struct symbol_conf {
 	int		pad_output_len_dso;
 	int		group_sort_idx;
 	int		addr_range;
+	DECLARE_BITMAP(parallelism_filter, MAX_NR_CPUS + 1);
 };
 
 extern struct symbol_conf symbol_conf;

base-commit: 660d2400e5bf865ff2e2d8413284d0d3248cbf25
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] tools/perf: Add wall-clock and parallelism profiling
  2025-01-08  8:24 [PATCH] tools/perf: Add wall-clock and parallelism profiling Dmitry Vyukov
@ 2025-01-08  8:34 ` Dmitry Vyukov
  2025-01-13 12:25   ` Dmitry Vyukov
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-08  8:34 UTC (permalink / raw)
  To: namhyung, irogers; +Cc: linux-perf-users, linux-kernel

On Wed, 8 Jan 2025 at 09:24, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. Each second of wall-clock time we have
> number-of-cores seconds of CPU time.
>
> Currently perf only allows to profile CPU time. Perf (and all other
> existing profilers to the best of my knowledge) does not allow profile
> wall-clock time.
>
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where wall-clock profile is needed:
>  - optimzing build latency
>  - optimizing server request latency
>  - optimizing ML training/inference latency
>  - optimizing running time of any command line program
>
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).
>
> This patch adds wall-clock and parallelization profiling.
> See the added documentation and flags descriptions for details.
>
> Brief outline of the implementation:
>  - add context switch collection during record
>  - calculate number of threads running on CPUs (parallelism level)
>    during report
>  - divide each sample weight by the parallelism level
> This effectively models that we were taking 1 sample per unit of
> wall-clock time.
>
> The feature is added on an equal footing with the existing CPU profiling
> rather than a separate mode enabled with special flags. The reasoning is
> that users may not understand the problem and the meaning of numbers they
> are seeing in the first place, so won't even realize that they may need
> to be looking for some different profiling mode. When they are presented
> with 2 sets of different numbers, they should start asking questions.

Hi folks,

Am I missing something and this is possible/known already?

I understand this is a large change, and I am open to comments.
I've also uploaded it to gerrit if you prefer to review there:
https://linux-review.git.corp.google.com/c/linux/kernel/git/torvalds/linux/+/25608

You may also checkout that branch and try it locally. It works on older kernels.

What of this is testable within the current testing framework?
Also how do I run tests? I failed to figure it out.

Btw, the profile example in the docs is from a real kernel build on my machine.
You can see how misleading the current profile is wrt latency.

Or you can see what takes time in the perf make itself.
(despite -j128, 73% of time was spent with 1 running thread,
only a few percent of time was spent with high parallelism).

  Wallclock  Overhead           Parallelism / Command
-    73.64%     6.96%           1
   +    28.53%     2.70%           cc1
   +    17.93%     1.69%           python3
   +    10.79%     1.02%           ld
-     7.49%     1.42%           2
   +     4.26%     0.81%           cc1
   +     0.72%     0.14%           ld
   +     0.68%     0.13%           cc1plus
...
-     1.33%    15.74%           125
   +     1.23%    14.50%           cc1
   +     0.03%     0.33%           gcc
   +     0.03%     0.32%           sh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] tools/perf: Add wall-clock and parallelism profiling
  2025-01-08  8:34 ` Dmitry Vyukov
@ 2025-01-13 12:25   ` Dmitry Vyukov
  2025-01-13 13:40     ` [PATCH v2] perf report: " Dmitry Vyukov
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-13 12:25 UTC (permalink / raw)
  To: namhyung, irogers; +Cc: linux-perf-users, linux-kernel

On Wed, 8 Jan 2025 at 09:34, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Wed, 8 Jan 2025 at 09:24, Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
> >
> > Currently perf only allows to profile CPU time. Perf (and all other
> > existing profilers to the best of my knowledge) does not allow profile
> > wall-clock time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where wall-clock profile is needed:
> >  - optimzing build latency
> >  - optimizing server request latency
> >  - optimizing ML training/inference latency
> >  - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
> >
> > This patch adds wall-clock and parallelization profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> >  - add context switch collection during record
> >  - calculate number of threads running on CPUs (parallelism level)
> >    during report
> >  - divide each sample weight by the parallelism level
> > This effectively models that we were taking 1 sample per unit of
> > wall-clock time.
> >
> > The feature is added on an equal footing with the existing CPU profiling
> > rather than a separate mode enabled with special flags. The reasoning is
> > that users may not understand the problem and the meaning of numbers they
> > are seeing in the first place, so won't even realize that they may need
> > to be looking for some different profiling mode. When they are presented
> > with 2 sets of different numbers, they should start asking questions.
>
> Hi folks,
>
> Am I missing something and this is possible/known already?
>
> I understand this is a large change, and I am open to comments.
> I've also uploaded it to gerrit if you prefer to review there:
> https://linux-review.git.corp.google.com/c/linux/kernel/git/torvalds/linux/+/25608
>
> You may also checkout that branch and try it locally. It works on older kernels.
>
> What of this is testable within the current testing framework?
> Also how do I run tests? I failed to figure it out.
>
> Btw, the profile example in the docs is from a real kernel build on my machine.
> You can see how misleading the current profile is wrt latency.
>
> Or you can see what takes time in the perf make itself.
> (despite -j128, 73% of time was spent with 1 running thread,
> only a few percent of time was spent with high parallelism).
>
>   Wallclock  Overhead           Parallelism / Command
> -    73.64%     6.96%           1
>    +    28.53%     2.70%           cc1
>    +    17.93%     1.69%           python3
>    +    10.79%     1.02%           ld
> -     7.49%     1.42%           2
>    +     4.26%     0.81%           cc1
>    +     0.72%     0.14%           ld
>    +     0.68%     0.13%           cc1plus
> ...
> -     1.33%    15.74%           125
>    +     1.23%    14.50%           cc1
>    +     0.03%     0.33%           gcc
>    +     0.03%     0.32%           sh


> [PATCH] tools/perf: Add wall-clock and parallelism profiling

Note to myself: need to change the subject to "perf report:".

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-13 12:25   ` Dmitry Vyukov
@ 2025-01-13 13:40     ` Dmitry Vyukov
  2025-01-14  1:51       ` Namhyung Kim
  2025-01-24 19:02       ` [PATCH v2] perf report: Add wall-clock and parallelism profiling Namhyung Kim
  0 siblings, 2 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-13 13:40 UTC (permalink / raw)
  To: namhyung, irogers; +Cc: Dmitry Vyukov, linux-perf-users, linux-kernel

There are two notions of time: wall-clock time and CPU time.
For a single-threaded program, or a program running on a single-core
machine, these notions are the same. However, for a multi-threaded/
multi-process program running on a multi-core machine, these notions are
significantly different. Each second of wall-clock time we have
number-of-cores seconds of CPU time.

Currently perf only allows to profile CPU time. Perf (and all other
existing profilers to the be best of my knowledge) does not allow to
profile wall-clock time.

Optimizing CPU overhead is useful to improve 'throughput', while
optimizing wall-clock overhead is useful to improve 'latency'.
These profiles are complementary and are not interchangeable.
Examples of where wall-clock profile is needed:
 - optimzing build latency
 - optimizing server request latency
 - optimizing ML training/inference latency
 - optimizing running time of any command line program

CPU profile is useless for these use cases at best (if a user understands
the difference), or misleading at worst (if a user tries to use a wrong
profile for a job).

This patch adds wall-clock and parallelization profiling.
See the added documentation and flags descriptions for details.

Brief outline of the implementation:
 - add context switch collection during record
 - calculate number of threads running on CPUs (parallelism level)
   during report
 - divide each sample weight by the parallelism level
This effectively models that we were taking 1 sample per unit of
wall-clock time.

The feature is added on an equal footing with the existing CPU profiling
rather than a separate mode enabled with special flags. The reasoning is
that users may not understand the problem and the meaning of numbers they
are seeing in the first place, so won't even realize that they may need
to be looking for some different profiling mode. When they are presented
with 2 sets of different numbers, they should start asking questions.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

---
Changes in v2:
 - just rebased and changed subject
---
 .../callchain-overhead-calculation.txt        |  5 +-
 .../cpu-and-wallclock-overheads.txt           | 85 +++++++++++++++++
 tools/perf/Documentation/perf-report.txt      | 50 ++++++----
 tools/perf/Documentation/tips.txt             |  3 +
 tools/perf/builtin-record.c                   |  6 ++
 tools/perf/builtin-report.c                   | 30 ++++++
 tools/perf/ui/browsers/hists.c                | 27 ++++--
 tools/perf/ui/hist.c                          | 61 +++++++++++--
 tools/perf/util/addr_location.c               |  1 +
 tools/perf/util/addr_location.h               |  6 +-
 tools/perf/util/event.c                       |  6 ++
 tools/perf/util/events_stats.h                |  2 +
 tools/perf/util/hist.c                        | 91 ++++++++++++++++---
 tools/perf/util/hist.h                        | 25 ++++-
 tools/perf/util/machine.c                     |  7 ++
 tools/perf/util/machine.h                     |  6 ++
 tools/perf/util/session.c                     | 12 +++
 tools/perf/util/session.h                     |  1 +
 tools/perf/util/sort.c                        | 59 +++++++++++-
 tools/perf/util/sort.h                        |  1 +
 tools/perf/util/symbol.c                      | 34 +++++++
 tools/perf/util/symbol_conf.h                 |  7 +-
 22 files changed, 467 insertions(+), 58 deletions(-)
 create mode 100644 tools/perf/Documentation/cpu-and-wallclock-overheads.txt

diff --git a/tools/perf/Documentation/callchain-overhead-calculation.txt b/tools/perf/Documentation/callchain-overhead-calculation.txt
index 1a757927195ed..e0202bf5bd1a0 100644
--- a/tools/perf/Documentation/callchain-overhead-calculation.txt
+++ b/tools/perf/Documentation/callchain-overhead-calculation.txt
@@ -1,7 +1,8 @@
 Overhead calculation
 --------------------
-The overhead can be shown in two columns as 'Children' and 'Self' when
-perf collects callchains.  The 'self' overhead is simply calculated by
+The CPU overhead can be shown in two columns as 'Children' and 'Self'
+when perf collects callchains (and corresponding 'Wall' columns for
+wall-clock overhead).  The 'self' overhead is simply calculated by
 adding all period values of the entry - usually a function (symbol).
 This is the value that perf shows traditionally and sum of all the
 'self' overhead values should be 100%.
diff --git a/tools/perf/Documentation/cpu-and-wallclock-overheads.txt b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
new file mode 100644
index 0000000000000..4f739ff4de437
--- /dev/null
+++ b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
@@ -0,0 +1,85 @@
+CPU and wall-clock overheads
+----------------------------
+There are two notions of time: wall-clock time and CPU time.
+For a single-threaded program, or a program running on a single-core machine,
+these notions are the same. However, for a multi-threaded/multi-process program
+running on a multi-core machine, these notions are significantly different.
+Each second of wall-clock time we have number-of-cores seconds of CPU time.
+Perf can measure overhead for both of these times (shown in 'overhead' and
+'wallclock' columns for CPU and wall-clock correspondingly).
+
+Optimizing CPU overhead is useful to improve 'throughput', while optimizing
+wall-clock overhead is useful to improve 'latency'. It's important to understand
+which one is useful in a concrete situation at hand. For example, the former
+may be useful to improve max throughput of a CI build server that runs on 100%
+CPU utilization, while the latter may be useful to improve user-perceived
+latency of a single interactive program build.
+These overheads may be significantly different in some cases. For example,
+consider a program that executes function 'foo' for 9 seconds with 1 thread,
+and then executes function 'bar' for 1 second with 128 threads (consumes
+128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
+While wall-clock overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize
+running time of the program looking at the (wrong in this case) CPU overhead,
+we would concentrate on the function 'bar', but it can yield only 10% running
+time improvement at best.
+
+By default, perf shows both overheads (for parallel programs), but sorts by
+CPU overhead:
+
+-----------------------------------
+Overhead  Wallclock  Command
+  93.88%     25.79%  cc1
+   1.90%     39.87%  gzip
+   0.99%     10.16%  dpkg-deb
+   0.57%      1.00%  as
+   0.40%      0.46%  sh
+-----------------------------------
+
+To sort by wall-clock overhead, use '--sort=wallclock' flag:
+
+-----------------------------------
+Wallclock  Overhead  Command
+   39.87%     1.90%  gzip
+   25.79%    93.88%  cc1
+   10.16%     0.99%  dpkg-deb
+    4.17%     0.29%  git
+    2.81%     0.11%  objtool
+-----------------------------------
+
+To get insight into the difference between the overheads, you may check
+parallelization histogram with '--sort=wallclock,parallelism,comm,symbol --hierarchy'
+flags. It shows fraction of (wall-clock) time the workload utilizes different
+numbers of cores ('parallelism' column). For example, in the following case
+the workload utilizes only 1 core most of the time, but also has some
+highly-parallel phases, which explains significant difference between
+CPU and wall-clock overheads:
+
+-----------------------------------
+  Wallclock  Overhead     Parallelism / Command / Symbol
++    56.98%     2.29%     1
++    16.94%     1.36%     2
++     4.00%    20.13%     125
++     3.66%    18.25%     124
++     3.48%    17.66%     126
++     3.26%     0.39%     3
++     2.61%    12.93%     123
+-----------------------------------
+
+By expanding corresponding lines, you may see what programs/functions run
+at the given parallelism level:
+
+-----------------------------------
+  Wallclock  Overhead     Parallelism / Command / Symbol
+-    56.98%     2.29%     1
+        32.80%     1.32%     gzip
+         4.46%     0.18%     cc1
+         2.81%     0.11%     objtool
+         2.43%     0.10%     dpkg-source
+         2.22%     0.09%     ld
+         2.10%     0.08%     dpkg-genchanges
+-----------------------------------
+
+To see the normal function-level profile for particular parallelism levels
+(number of threads actively running on CPUs), you may use '--parallelism'
+filter. For example, to see the profile only for low parallelism phases
+of a workload use '--sort=wallclock,comm,symbol --parallelism=1-2' flags.
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 87f8645194062..e79ff412af0a5 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -44,7 +44,7 @@ OPTIONS
 --comms=::
 	Only consider symbols in these comms. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and wallclock columns.  See --percentage for more info.
 --pid=::
         Only show events for given process ID (comma separated list).
 
@@ -54,12 +54,12 @@ OPTIONS
 --dsos=::
 	Only consider symbols in these dsos. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and wallclock columns.  See --percentage for more info.
 -S::
 --symbols=::
 	Only consider these symbols. CSV that understands
 	file://filename entries.  This option will affect the percentage of
-	the overhead column.  See --percentage for more info.
+	the overhead and wallclock columns.  See --percentage for more info.
 
 --symbol-filter=::
 	Only show symbols that match (partially) with this filter.
@@ -68,6 +68,16 @@ OPTIONS
 --hide-unresolved::
         Only display entries resolved to a symbol.
 
+--parallelism::
+        Only consider these parallelism levels. Parallelism level is the number
+        of threads that actively run on CPUs at the time of sample. The flag
+        accepts single number, comma-separated list, and ranges (for example:
+        "1", "7,8", "1,64-128"). This is useful in understanding what a program
+        is doing during sequential/low-parallelism phases as compared to
+        high-parallelism phases. This option will affect the percentage of
+        the overhead and wallclock columns. See --percentage for more info.
+        Also see the `CPU and wall-clock overheads' section for more details.
+
 -s::
 --sort=::
 	Sort histogram entries by given key(s) - multiple keys can be specified
@@ -87,6 +97,7 @@ OPTIONS
 	entries are displayed as "[other]".
 	- cpu: cpu number the task ran at the time of sample
 	- socket: processor socket number the task ran at the time of sample
+	- parallelism: number of running threads at the time of sample
 	- srcline: filename and line number executed at the time of sample.  The
 	DWARF debugging info must be provided.
 	- srcfile: file name of the source file of the samples. Requires dwarf
@@ -97,12 +108,14 @@ OPTIONS
 	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
 	- cgroup: cgroup pathname in the cgroupfs.
 	- transaction: Transaction abort flags.
-	- overhead: Overhead percentage of sample
-	- overhead_sys: Overhead percentage of sample running in system mode
-	- overhead_us: Overhead percentage of sample running in user mode
-	- overhead_guest_sys: Overhead percentage of sample running in system mode
+	- overhead: CPU overhead percentage of sample
+	- wallclock: Wall-clock (latency) overhead percentage of sample.
+	  See the `CPU and wall-clock overheads' section for more details.
+	- overhead_sys: CPU overhead percentage of sample running in system mode
+	- overhead_us: CPU overhead percentage of sample running in user mode
+	- overhead_guest_sys: CPU overhead percentage of sample running in system mode
 	on guest machine
-	- overhead_guest_us: Overhead percentage of sample running in user mode on
+	- overhead_guest_us: CPU overhead percentage of sample running in user mode on
 	guest machine
 	- sample: Number of sample
 	- period: Raw number of event count of sample
@@ -125,8 +138,9 @@ OPTIONS
 	- weight2: Average value of event specific weight (2nd field of weight_struct).
 	- weight3: Average value of event specific weight (3rd field of weight_struct).
 
-	By default, comm, dso and symbol keys are used.
-	(i.e. --sort comm,dso,symbol)
+	By default, overhead, comm, dso and symbol keys are used.
+	(i.e. --sort overhead,comm,dso,symbol). If only wallclock is specified,
+	(i.e. --sort wallclock), it expands to wallclock,comm,dso,symbol.
 
 	If --branch-stack option is used, following sort keys are also
 	available:
@@ -201,9 +215,9 @@ OPTIONS
 --fields=::
 	Specify output field - multiple keys can be specified in CSV format.
 	Following fields are available:
-	overhead, overhead_sys, overhead_us, overhead_children, sample, period,
-	weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.  The
-	last 3 names are alias for the corresponding weights.  When the weight
+	overhead, wallclock, overhead_sys, overhead_us, overhead_children, sample,
+	period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
+	The last 3 names are alias for the corresponding weights.  When the weight
 	fields are used, they will show the average value of the weight.
 
 	Also it can contain any sort key(s).
@@ -289,7 +303,7 @@ OPTIONS
 	Accumulate callchain of children to parent entry so that then can
 	show up in the output.  The output will have a new "Children" column
 	and will be sorted on the data.  It requires callchains are recorded.
-	See the `overhead calculation' section for more details. Enabled by
+	See the `Overhead calculation' section for more details. Enabled by
 	default, disable with --no-children.
 
 --max-stack::
@@ -442,9 +456,9 @@ OPTIONS
 	--call-graph option for details.
 
 --percentage::
-	Determine how to display the overhead percentage of filtered entries.
-	Filters can be applied by --comms, --dsos and/or --symbols options and
-	Zoom operations on the TUI (thread, dso, etc).
+	Determine how to display the CPU and wall-clock overhead percentage
+	of filtered entries. Filters can be applied by --comms, --dsos, --symbols
+	and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc).
 
 	"relative" means it's relative to filtered entries only so that the
 	sum of shown entries will be always 100%.  "absolute" means it retains
@@ -627,6 +641,8 @@ include::itrace.txt[]
 --skip-empty::
 	Do not print 0 results in the --stat output.
 
+include::cpu-and-wallclock-overheads.txt[]
+
 include::callchain-overhead-calculation.txt[]
 
 SEE ALSO
diff --git a/tools/perf/Documentation/tips.txt b/tools/perf/Documentation/tips.txt
index 67b326ba00407..50916c58b81c5 100644
--- a/tools/perf/Documentation/tips.txt
+++ b/tools/perf/Documentation/tips.txt
@@ -62,3 +62,6 @@ To show context switches in perf report sample context add --switch-events to pe
 To show time in nanoseconds in record/report add --ns
 To compare hot regions in two workloads use perf record -b -o file ... ; perf diff --stream file1 file2
 To compare scalability of two workload samples use perf diff -c ratio file1 file2
+For latency analysis, try: perf report --sort wallclock
+For parallelism histogram, try: perf report --hierarchy --sort wallclock,parallelism,comm,symbol
+To analyze particular parallelism levels, try: perf report --sort=wallclock,symbol --parallelism=32-64
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f832524729211..225e3997f03cb 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3342,6 +3342,12 @@ static struct record record = {
 		.ctl_fd              = -1,
 		.ctl_fd_ack          = -1,
 		.synth               = PERF_SYNTH_ALL,
+		/*
+		 * This is enabled by default because it's required to show
+		 * wallclock overhead and parallelism profile, and does not
+		 * add lots of overhead on top of samples data.
+		 */
+		.record_switch_events	= true,
 	},
 };
 
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 42d7dfdf07d9b..d86937174a1e1 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -112,6 +112,8 @@ struct report {
 	u64			nr_entries;
 	u64			queue_size;
 	u64			total_cycles;
+	u64			total_samples;
+	u64			singlethreaded_samples;
 	int			socket_filter;
 	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 	struct branch_type_stat	brtype_stat;
@@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
 				     &rep->total_cycles, evsel);
 	}
 
+	rep->total_samples++;
+	if (al.parallelism == 1)
+		rep->singlethreaded_samples++;
+
 	ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
 	if (ret < 0)
 		pr_debug("problem adding hist entry, skipping event\n");
@@ -1081,6 +1087,10 @@ static int __cmd_report(struct report *rep)
 		return ret;
 	}
 
+	/* Don't show Wallclock column for non-parallel profiles. */
+	if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99)
+		perf_hpp__cancel_wallclock();
+
 	evlist__check_mem_load_aux(session->evlist);
 
 	if (rep->stats_mode)
@@ -1392,6 +1402,8 @@ int cmd_report(int argc, const char **argv)
 		     symbol__config_symfs),
 	OPT_STRING('C', "cpu", &report.cpu_list, "cpu",
 		   "list of cpus to profile"),
+	OPT_STRING(0, "parallelism", &symbol_conf.parallelism_list_str, "parallelism",
+		   "only consider these parallelism levels (cpu set format)"),
 	OPT_BOOLEAN('I', "show-info", &report.show_full_info,
 		    "Display extended information about perf.data file"),
 	OPT_BOOLEAN(0, "source", &annotate_opts.annotate_src,
@@ -1570,6 +1582,7 @@ int cmd_report(int argc, const char **argv)
 	report.tool.cgroup		 = perf_event__process_cgroup;
 	report.tool.exit		 = perf_event__process_exit;
 	report.tool.fork		 = perf_event__process_fork;
+	report.tool.context_switch	 = perf_event__process_switch;
 	report.tool.lost		 = perf_event__process_lost;
 	report.tool.read		 = process_read_event;
 	report.tool.attr		 = process_attr;
@@ -1721,6 +1734,23 @@ int cmd_report(int argc, const char **argv)
 		symbol_conf.annotate_data_sample = true;
 	}
 
+	if (report.disable_order || !perf_session__has_switch_events(session)) {
+		if (symbol_conf.parallelism_list_str ||
+			(sort_order && (strstr(sort_order, "wallclock") ||
+				strstr(sort_order, "parallelism"))) ||
+			(field_order && (strstr(field_order, "wallclock") ||
+				strstr(field_order, "parallelism")))) {
+			if (report.disable_order)
+				ui__error("Use of wallclock profile or parallelism"
+					" is incompatible with --disable-order.\n");
+			else
+				ui__error("Use of wallclock profile or parallelism"
+					" requires --switch-events during record.\n");
+			return -1;
+		}
+		symbol_conf.disable_wallclock = true;
+	}
+
 	if (last_key != K_SWITCH_INPUT_DATA) {
 		if (sort_order && strstr(sort_order, "ipc")) {
 			parse_options_usage(report_usage, options, "s", 1);
diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index 49ba82bf33918..abcc4a11849f4 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -1226,7 +1226,7 @@ int __hpp__slsmg_color_printf(struct perf_hpp *hpp, const char *fmt, ...)
 	return ret;
 }
 
-#define __HPP_COLOR_PERCENT_FN(_type, _field)				\
+#define __HPP_COLOR_PERCENT_FN(_type, _field, _fmttype)			\
 static u64 __hpp_get_##_field(struct hist_entry *he)			\
 {									\
 	return he->stat._field;						\
@@ -1238,10 +1238,10 @@ hist_browser__hpp_color_##_type(struct perf_hpp_fmt *fmt,		\
 				struct hist_entry *he)			\
 {									\
 	return hpp__fmt(fmt, hpp, he, __hpp_get_##_field, " %*.2f%%",	\
-			__hpp__slsmg_color_printf, true);		\
+			__hpp__slsmg_color_printf, _fmttype);		\
 }
 
-#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field)			\
+#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field, _fmttype)		\
 static u64 __hpp_get_acc_##_field(struct hist_entry *he)		\
 {									\
 	return he->stat_acc->_field;					\
@@ -1262,15 +1262,18 @@ hist_browser__hpp_color_##_type(struct perf_hpp_fmt *fmt,		\
 		return ret;						\
 	}								\
 	return hpp__fmt(fmt, hpp, he, __hpp_get_acc_##_field,		\
-			" %*.2f%%", __hpp__slsmg_color_printf, true);	\
+			" %*.2f%%", __hpp__slsmg_color_printf,		\
+			_fmttype);					\
 }
 
-__HPP_COLOR_PERCENT_FN(overhead, period)
-__HPP_COLOR_PERCENT_FN(overhead_sys, period_sys)
-__HPP_COLOR_PERCENT_FN(overhead_us, period_us)
-__HPP_COLOR_PERCENT_FN(overhead_guest_sys, period_guest_sys)
-__HPP_COLOR_PERCENT_FN(overhead_guest_us, period_guest_us)
-__HPP_COLOR_ACC_PERCENT_FN(overhead_acc, period)
+__HPP_COLOR_PERCENT_FN(overhead, period, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(wallclock, wallclock, PERF_HPP_FMT_TYPE__WALLCLOCK)
+__HPP_COLOR_PERCENT_FN(overhead_sys, period_sys, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_us, period_us, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_guest_sys, period_guest_sys, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_guest_us, period_guest_us, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_ACC_PERCENT_FN(overhead_acc, period, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_ACC_PERCENT_FN(wallclock_acc, wallclock, PERF_HPP_FMT_TYPE__WALLCLOCK)
 
 #undef __HPP_COLOR_PERCENT_FN
 #undef __HPP_COLOR_ACC_PERCENT_FN
@@ -1279,6 +1282,8 @@ void hist_browser__init_hpp(void)
 {
 	perf_hpp__format[PERF_HPP__OVERHEAD].color =
 				hist_browser__hpp_color_overhead;
+	perf_hpp__format[PERF_HPP__WALLCLOCK].color =
+				hist_browser__hpp_color_wallclock;
 	perf_hpp__format[PERF_HPP__OVERHEAD_SYS].color =
 				hist_browser__hpp_color_overhead_sys;
 	perf_hpp__format[PERF_HPP__OVERHEAD_US].color =
@@ -1289,6 +1294,8 @@ void hist_browser__init_hpp(void)
 				hist_browser__hpp_color_overhead_guest_us;
 	perf_hpp__format[PERF_HPP__OVERHEAD_ACC].color =
 				hist_browser__hpp_color_overhead_acc;
+	perf_hpp__format[PERF_HPP__WALLCLOCK_ACC].color =
+				hist_browser__hpp_color_wallclock_acc;
 
 	res_sample_init();
 }
diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
index 34fda1d5eccb4..c85ad841fb4ed 100644
--- a/tools/perf/ui/hist.c
+++ b/tools/perf/ui/hist.c
@@ -27,9 +27,10 @@ static int __hpp__fmt_print(struct perf_hpp *hpp, struct hists *hists, u64 val,
 			    int nr_samples, const char *fmt, int len,
 			    hpp_snprint_fn print_fn, enum perf_hpp_fmt_type fmtype)
 {
-	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT) {
+	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT || fmtype == PERF_HPP_FMT_TYPE__WALLCLOCK) {
 		double percent = 0.0;
-		u64 total = hists__total_period(hists);
+		u64 total = fmtype == PERF_HPP_FMT_TYPE__PERCENT ? hists__total_period(hists) :
+			hists__total_wallclock(hists);
 
 		if (total)
 			percent = 100.0 * val / total;
@@ -128,7 +129,7 @@ int hpp__fmt(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
 				  print_fn, fmtype);
 	}
 
-	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT)
+	if (fmtype == PERF_HPP_FMT_TYPE__PERCENT || fmtype == PERF_HPP_FMT_TYPE__WALLCLOCK)
 		len -= 2; /* 2 for a space and a % sign */
 	else
 		len -= 1;
@@ -472,11 +473,13 @@ __HPP_ENTRY_AVERAGE_FN(_type, _field)					\
 __HPP_SORT_AVERAGE_FN(_type, _field)
 
 HPP_PERCENT_FNS(overhead, period)
+HPP_PERCENT_FNS(wallclock, wallclock)
 HPP_PERCENT_FNS(overhead_sys, period_sys)
 HPP_PERCENT_FNS(overhead_us, period_us)
 HPP_PERCENT_FNS(overhead_guest_sys, period_guest_sys)
 HPP_PERCENT_FNS(overhead_guest_us, period_guest_us)
 HPP_PERCENT_ACC_FNS(overhead_acc, period)
+HPP_PERCENT_ACC_FNS(wallclock_acc, wallclock)
 
 HPP_RAW_FNS(samples, nr_events)
 HPP_RAW_FNS(period, period)
@@ -548,11 +551,13 @@ static bool hpp__equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
 
 struct perf_hpp_fmt perf_hpp__format[] = {
 	HPP__COLOR_PRINT_FNS("Overhead", overhead, OVERHEAD),
+	HPP__COLOR_PRINT_FNS("Wallclock", wallclock, WALLCLOCK),
 	HPP__COLOR_PRINT_FNS("sys", overhead_sys, OVERHEAD_SYS),
 	HPP__COLOR_PRINT_FNS("usr", overhead_us, OVERHEAD_US),
 	HPP__COLOR_PRINT_FNS("guest sys", overhead_guest_sys, OVERHEAD_GUEST_SYS),
 	HPP__COLOR_PRINT_FNS("guest usr", overhead_guest_us, OVERHEAD_GUEST_US),
 	HPP__COLOR_ACC_PRINT_FNS("Children", overhead_acc, OVERHEAD_ACC),
+	HPP__COLOR_ACC_PRINT_FNS("Wall", wallclock_acc, WALLCLOCK_ACC),
 	HPP__PRINT_FNS("Samples", samples, SAMPLES),
 	HPP__PRINT_FNS("Period", period, PERIOD),
 	HPP__PRINT_FNS("Weight1", weight1, WEIGHT1),
@@ -601,9 +606,15 @@ static void fmt_free(struct perf_hpp_fmt *fmt)
 		fmt->free(fmt);
 }
 
+static bool fmt_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
+{
+	return a->equal && a->equal(a, b);
+}
+
 void perf_hpp__init(void)
 {
 	int i;
+	bool prefer_wallclock;
 
 	for (i = 0; i < PERF_HPP__MAX_INDEX; i++) {
 		struct perf_hpp_fmt *fmt = &perf_hpp__format[i];
@@ -621,12 +632,30 @@ void perf_hpp__init(void)
 	if (is_strict_order(field_order))
 		return;
 
+	/*
+	 * There are numerous ways how a user can setup a custom view with
+	 * --sort/fields flags, but we provide --sort=wallclock as a simple
+	 * preset for latency analysis and show wallclock before CPU overhead.
+	 */
+	prefer_wallclock = !symbol_conf.disable_wallclock && sort_order &&
+		!strncmp(sort_order, "wallclock", sizeof("wallclock")-1);
+
 	if (symbol_conf.cumulate_callchain) {
+		/* Use idempotent addition to avoid more complex logic. */
+		if (prefer_wallclock)
+			hpp_dimension__add_output(PERF_HPP__WALLCLOCK_ACC);
 		hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
+		if (!symbol_conf.disable_wallclock)
+			hpp_dimension__add_output(PERF_HPP__WALLCLOCK_ACC);
 		perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
+		perf_hpp__format[PERF_HPP__WALLCLOCK].name = "Wall";
 	}
 
+	if (prefer_wallclock)
+		hpp_dimension__add_output(PERF_HPP__WALLCLOCK);
 	hpp_dimension__add_output(PERF_HPP__OVERHEAD);
+	if (!symbol_conf.disable_wallclock)
+		hpp_dimension__add_output(PERF_HPP__WALLCLOCK);
 
 	if (symbol_conf.show_cpu_utilization) {
 		hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
@@ -671,28 +700,43 @@ static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
 
 void perf_hpp__cancel_cumulate(void)
 {
-	struct perf_hpp_fmt *fmt, *acc, *ovh, *tmp;
+	struct perf_hpp_fmt *fmt, *acc, *ovh, *wall, *acc_wall, *tmp;
 
 	if (is_strict_order(field_order))
 		return;
 
 	ovh = &perf_hpp__format[PERF_HPP__OVERHEAD];
 	acc = &perf_hpp__format[PERF_HPP__OVERHEAD_ACC];
+	wall = &perf_hpp__format[PERF_HPP__WALLCLOCK];
+	acc_wall = &perf_hpp__format[PERF_HPP__WALLCLOCK_ACC];
 
 	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
-		if (acc->equal(acc, fmt)) {
+		if (fmt_equal(acc, fmt) || fmt_equal(acc_wall, fmt)) {
 			perf_hpp__column_unregister(fmt);
 			continue;
 		}
 
-		if (ovh->equal(ovh, fmt))
+		if (fmt_equal(ovh, fmt))
 			fmt->name = "Overhead";
+		if (fmt_equal(wall, fmt))
+			fmt->name = "Wallclock";
 	}
 }
 
-static bool fmt_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
+void perf_hpp__cancel_wallclock(void)
 {
-	return a->equal && a->equal(a, b);
+	struct perf_hpp_fmt *fmt, *wall, *acc, *tmp;
+
+	if (is_strict_order(field_order))
+		return;
+
+	wall = &perf_hpp__format[PERF_HPP__WALLCLOCK];
+	acc = &perf_hpp__format[PERF_HPP__WALLCLOCK_ACC];
+
+	perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
+		if (fmt_equal(wall, fmt) || fmt_equal(acc, fmt))
+			perf_hpp__column_unregister(fmt);
+	}
 }
 
 void perf_hpp__setup_output_field(struct perf_hpp_list *list)
@@ -819,6 +863,7 @@ void perf_hpp__reset_width(struct perf_hpp_fmt *fmt, struct hists *hists)
 
 	switch (fmt->idx) {
 	case PERF_HPP__OVERHEAD:
+	case PERF_HPP__WALLCLOCK:
 	case PERF_HPP__OVERHEAD_SYS:
 	case PERF_HPP__OVERHEAD_US:
 	case PERF_HPP__OVERHEAD_ACC:
diff --git a/tools/perf/util/addr_location.c b/tools/perf/util/addr_location.c
index 51825ef8c0ab7..007a2f5df9a6a 100644
--- a/tools/perf/util/addr_location.c
+++ b/tools/perf/util/addr_location.c
@@ -17,6 +17,7 @@ void addr_location__init(struct addr_location *al)
 	al->cpumode = 0;
 	al->cpu = 0;
 	al->socket = 0;
+	al->parallelism = 1;
 }
 
 /*
diff --git a/tools/perf/util/addr_location.h b/tools/perf/util/addr_location.h
index d8ac0428dff23..d8af6f467ae9d 100644
--- a/tools/perf/util/addr_location.h
+++ b/tools/perf/util/addr_location.h
@@ -17,10 +17,14 @@ struct addr_location {
 	const char    *srcline;
 	u64	      addr;
 	char	      level;
-	u8	      filtered;
+	u16	      filtered;
 	u8	      cpumode;
 	s32	      cpu;
 	s32	      socket;
+	/* Same as machine.parallelism but within [1, nr_cpus]. */
+	int	      parallelism;
+	/* See he_stat.wallclock. */
+	u64	      wallclock;
 };
 
 void addr_location__init(struct addr_location *al);
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d19170..1c50af20b3a4c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -767,6 +767,12 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
 			al->socket = env->cpu[al->cpu].socket_id;
 	}
 
+	/* Account for possible out-of-order switch events. */
+	al->parallelism = min(max(machine->parallelism, 1), machine__nr_cpus_avail(machine));
+	if (test_bit(al->parallelism, symbol_conf.parallelism_filter))
+		al->filtered |= (1 << HIST_FILTER__PARALLELISM);
+	al->wallclock = sample->period * 1000 / al->parallelism;
+
 	if (al->map) {
 		if (symbol_conf.dso_list &&
 		    (!dso || !(strlist__has_entry(symbol_conf.dso_list,
diff --git a/tools/perf/util/events_stats.h b/tools/perf/util/events_stats.h
index eabd7913c3092..c32c99c16ad39 100644
--- a/tools/perf/util/events_stats.h
+++ b/tools/perf/util/events_stats.h
@@ -57,6 +57,8 @@ struct events_stats {
 struct hists_stats {
 	u64 total_period;
 	u64 total_non_filtered_period;
+	u64 total_wallclock;
+	u64 total_non_filtered_wallclock;
 	u32 nr_samples;
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_samples;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index b99d89b1a152d..fe690e9aece45 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -43,6 +43,8 @@ static bool hists__filter_entry_by_symbol(struct hists *hists,
 					  struct hist_entry *he);
 static bool hists__filter_entry_by_socket(struct hists *hists,
 					  struct hist_entry *he);
+static bool hists__filter_entry_by_parallelism(struct hists *hists,
+					       struct hist_entry *he);
 
 u16 hists__col_len(struct hists *hists, enum hist_column col)
 {
@@ -207,6 +209,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 
 	hists__new_col_len(hists, HISTC_CGROUP, 6);
 	hists__new_col_len(hists, HISTC_CGROUP_ID, 20);
+	hists__new_col_len(hists, HISTC_PARALLELISM, 11);
 	hists__new_col_len(hists, HISTC_CPU, 3);
 	hists__new_col_len(hists, HISTC_SOCKET, 6);
 	hists__new_col_len(hists, HISTC_MEM_LOCKED, 6);
@@ -302,9 +305,10 @@ static long hist_time(unsigned long htime)
 	return htime;
 }
 
-static void he_stat__add_period(struct he_stat *he_stat, u64 period)
+static void he_stat__add_period(struct he_stat *he_stat, u64 period, u64 wallclock)
 {
 	he_stat->period		+= period;
+	he_stat->wallclock	+= wallclock;
 	he_stat->nr_events	+= 1;
 }
 
@@ -319,6 +323,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
 	dest->weight2		+= src->weight2;
 	dest->weight3		+= src->weight3;
 	dest->nr_events		+= src->nr_events;
+	dest->wallclock		+= src->wallclock;
 }
 
 static void he_stat__decay(struct he_stat *he_stat)
@@ -328,6 +333,7 @@ static void he_stat__decay(struct he_stat *he_stat)
 	he_stat->weight1 = (he_stat->weight1 * 7) / 8;
 	he_stat->weight2 = (he_stat->weight2 * 7) / 8;
 	he_stat->weight3 = (he_stat->weight3 * 7) / 8;
+	he_stat->wallclock = (he_stat->wallclock * 7) / 8;
 }
 
 static void hists__delete_entry(struct hists *hists, struct hist_entry *he);
@@ -335,7 +341,7 @@ static void hists__delete_entry(struct hists *hists, struct hist_entry *he);
 static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
 {
 	u64 prev_period = he->stat.period;
-	u64 diff;
+	u64 prev_wallclock = he->stat.wallclock;
 
 	if (prev_period == 0)
 		return true;
@@ -345,12 +351,16 @@ static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
 		he_stat__decay(he->stat_acc);
 	decay_callchain(he->callchain);
 
-	diff = prev_period - he->stat.period;
-
 	if (!he->depth) {
-		hists->stats.total_period -= diff;
-		if (!he->filtered)
-			hists->stats.total_non_filtered_period -= diff;
+		u64 period_diff = prev_period - he->stat.period;
+		u64 wallclock_diff = prev_wallclock - he->stat.wallclock;
+
+		hists->stats.total_period -= period_diff;
+		hists->stats.total_wallclock -= wallclock_diff;
+		if (!he->filtered) {
+			hists->stats.total_non_filtered_period -= period_diff;
+			hists->stats.total_non_filtered_wallclock -= wallclock_diff;
+		}
 	}
 
 	if (!he->leaf) {
@@ -365,7 +375,7 @@ static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
 		}
 	}
 
-	return he->stat.period == 0;
+	return he->stat.period == 0 && he->stat.wallclock == 0;
 }
 
 static void hists__delete_entry(struct hists *hists, struct hist_entry *he)
@@ -584,21 +594,24 @@ static struct hist_entry *hist_entry__new(struct hist_entry *template,
 	return he;
 }
 
-static u8 symbol__parent_filter(const struct symbol *parent)
+static u16 symbol__parent_filter(const struct symbol *parent)
 {
 	if (symbol_conf.exclude_other && parent == NULL)
 		return 1 << HIST_FILTER__PARENT;
 	return 0;
 }
 
-static void hist_entry__add_callchain_period(struct hist_entry *he, u64 period)
+static void hist_entry__add_callchain_period(struct hist_entry *he, u64 period, u64 wallclock)
 {
 	if (!hist_entry__has_callchains(he) || !symbol_conf.use_callchain)
 		return;
 
 	he->hists->callchain_period += period;
-	if (!he->filtered)
+	he->hists->callchain_wallclock += wallclock;
+	if (!he->filtered) {
 		he->hists->callchain_non_filtered_period += period;
+		he->hists->callchain_non_filtered_wallclock += wallclock;
+	}
 }
 
 static struct hist_entry *hists__findnew_entry(struct hists *hists,
@@ -611,6 +624,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 	struct hist_entry *he;
 	int64_t cmp;
 	u64 period = entry->stat.period;
+	u64 wallclock = entry->stat.wallclock;
 	bool leftmost = true;
 
 	p = &hists->entries_in->rb_root.rb_node;
@@ -629,10 +643,10 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 		if (!cmp) {
 			if (sample_self) {
 				he_stat__add_stat(&he->stat, &entry->stat);
-				hist_entry__add_callchain_period(he, period);
+				hist_entry__add_callchain_period(he, period, wallclock);
 			}
 			if (symbol_conf.cumulate_callchain)
-				he_stat__add_period(he->stat_acc, period);
+				he_stat__add_period(he->stat_acc, period, wallclock);
 
 			block_info__delete(entry->block_info);
 
@@ -669,7 +683,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 		return NULL;
 
 	if (sample_self)
-		hist_entry__add_callchain_period(he, period);
+		hist_entry__add_callchain_period(he, period, wallclock);
 	hists->nr_entries++;
 
 	rb_link_node(&he->rb_node_in, parent, p);
@@ -741,12 +755,14 @@ __hists__add_entry(struct hists *hists,
 		.ip	 = al->addr,
 		.level	 = al->level,
 		.code_page_size = sample->code_page_size,
+		.parallelism	= al->parallelism,
 		.stat = {
 			.nr_events = 1,
 			.period	= sample->period,
 			.weight1 = sample->weight,
 			.weight2 = sample->ins_lat,
 			.weight3 = sample->p_stage_cyc,
+			.wallclock = al->wallclock,
 		},
 		.parent = sym_parent,
 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
@@ -1456,6 +1472,10 @@ static void hist_entry__check_and_remove_filter(struct hist_entry *he,
 		if (symbol_conf.sym_list == NULL)
 			return;
 		break;
+	case HIST_FILTER__PARALLELISM:
+		if (__bitmap_weight(symbol_conf.parallelism_filter, MAX_NR_CPUS + 1) == 0)
+			return;
+		break;
 	case HIST_FILTER__PARENT:
 	case HIST_FILTER__GUEST:
 	case HIST_FILTER__HOST:
@@ -1514,6 +1534,9 @@ static void hist_entry__apply_hierarchy_filters(struct hist_entry *he)
 	hist_entry__check_and_remove_filter(he, HIST_FILTER__SYMBOL,
 					    perf_hpp__is_sym_entry);
 
+	hist_entry__check_and_remove_filter(he, HIST_FILTER__PARALLELISM,
+					    perf_hpp__is_parallelism_entry);
+
 	hists__apply_filters(he->hists, he);
 }
 
@@ -1710,6 +1733,7 @@ static void hists__apply_filters(struct hists *hists, struct hist_entry *he)
 	hists__filter_entry_by_thread(hists, he);
 	hists__filter_entry_by_symbol(hists, he);
 	hists__filter_entry_by_socket(hists, he);
+	hists__filter_entry_by_parallelism(hists, he);
 }
 
 int hists__collapse_resort(struct hists *hists, struct ui_progress *prog)
@@ -1757,12 +1781,14 @@ static void hists__reset_filter_stats(struct hists *hists)
 {
 	hists->nr_non_filtered_entries = 0;
 	hists->stats.total_non_filtered_period = 0;
+	hists->stats.total_non_filtered_wallclock = 0;
 }
 
 void hists__reset_stats(struct hists *hists)
 {
 	hists->nr_entries = 0;
 	hists->stats.total_period = 0;
+	hists->stats.total_wallclock = 0;
 
 	hists__reset_filter_stats(hists);
 }
@@ -1771,6 +1797,7 @@ static void hists__inc_filter_stats(struct hists *hists, struct hist_entry *h)
 {
 	hists->nr_non_filtered_entries++;
 	hists->stats.total_non_filtered_period += h->stat.period;
+	hists->stats.total_non_filtered_wallclock += h->stat.wallclock;
 }
 
 void hists__inc_stats(struct hists *hists, struct hist_entry *h)
@@ -1780,6 +1807,7 @@ void hists__inc_stats(struct hists *hists, struct hist_entry *h)
 
 	hists->nr_entries++;
 	hists->stats.total_period += h->stat.period;
+	hists->stats.total_wallclock += h->stat.wallclock;
 }
 
 static void hierarchy_recalc_total_periods(struct hists *hists)
@@ -1791,6 +1819,8 @@ static void hierarchy_recalc_total_periods(struct hists *hists)
 
 	hists->stats.total_period = 0;
 	hists->stats.total_non_filtered_period = 0;
+	hists->stats.total_wallclock = 0;
+	hists->stats.total_non_filtered_wallclock = 0;
 
 	/*
 	 * recalculate total period using top-level entries only
@@ -1802,8 +1832,11 @@ static void hierarchy_recalc_total_periods(struct hists *hists)
 		node = rb_next(node);
 
 		hists->stats.total_period += he->stat.period;
-		if (!he->filtered)
+		hists->stats.total_wallclock += he->stat.wallclock;
+		if (!he->filtered) {
 			hists->stats.total_non_filtered_period += he->stat.period;
+			hists->stats.total_non_filtered_wallclock += he->stat.wallclock;
+		}
 	}
 }
 
@@ -2196,6 +2229,17 @@ static bool hists__filter_entry_by_socket(struct hists *hists,
 	return false;
 }
 
+static bool hists__filter_entry_by_parallelism(struct hists *hists,
+					       struct hist_entry *he)
+{
+	if (test_bit(he->parallelism, hists->parallelism_filter)) {
+		he->filtered |= (1 << HIST_FILTER__PARALLELISM);
+		return true;
+	}
+
+	return false;
+}
+
 typedef bool (*filter_fn_t)(struct hists *hists, struct hist_entry *he);
 
 static void hists__filter_by_type(struct hists *hists, int type, filter_fn_t filter)
@@ -2365,6 +2409,16 @@ void hists__filter_by_socket(struct hists *hists)
 				      hists__filter_entry_by_socket);
 }
 
+void hists__filter_by_parallelism(struct hists *hists)
+{
+	if (symbol_conf.report_hierarchy)
+		hists__filter_hierarchy(hists, HIST_FILTER__PARALLELISM,
+					hists->parallelism_filter);
+	else
+		hists__filter_by_type(hists, HIST_FILTER__PARALLELISM,
+				      hists__filter_entry_by_parallelism);
+}
+
 void events_stats__inc(struct events_stats *stats, u32 type)
 {
 	++stats->nr_events[0];
@@ -2760,6 +2814,12 @@ u64 hists__total_period(struct hists *hists)
 		hists->stats.total_period;
 }
 
+u64 hists__total_wallclock(struct hists *hists)
+{
+	return symbol_conf.filter_relative ? hists->stats.total_non_filtered_wallclock :
+		hists->stats.total_wallclock;
+}
+
 int __hists__scnprintf_title(struct hists *hists, char *bf, size_t size, bool show_freq)
 {
 	char unit;
@@ -2871,6 +2931,7 @@ int __hists__init(struct hists *hists, struct perf_hpp_list *hpp_list)
 	hists->entries = RB_ROOT_CACHED;
 	mutex_init(&hists->lock);
 	hists->socket_filter = -1;
+	hists->parallelism_filter = symbol_conf.parallelism_filter;
 	hists->hpp_list = hpp_list;
 	INIT_LIST_HEAD(&hists->hpp_formats);
 	return 0;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 6cff03eb043b7..db9c05b810bd7 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -31,6 +31,7 @@ enum hist_filter {
 	HIST_FILTER__HOST,
 	HIST_FILTER__SOCKET,
 	HIST_FILTER__C2C,
+	HIST_FILTER__PARALLELISM,
 };
 
 enum hist_column {
@@ -42,6 +43,7 @@ enum hist_column {
 	HISTC_CGROUP_ID,
 	HISTC_CGROUP,
 	HISTC_PARENT,
+	HISTC_PARALLELISM,
 	HISTC_CPU,
 	HISTC_SOCKET,
 	HISTC_SRCLINE,
@@ -105,10 +107,13 @@ struct hists {
 	u64			nr_non_filtered_entries;
 	u64			callchain_period;
 	u64			callchain_non_filtered_period;
+	u64			callchain_wallclock;
+	u64			callchain_non_filtered_wallclock;
 	struct thread		*thread_filter;
 	const struct dso	*dso_filter;
 	const char		*uid_filter_str;
 	const char		*symbol_filter_str;
+	unsigned long		*parallelism_filter;
 	struct mutex		lock;
 	struct hists_stats	stats;
 	u64			event_stream;
@@ -165,6 +170,13 @@ struct res_sample {
 
 struct he_stat {
 	u64			period;
+	/*
+	 * Overhead re-scaled from CPU time to wallclock time (period divided
+	 * by the parallelism at the time of the sample). It's multipled by
+	 * some const to avoid precision loss or dealing with floats.
+	 * It does not matter otherwise since we only print it as percents.
+	 */
+	u64			wallclock;
 	u64			period_sys;
 	u64			period_us;
 	u64			period_guest_sys;
@@ -228,6 +240,7 @@ struct hist_entry {
 	u64			transaction;
 	s32			socket;
 	s32			cpu;
+	int			parallelism;
 	u64			code_page_size;
 	u64			weight;
 	u64			ins_lat;
@@ -242,7 +255,7 @@ struct hist_entry {
 	bool			leaf;
 
 	char			level;
-	u8			filtered;
+	u16			filtered;
 
 	u16			callchain_size;
 	union {
@@ -368,6 +381,7 @@ void hists__output_recalc_col_len(struct hists *hists, int max_rows);
 struct hist_entry *hists__get_entry(struct hists *hists, int idx);
 
 u64 hists__total_period(struct hists *hists);
+u64 hists__total_wallclock(struct hists *hists);
 void hists__reset_stats(struct hists *hists);
 void hists__inc_stats(struct hists *hists, struct hist_entry *h);
 void hists__inc_nr_events(struct hists *hists);
@@ -384,11 +398,13 @@ void hists__filter_by_dso(struct hists *hists);
 void hists__filter_by_thread(struct hists *hists);
 void hists__filter_by_symbol(struct hists *hists);
 void hists__filter_by_socket(struct hists *hists);
+void hists__filter_by_parallelism(struct hists *hists);
 
 static inline bool hists__has_filter(struct hists *hists)
 {
 	return hists->thread_filter || hists->dso_filter ||
-		hists->symbol_filter_str || (hists->socket_filter > -1);
+		hists->symbol_filter_str || (hists->socket_filter > -1) ||
+		hists->parallelism_filter;
 }
 
 u16 hists__col_len(struct hists *hists, enum hist_column col);
@@ -547,11 +563,13 @@ extern struct perf_hpp_fmt perf_hpp__format[];
 enum {
 	/* Matches perf_hpp__format array. */
 	PERF_HPP__OVERHEAD,
+	PERF_HPP__WALLCLOCK,
 	PERF_HPP__OVERHEAD_SYS,
 	PERF_HPP__OVERHEAD_US,
 	PERF_HPP__OVERHEAD_GUEST_SYS,
 	PERF_HPP__OVERHEAD_GUEST_US,
 	PERF_HPP__OVERHEAD_ACC,
+	PERF_HPP__WALLCLOCK_ACC,
 	PERF_HPP__SAMPLES,
 	PERF_HPP__PERIOD,
 	PERF_HPP__WEIGHT1,
@@ -563,6 +581,7 @@ enum {
 
 void perf_hpp__init(void);
 void perf_hpp__cancel_cumulate(void);
+void perf_hpp__cancel_wallclock(void);
 void perf_hpp__setup_output_field(struct perf_hpp_list *list);
 void perf_hpp__reset_output_field(struct perf_hpp_list *list);
 void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
@@ -580,6 +599,7 @@ bool perf_hpp__is_thread_entry(struct perf_hpp_fmt *fmt);
 bool perf_hpp__is_comm_entry(struct perf_hpp_fmt *fmt);
 bool perf_hpp__is_dso_entry(struct perf_hpp_fmt *fmt);
 bool perf_hpp__is_sym_entry(struct perf_hpp_fmt *fmt);
+bool perf_hpp__is_parallelism_entry(struct perf_hpp_fmt *fmt);
 
 struct perf_hpp_fmt *perf_hpp_fmt__dup(struct perf_hpp_fmt *fmt);
 
@@ -606,6 +626,7 @@ void hists__reset_column_width(struct hists *hists);
 enum perf_hpp_fmt_type {
 	PERF_HPP_FMT_TYPE__RAW,
 	PERF_HPP_FMT_TYPE__PERCENT,
+	PERF_HPP_FMT_TYPE__WALLCLOCK,
 	PERF_HPP_FMT_TYPE__AVERAGE,
 };
 
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 27d5345d2b307..956581bfeb5b1 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -94,6 +94,8 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
 	machine->comm_exec = false;
 	machine->kernel_start = 0;
 	machine->vmlinux_map = NULL;
+	/* There is no initial context switch in, so we start at 1. */
+	machine->parallelism = 1;
 
 	machine->root_dir = strdup(root_dir);
 	if (machine->root_dir == NULL)
@@ -677,8 +679,11 @@ int machine__process_aux_output_hw_id_event(struct machine *machine __maybe_unus
 int machine__process_switch_event(struct machine *machine __maybe_unused,
 				  union perf_event *event)
 {
+	bool out = event->header.misc & PERF_RECORD_MISC_SWITCH_OUT;
+
 	if (dump_trace)
 		perf_event__fprintf_switch(event, stdout);
+	machine->parallelism += out ? -1 : 1;
 	return 0;
 }
 
@@ -1898,6 +1903,8 @@ int machine__process_exit_event(struct machine *machine, union perf_event *event
 	if (dump_trace)
 		perf_event__fprintf_task(event, stdout);
 
+	/* There is no context switch out before exit, so we decrement here. */
+	machine->parallelism--;
 	if (thread != NULL) {
 		if (symbol_conf.keep_exited_threads)
 			thread__set_exited(thread, /*exited=*/true);
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 2e5a4cb342d82..a5f28652873ce 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -50,6 +50,12 @@ struct machine {
 		u64	  text_start;
 		u64	  text_end;
 	} sched, lock, traceiter, trace;
+	/*
+	 * The current parallelism level (number of threads that run on CPUs).
+	 * This value can be less than 1, or larger than the total number
+	 * of CPUs, if events are poorly ordered.
+	 */
+	int		  parallelism;
 	pid_t		  *current_tid;
 	size_t		  current_tid_sz;
 	union { /* Tool specific area */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 507e6cba95458..19db4bb315aa5 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2402,6 +2402,18 @@ bool perf_session__has_traces(struct perf_session *session, const char *msg)
 	return false;
 }
 
+bool perf_session__has_switch_events(struct perf_session *session)
+{
+	struct evsel *evsel;
+
+	evlist__for_each_entry(session->evlist, evsel) {
+		if (evsel->core.attr.context_switch)
+			return true;
+	}
+
+	return false;
+}
+
 int map__set_kallsyms_ref_reloc_sym(struct map *map, const char *symbol_name, u64 addr)
 {
 	char *bracket;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index bcf1bcf06959b..db1c120a9e67f 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -141,6 +141,7 @@ int perf_session__resolve_callchain(struct perf_session *session,
 				    struct symbol **parent);
 
 bool perf_session__has_traces(struct perf_session *session, const char *msg);
+bool perf_session__has_switch_events(struct perf_session *session);
 
 void perf_event__attr_swap(struct perf_event_attr *attr);
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 9dd60c7869a28..01e8c54d93464 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -892,6 +892,38 @@ struct sort_entry sort_cpu = {
 	.se_width_idx	= HISTC_CPU,
 };
 
+/* --sort parallelism */
+
+static int64_t
+sort__parallelism_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return right->parallelism - left->parallelism;
+}
+
+static int hist_entry__parallelism_filter(struct hist_entry *he, int type, const void *arg)
+{
+	const unsigned long *parallelism_filter = arg;
+
+	if (type != HIST_FILTER__PARALLELISM)
+		return -1;
+
+	return test_bit(he->parallelism, parallelism_filter);
+}
+
+static int hist_entry__parallelism_snprintf(struct hist_entry *he, char *bf,
+				    size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%*d", width, he->parallelism);
+}
+
+struct sort_entry sort_parallelism = {
+	.se_header      = "Parallelism",
+	.se_cmp	        = sort__parallelism_cmp,
+	.se_filter	= hist_entry__parallelism_filter,
+	.se_snprintf    = hist_entry__parallelism_snprintf,
+	.se_width_idx	= HISTC_PARALLELISM,
+};
+
 /* --sort cgroup_id */
 
 static int64_t _sort__cgroup_dev_cmp(u64 left_dev, u64 right_dev)
@@ -2532,6 +2564,7 @@ static struct sort_dimension common_sort_dimensions[] = {
 	DIM(SORT_ANNOTATE_DATA_TYPE_OFFSET, "typeoff", sort_type_offset),
 	DIM(SORT_SYM_OFFSET, "symoff", sort_sym_offset),
 	DIM(SORT_ANNOTATE_DATA_TYPE_CACHELINE, "typecln", sort_type_cacheline),
+	DIM(SORT_PARALLELISM, "parallelism", sort_parallelism),
 };
 
 #undef DIM
@@ -2593,11 +2626,13 @@ struct hpp_dimension {
 
 static struct hpp_dimension hpp_sort_dimensions[] = {
 	DIM(PERF_HPP__OVERHEAD, "overhead"),
+	DIM(PERF_HPP__WALLCLOCK, "wallclock"),
 	DIM(PERF_HPP__OVERHEAD_SYS, "overhead_sys"),
 	DIM(PERF_HPP__OVERHEAD_US, "overhead_us"),
 	DIM(PERF_HPP__OVERHEAD_GUEST_SYS, "overhead_guest_sys"),
 	DIM(PERF_HPP__OVERHEAD_GUEST_US, "overhead_guest_us"),
 	DIM(PERF_HPP__OVERHEAD_ACC, "overhead_children"),
+	DIM(PERF_HPP__WALLCLOCK_ACC, "wallclock_children"),
 	DIM(PERF_HPP__SAMPLES, "sample"),
 	DIM(PERF_HPP__PERIOD, "period"),
 	DIM(PERF_HPP__WEIGHT1, "weight1"),
@@ -2733,6 +2768,7 @@ MK_SORT_ENTRY_CHK(thread)
 MK_SORT_ENTRY_CHK(comm)
 MK_SORT_ENTRY_CHK(dso)
 MK_SORT_ENTRY_CHK(sym)
+MK_SORT_ENTRY_CHK(parallelism)
 
 
 static bool __sort__hpp_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
@@ -3715,6 +3751,21 @@ static int setup_sort_order(struct evlist *evlist)
 {
 	char *new_sort_order;
 
+	/*
+	 * Append the default sort order to "wallclock".
+	 * Just "wallclock" is not useful on its own, and there is no easy way
+	 * to express it otherwise.
+	 */
+	if (sort_order && !strcmp(sort_order, "wallclock")) {
+		if (asprintf(&new_sort_order, "%s,%s",
+			     sort_order, get_default_sort_order(evlist)) < 0) {
+			pr_err("Not enough memory to set up --sort");
+			return -ENOMEM;
+		}
+		sort_order = new_sort_order;
+		return 0;
+	}
+
 	/*
 	 * Append '+'-prefixed sort order to the default sort
 	 * order string.
@@ -3764,10 +3815,14 @@ static char *setup_overhead(char *keys)
 	if (sort__mode == SORT_MODE__DIFF)
 		return keys;
 
-	keys = prefix_if_not_in("overhead", keys);
+	if (!strstr(keys, "wallclock"))
+		keys = prefix_if_not_in("overhead", keys);
 
-	if (symbol_conf.cumulate_callchain)
+	if (symbol_conf.cumulate_callchain) {
+		if (!symbol_conf.disable_wallclock)
+			keys = prefix_if_not_in("wallclock_children", keys);
 		keys = prefix_if_not_in("overhead_children", keys);
+	}
 
 	return keys;
 }
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index a8572574e1686..11fb15f914093 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -72,6 +72,7 @@ enum sort_type {
 	SORT_ANNOTATE_DATA_TYPE_OFFSET,
 	SORT_SYM_OFFSET,
 	SORT_ANNOTATE_DATA_TYPE_CACHELINE,
+	SORT_PARALLELISM,
 
 	/* branch stack specific sort keys */
 	__SORT_BRANCH_STACK,
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 0037f11639195..38500609ea6ed 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -18,6 +18,7 @@
 #include "annotate.h"
 #include "build-id.h"
 #include "cap.h"
+#include "cpumap.h"
 #include "dso.h"
 #include "util.h" // lsdir()
 #include "debug.h"
@@ -2464,6 +2465,36 @@ int symbol__annotation_init(void)
 	return 0;
 }
 
+static int setup_parallelism_bitmap(void)
+{
+	struct perf_cpu_map *map;
+	struct perf_cpu cpu;
+	int i, err = -1;
+
+	if (symbol_conf.parallelism_list_str == NULL)
+		return 0;
+
+	map = perf_cpu_map__new(symbol_conf.parallelism_list_str);
+	if (map == NULL) {
+		pr_err("failed to parse parallelism filter list\n");
+		return -1;
+	}
+
+	bitmap_fill(symbol_conf.parallelism_filter, MAX_NR_CPUS + 1);
+	perf_cpu_map__for_each_cpu(cpu, i, map) {
+		if (cpu.cpu <= 0 || cpu.cpu > MAX_NR_CPUS) {
+			pr_err("Requested parallelism level %d is invalid.\n", cpu.cpu);
+			goto out_delete_map;
+		}
+		__clear_bit(cpu.cpu, symbol_conf.parallelism_filter);
+	}
+
+	err = 0;
+out_delete_map:
+	perf_cpu_map__put(map);
+	return err;
+}
+
 int symbol__init(struct perf_env *env)
 {
 	const char *symfs;
@@ -2483,6 +2514,9 @@ int symbol__init(struct perf_env *env)
 		return -1;
 	}
 
+	if (setup_parallelism_bitmap())
+		return -1;
+
 	if (setup_list(&symbol_conf.dso_list,
 		       symbol_conf.dso_list_str, "dso") < 0)
 		return -1;
diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
index a9c51acc722fe..7f6d6a1c773f1 100644
--- a/tools/perf/util/symbol_conf.h
+++ b/tools/perf/util/symbol_conf.h
@@ -3,6 +3,8 @@
 #define __PERF_SYMBOL_CONF 1
 
 #include <stdbool.h>
+#include <linux/bitmap.h>
+#include "perf.h"
 
 struct strlist;
 struct intlist;
@@ -47,7 +49,8 @@ struct symbol_conf {
 			keep_exited_threads,
 			annotate_data_member,
 			annotate_data_sample,
-			skip_empty;
+			skip_empty,
+			disable_wallclock;
 	const char	*vmlinux_name,
 			*kallsyms_name,
 			*source_prefix,
@@ -62,6 +65,7 @@ struct symbol_conf {
 			*pid_list_str,
 			*tid_list_str,
 			*sym_list_str,
+			*parallelism_list_str,
 			*col_width_list_str,
 			*bt_stop_list_str;
 	const char		*addr2line_path;
@@ -82,6 +86,7 @@ struct symbol_conf {
 	int		pad_output_len_dso;
 	int		group_sort_idx;
 	int		addr_range;
+	DECLARE_BITMAP(parallelism_filter, MAX_NR_CPUS + 1);
 };
 
 extern struct symbol_conf symbol_conf;

base-commit: a53496a3fa608f303904e0613c1bf2023b89db91
-- 
2.47.1.688.g23fc6f90ad-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-13 13:40     ` [PATCH v2] perf report: " Dmitry Vyukov
@ 2025-01-14  1:51       ` Namhyung Kim
  2025-01-14  8:26         ` Dmitry Vyukov
  2025-01-24 19:02       ` [PATCH v2] perf report: Add wall-clock and parallelism profiling Namhyung Kim
  1 sibling, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-01-14  1:51 UTC (permalink / raw)
  To: Dmitry Vyukov; +Cc: irogers, linux-perf-users, linux-kernel, eranian

Hello,

On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. Each second of wall-clock time we have
> number-of-cores seconds of CPU time.
> 
> Currently perf only allows to profile CPU time. Perf (and all other
> existing profilers to the be best of my knowledge) does not allow to
> profile wall-clock time.
> 
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where wall-clock profile is needed:
>  - optimzing build latency
>  - optimizing server request latency
>  - optimizing ML training/inference latency
>  - optimizing running time of any command line program
> 
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).
> 
> This patch adds wall-clock and parallelization profiling.
> See the added documentation and flags descriptions for details.
> 
> Brief outline of the implementation:
>  - add context switch collection during record
>  - calculate number of threads running on CPUs (parallelism level)
>    during report
>  - divide each sample weight by the parallelism level
> This effectively models that we were taking 1 sample per unit of
> wall-clock time.

Thanks for working on this, very interesting!

But I guess this implementation depends on cpu-cycles event and single
target process.  Do you think if it'd work for system-wide profiling?
How do you define wall-clock overhead if the event counts something
different (like the number of L3 cache-misses)?

Also I'm not sure about the impact of context switch events which could
generate a lot of records that may end up with losing some of them.  And
in that case the parallelism tracking would break.

> 
> The feature is added on an equal footing with the existing CPU profiling
> rather than a separate mode enabled with special flags. The reasoning is
> that users may not understand the problem and the meaning of numbers they
> are seeing in the first place, so won't even realize that they may need
> to be looking for some different profiling mode. When they are presented
> with 2 sets of different numbers, they should start asking questions.

I understand your point but I think it has some limitation so maybe it's
better to put in a separate mode with special flags.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14  1:51       ` Namhyung Kim
@ 2025-01-14  8:26         ` Dmitry Vyukov
  2025-01-14 15:56           ` Arnaldo Carvalho de Melo
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-14  8:26 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: irogers, linux-perf-users, linux-kernel, eranian

On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
> >
> > Currently perf only allows to profile CPU time. Perf (and all other
> > existing profilers to the be best of my knowledge) does not allow to
> > profile wall-clock time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where wall-clock profile is needed:
> >  - optimzing build latency
> >  - optimizing server request latency
> >  - optimizing ML training/inference latency
> >  - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
> >
> > This patch adds wall-clock and parallelization profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> >  - add context switch collection during record
> >  - calculate number of threads running on CPUs (parallelism level)
> >    during report
> >  - divide each sample weight by the parallelism level
> > This effectively models that we were taking 1 sample per unit of
> > wall-clock time.
>
> Thanks for working on this, very interesting!
>
> But I guess this implementation depends on cpu-cycles event and single
> target process.

Not necessarily, it works the same for PMU events, e.g. again from perf make:

Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
Overhead  Wallclock  Command          Symbol
   5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
   4.14%     14.99%  ld               [.] bfd_link_hash_traverse
   3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
   1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
   1.47%      0.09%  cc1              [k] __d_lookup_rcu
   1.38%      0.55%  sh               [k] __percpu_counter_sum
   1.22%      0.11%  cc1              [k] __audit_filter_op
   0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
   0.72%      0.47%  sh               [k] pcpu_alloc_noprof

And also it works well for multi-process targets, again most examples
in the patch are from kernel make.

> Do you think if it'd work for system-wide profiling?

Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
instead of PERF_RECORD_SWITCH which this patch handles.

It can be made to work, I just didn't want to complicate this patch more.
It can work as is, if the system activity includes mostly a single
task, e.g. you run something on command line, it's taking too long, so
you decide to do perf record -a in parallel to see what it's doing.

For system-wide profiling we do Google (GWP), it may need some
additional logic to make it useful. Namely, the system-wide
parallelism is not useful in multi-task contexts. But we can extend
report to allow us to calculate parallelism/wall-clock overhead for a
single multi-threaded process only. Or, if we auto-matically
post-process system-wide profile, then we can split the profile
per-process and calculate parallelism for each of them.

I am also not sure if we use perf report for GWP, or do manual trace
parsing. When/if this part is merged, we can think of the system-wide
story.

TODO: don't enable context-switch recording for -a mode.

> How do you define wall-clock overhead if the event counts something
> different (like the number of L3 cache-misses)?

It's the same. We calculate CPU overhead for these. Wallclock overhead
is the same, it's just the weights are adjusted proportionally.

Physical meaning is similar:
CPU overhead: how much L3 misses affect throughput/cpu consumption
Wallclock overhead: how much L3 misses affect latency

> Also I'm not sure about the impact of context switch events which could
> generate a lot of records that may end up with losing some of them.  And
> in that case the parallelism tracking would break.

I've counted SAMPLE/SWITCH events on perf make:

perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
813829
perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
46249

And this is for:
perf record -e L1-dcache-load-misses,dTLB-load-misses make

perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
1138615
perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
115433

Doesn't pretend to be scientifically accurate, but gives some idea
about proportion.
So for the number of samples that's 5% and 10% respectively.
SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
ring size it's 3-6%.

FWIW I've also considered and started implementing a different
approach where the kernel would count parallelism level for each
context and write it out with samples:
https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
Then sched_in/out would need to do atomic inc/dec on a global counter.
Not sure how hard it is to make all corner cases work there, I dropped
it half way b/c the perf record post-processing looked like a better
approach.

Lost SWITCH events may be a problem. Perf report already detects and
prints a warning about lost events. Should we stop reporting
parallelism/wall-clock if any events are lost? I just don't see it in
practice.
Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.

> > The feature is added on an equal footing with the existing CPU profiling
> > rather than a separate mode enabled with special flags. The reasoning is
> > that users may not understand the problem and the meaning of numbers they
> > are seeing in the first place, so won't even realize that they may need
> > to be looking for some different profiling mode. When they are presented
> > with 2 sets of different numbers, they should start asking questions.
>
> I understand your point but I think it has some limitation so maybe it's
> better to put in a separate mode with special flags.

I hate all options.

The 3rd option is to add a _mandary_ flag to perf record to ask for
profiling mode. Then the old "perf record" will say "nope, you need to
choose, here are your options and explanation of the options".

If we add it as a new separate mode, I fear it's doomed to be
unknown/unused and people will continue to not realize what they are
profiling, meaning of numbers, and continue to use wrong profiles for
the job.

I consider latency/wall-clock profiling as useful and as fundamental
as the current bandwidth profiling for parallel workloads (more or
less all workloads today). Yet we are in 2025 and no profilers out
there support latency profiling. Which suggests to me there is a
fundamental lack of understanding of the matter, so new optional mode
may not work well.

If you are interested for context, I used a similar mode (for pprof at
the time) to optimize TensorFlow latency (people mostly care about
latency). All existing profilers said it's 99% highly optimized and
parallelized matrix multiplication, we are done here. Yet, the
wallclock profile said, nope, it's 45% Python interpreter, which
resulted in this fix:
https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
Full story and examples of these profiles are here:
https://github.com/dvyukov/perf-load

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14  8:26         ` Dmitry Vyukov
@ 2025-01-14 15:56           ` Arnaldo Carvalho de Melo
  2025-01-14 16:07             ` Dmitry Vyukov
  2025-01-15  0:30           ` Namhyung Kim
  2025-01-15  5:59           ` Ian Rogers
  2 siblings, 1 reply; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-01-14 15:56 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, irogers, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra

Adding Ingo and PeterZ to the CC list.

On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > There are two notions of time: wall-clock time and CPU time.
> > > For a single-threaded program, or a program running on a single-core
> > > machine, these notions are the same. However, for a multi-threaded/
> > > multi-process program running on a multi-core machine, these notions are
> > > significantly different. Each second of wall-clock time we have
> > > number-of-cores seconds of CPU time.
> > >
> > > Currently perf only allows to profile CPU time. Perf (and all other
> > > existing profilers to the be best of my knowledge) does not allow to
> > > profile wall-clock time.
> > >
> > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > These profiles are complementary and are not interchangeable.
> > > Examples of where wall-clock profile is needed:
> > >  - optimzing build latency
> > >  - optimizing server request latency
> > >  - optimizing ML training/inference latency
> > >  - optimizing running time of any command line program
> > >
> > > CPU profile is useless for these use cases at best (if a user understands
> > > the difference), or misleading at worst (if a user tries to use a wrong
> > > profile for a job).
> > >
> > > This patch adds wall-clock and parallelization profiling.
> > > See the added documentation and flags descriptions for details.
> > >
> > > Brief outline of the implementation:
> > >  - add context switch collection during record
> > >  - calculate number of threads running on CPUs (parallelism level)
> > >    during report
> > >  - divide each sample weight by the parallelism level
> > > This effectively models that we were taking 1 sample per unit of
> > > wall-clock time.

> > Thanks for working on this, very interesting!

Indeed!

> > But I guess this implementation depends on cpu-cycles event and single
> > target process.
 
> Not necessarily, it works the same for PMU events, e.g. again from perf make:
 
> Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> Overhead  Wallclock  Command          Symbol
>    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
>    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
>    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
>    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
>    1.47%      0.09%  cc1              [k] __d_lookup_rcu
>    1.38%      0.55%  sh               [k] __percpu_counter_sum
>    1.22%      0.11%  cc1              [k] __audit_filter_op
>    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
>    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
 
> And also it works well for multi-process targets, again most examples
> in the patch are from kernel make.
 
> > Do you think if it'd work for system-wide profiling?
 
> Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> instead of PERF_RECORD_SWITCH which this patch handles.
 
> It can be made to work, I just didn't want to complicate this patch more.
> It can work as is, if the system activity includes mostly a single
> task, e.g. you run something on command line, it's taking too long, so
> you decide to do perf record -a in parallel to see what it's doing.
 
> For system-wide profiling we do Google (GWP), it may need some
> additional logic to make it useful. Namely, the system-wide
> parallelism is not useful in multi-task contexts. But we can extend
> report to allow us to calculate parallelism/wall-clock overhead for a
> single multi-threaded process only. Or, if we auto-matically
> post-process system-wide profile, then we can split the profile
> per-process and calculate parallelism for each of them.
 
> I am also not sure if we use perf report for GWP, or do manual trace
> parsing. When/if this part is merged, we can think of the system-wide
> story.
 
> TODO: don't enable context-switch recording for -a mode.
 
> > How do you define wall-clock overhead if the event counts something
> > different (like the number of L3 cache-misses)?
 
> It's the same. We calculate CPU overhead for these. Wallclock overhead
> is the same, it's just the weights are adjusted proportionally.
 
> Physical meaning is similar:
> CPU overhead: how much L3 misses affect throughput/cpu consumption
> Wallclock overhead: how much L3 misses affect latency
 
> > Also I'm not sure about the impact of context switch events which could
> > generate a lot of records that may end up with losing some of them.  And
> > in that case the parallelism tracking would break.
 
> I've counted SAMPLE/SWITCH events on perf make:
 
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> 813829
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> 46249
 
> And this is for:
> perf record -e L1-dcache-load-misses,dTLB-load-misses make
 
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> 1138615
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> 115433
 
> Doesn't pretend to be scientifically accurate, but gives some idea
> about proportion.

> So for the number of samples that's 5% and 10% respectively.
> SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> ring size it's 3-6%.
 
> FWIW I've also considered and started implementing a different
> approach where the kernel would count parallelism level for each
> context and write it out with samples:
> https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> Then sched_in/out would need to do atomic inc/dec on a global counter.
> Not sure how hard it is to make all corner cases work there, I dropped
> it half way b/c the perf record post-processing looked like a better
> approach.
 
> Lost SWITCH events may be a problem. Perf report already detects and
> prints a warning about lost events. Should we stop reporting
> parallelism/wall-clock if any events are lost? I just don't see it in
> practice.

> Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.

right

> > > The feature is added on an equal footing with the existing CPU profiling
> > > rather than a separate mode enabled with special flags. The reasoning is
> > > that users may not understand the problem and the meaning of numbers they
> > > are seeing in the first place, so won't even realize that they may need
> > > to be looking for some different profiling mode. When they are presented
> > > with 2 sets of different numbers, they should start asking questions.
> >
> > I understand your point but I think it has some limitation so maybe it's
> > better to put in a separate mode with special flags.
 
> I hate all options.
 
> The 3rd option is to add a _mandary_ flag to perf record to ask for
> profiling mode. Then the old "perf record" will say "nope, you need to
> choose, here are your options and explanation of the options".

Well, we have a line for tips about perf usage at the botton, for
instance, when I ran 'perf report' to check it it showed:

Tip: Add -I to perf record to sample register values, which will be visible in perf report sample context.

We could add a line with:

Tip: Use '-s wallclock' to get wallclock/latency profiling

And when using it we could have another tip:

Tip: Use 'perf config report.mode=wallclock' to change the default

⬢ [acme@toolbox pahole]$ perf config report.mode=wallclock
⬢ [acme@toolbox pahole]$ perf config report.mode
report.mode=wallclock
⬢ [acme@toolbox pahole]$

Or something along these lines.

Or we could have a popup that would appear with a "Important notice",
explaining the wallclock mode and asking the user to opt-in to having
both, just one or keep the old mode, that would be a quick, temporary
"annoyance" for people used with the current mode while bringing this
cool new mode to the attention of users.

Another idea, more general, is to be able to press some hotkey that
would toggle available sort keys, like we have, for instance, 'k' to
toggle the 'Shared object' column in the default 'perf report/top'
TUI while filtering out non-kernel samples.

But my initial feeling is that you're right and we want this wallclock
column on by default, its just a matter of finding a convenient/easy way
for users to opt-in.

Being able to quickly toggle ordering by the Overhead/Wallclock columns
also looks useful.

> If we add it as a new separate mode, I fear it's doomed to be
> unknown/unused and people will continue to not realize what they are
> profiling, meaning of numbers, and continue to use wrong profiles for
> the job.
 
> I consider latency/wall-clock profiling as useful and as fundamental
> as the current bandwidth profiling for parallel workloads (more or
> less all workloads today). Yet we are in 2025 and no profilers out
> there support latency profiling. Which suggests to me there is a
> fundamental lack of understanding of the matter, so new optional mode
> may not work well.
 
> If you are interested for context, I used a similar mode (for pprof at
> the time) to optimize TensorFlow latency (people mostly care about
> latency). All existing profilers said it's 99% highly optimized and
> parallelized matrix multiplication, we are done here. Yet, the
> wallclock profile said, nope, it's 45% Python interpreter, which
> resulted in this fix:
> https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> Full story and examples of these profiles are here:
> https://github.com/dvyukov/perf-load

Cool, I'm about to start vacation till Feb, 4, so will not be able to
test all your work, but its important work that we definetely have to
test and help in getting merged.

Thanks!

- Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14 15:56           ` Arnaldo Carvalho de Melo
@ 2025-01-14 16:07             ` Dmitry Vyukov
  2025-01-14 17:52               ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-14 16:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Namhyung Kim, irogers, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra

On Tue, 14 Jan 2025 at 16:57, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> Adding Ingo and PeterZ to the CC list.
>
> On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > > There are two notions of time: wall-clock time and CPU time.
> > > > For a single-threaded program, or a program running on a single-core
> > > > machine, these notions are the same. However, for a multi-threaded/
> > > > multi-process program running on a multi-core machine, these notions are
> > > > significantly different. Each second of wall-clock time we have
> > > > number-of-cores seconds of CPU time.
> > > >
> > > > Currently perf only allows to profile CPU time. Perf (and all other
> > > > existing profilers to the be best of my knowledge) does not allow to
> > > > profile wall-clock time.
> > > >
> > > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > > These profiles are complementary and are not interchangeable.
> > > > Examples of where wall-clock profile is needed:
> > > >  - optimzing build latency
> > > >  - optimizing server request latency
> > > >  - optimizing ML training/inference latency
> > > >  - optimizing running time of any command line program
> > > >
> > > > CPU profile is useless for these use cases at best (if a user understands
> > > > the difference), or misleading at worst (if a user tries to use a wrong
> > > > profile for a job).
> > > >
> > > > This patch adds wall-clock and parallelization profiling.
> > > > See the added documentation and flags descriptions for details.
> > > >
> > > > Brief outline of the implementation:
> > > >  - add context switch collection during record
> > > >  - calculate number of threads running on CPUs (parallelism level)
> > > >    during report
> > > >  - divide each sample weight by the parallelism level
> > > > This effectively models that we were taking 1 sample per unit of
> > > > wall-clock time.
>
> > > Thanks for working on this, very interesting!
>
> Indeed!
>
> > > But I guess this implementation depends on cpu-cycles event and single
> > > target process.
>
> > Not necessarily, it works the same for PMU events, e.g. again from perf make:
>
> > Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> > Overhead  Wallclock  Command          Symbol
> >    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
> >    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
> >    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
> >    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
> >    1.47%      0.09%  cc1              [k] __d_lookup_rcu
> >    1.38%      0.55%  sh               [k] __percpu_counter_sum
> >    1.22%      0.11%  cc1              [k] __audit_filter_op
> >    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
> >    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
>
> > And also it works well for multi-process targets, again most examples
> > in the patch are from kernel make.
>
> > > Do you think if it'd work for system-wide profiling?
>
> > Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> > instead of PERF_RECORD_SWITCH which this patch handles.
>
> > It can be made to work, I just didn't want to complicate this patch more.
> > It can work as is, if the system activity includes mostly a single
> > task, e.g. you run something on command line, it's taking too long, so
> > you decide to do perf record -a in parallel to see what it's doing.
>
> > For system-wide profiling we do Google (GWP), it may need some
> > additional logic to make it useful. Namely, the system-wide
> > parallelism is not useful in multi-task contexts. But we can extend
> > report to allow us to calculate parallelism/wall-clock overhead for a
> > single multi-threaded process only. Or, if we auto-matically
> > post-process system-wide profile, then we can split the profile
> > per-process and calculate parallelism for each of them.
>
> > I am also not sure if we use perf report for GWP, or do manual trace
> > parsing. When/if this part is merged, we can think of the system-wide
> > story.
>
> > TODO: don't enable context-switch recording for -a mode.
>
> > > How do you define wall-clock overhead if the event counts something
> > > different (like the number of L3 cache-misses)?
>
> > It's the same. We calculate CPU overhead for these. Wallclock overhead
> > is the same, it's just the weights are adjusted proportionally.
>
> > Physical meaning is similar:
> > CPU overhead: how much L3 misses affect throughput/cpu consumption
> > Wallclock overhead: how much L3 misses affect latency
>
> > > Also I'm not sure about the impact of context switch events which could
> > > generate a lot of records that may end up with losing some of them.  And
> > > in that case the parallelism tracking would break.
>
> > I've counted SAMPLE/SWITCH events on perf make:
>
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > 813829
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > 46249
>
> > And this is for:
> > perf record -e L1-dcache-load-misses,dTLB-load-misses make
>
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > 1138615
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > 115433
>
> > Doesn't pretend to be scientifically accurate, but gives some idea
> > about proportion.
>
> > So for the number of samples that's 5% and 10% respectively.
> > SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> > ring size it's 3-6%.
>
> > FWIW I've also considered and started implementing a different
> > approach where the kernel would count parallelism level for each
> > context and write it out with samples:
> > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > Not sure how hard it is to make all corner cases work there, I dropped
> > it half way b/c the perf record post-processing looked like a better
> > approach.
>
> > Lost SWITCH events may be a problem. Perf report already detects and
> > prints a warning about lost events. Should we stop reporting
> > parallelism/wall-clock if any events are lost? I just don't see it in
> > practice.
>
> > Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.
>
> right
>
> > > > The feature is added on an equal footing with the existing CPU profiling
> > > > rather than a separate mode enabled with special flags. The reasoning is
> > > > that users may not understand the problem and the meaning of numbers they
> > > > are seeing in the first place, so won't even realize that they may need
> > > > to be looking for some different profiling mode. When they are presented
> > > > with 2 sets of different numbers, they should start asking questions.
> > >
> > > I understand your point but I think it has some limitation so maybe it's
> > > better to put in a separate mode with special flags.
>
> > I hate all options.
>
> > The 3rd option is to add a _mandary_ flag to perf record to ask for
> > profiling mode. Then the old "perf record" will say "nope, you need to
> > choose, here are your options and explanation of the options".
>
> Well, we have a line for tips about perf usage at the botton, for
> instance, when I ran 'perf report' to check it it showed:
>
> Tip: Add -I to perf record to sample register values, which will be visible in perf report sample context.
>
> We could add a line with:
>
> Tip: Use '-s wallclock' to get wallclock/latency profiling

Thanks for the review!

This is an option too.
I am not very strong about the exact option we choose, I understand
what I proposed has downsides too.
I am ready to go with a solution we will agree on here.

> And when using it we could have another tip:
>
> Tip: Use 'perf config report.mode=wallclock' to change the default
>
> ⬢ [acme@toolbox pahole]$ perf config report.mode=wallclock
> ⬢ [acme@toolbox pahole]$ perf config report.mode
> report.mode=wallclock
> ⬢ [acme@toolbox pahole]$
>
> Or something along these lines.
>
> Or we could have a popup that would appear with a "Important notice",
> explaining the wallclock mode and asking the user to opt-in to having
> both, just one or keep the old mode, that would be a quick, temporary
> "annoyance" for people used with the current mode while bringing this
> cool new mode to the attention of users.

Is there an example of doing something similar in the code?


> Another idea, more general, is to be able to press some hotkey that
> would toggle available sort keys, like we have, for instance, 'k' to
> toggle the 'Shared object' column in the default 'perf report/top'
> TUI while filtering out non-kernel samples.
>
> But my initial feeling is that you're right and we want this wallclock
> column on by default, its just a matter of finding a convenient/easy way
> for users to opt-in.
>
> Being able to quickly toggle ordering by the Overhead/Wallclock columns
> also looks useful.

I've tried to do something similar, e.g. filtering by parallelism
level on-the-fly, thus this patch:
https://lore.kernel.org/linux-perf-users/CACT4Y+b8X7PnUwQtuU2QXSqumNDbN8pWDm8EX+wnvgNAKbW0xw@mail.gmail.com/T/#t
But it turned out the reload functionality is more broken than working.

I agree it would be nice to have these usability improvements.
But this is also a large change, so I would prefer to merge the basic
functionality, and then we can do some improvements on top. This is
also the first time I am touching perf code, there are too many new
things to learn for me.

> > If we add it as a new separate mode, I fear it's doomed to be
> > unknown/unused and people will continue to not realize what they are
> > profiling, meaning of numbers, and continue to use wrong profiles for
> > the job.
>
> > I consider latency/wall-clock profiling as useful and as fundamental
> > as the current bandwidth profiling for parallel workloads (more or
> > less all workloads today). Yet we are in 2025 and no profilers out
> > there support latency profiling. Which suggests to me there is a
> > fundamental lack of understanding of the matter, so new optional mode
> > may not work well.
>
> > If you are interested for context, I used a similar mode (for pprof at
> > the time) to optimize TensorFlow latency (people mostly care about
> > latency). All existing profilers said it's 99% highly optimized and
> > parallelized matrix multiplication, we are done here. Yet, the
> > wallclock profile said, nope, it's 45% Python interpreter, which
> > resulted in this fix:
> > https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> > Full story and examples of these profiles are here:
> > https://github.com/dvyukov/perf-load
>
> Cool, I'm about to start vacation till Feb, 4, so will not be able to
> test all your work, but its important work that we definetely have to
> test and help in getting merged.
>
> Thanks!
>
> - Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14 16:07             ` Dmitry Vyukov
@ 2025-01-14 17:52               ` Arnaldo Carvalho de Melo
  2025-01-14 18:16                 ` Arnaldo Carvalho de Melo
  2025-01-19 10:22                 ` Dmitry Vyukov
  0 siblings, 2 replies; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-01-14 17:52 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, irogers, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra

On Tue, Jan 14, 2025 at 05:07:14PM +0100, Dmitry Vyukov wrote:
> On Tue, 14 Jan 2025 at 16:57, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > I understand your point but I think it has some limitation so maybe it's
> > > > better to put in a separate mode with special flags.

> > > I hate all options.

> > > The 3rd option is to add a _mandary_ flag to perf record to ask for
> > > profiling mode. Then the old "perf record" will say "nope, you need to
> > > choose, here are your options and explanation of the options".

> > Well, we have a line for tips about perf usage at the botton, for
> > instance, when I ran 'perf report' to check it it showed:

> > Tip: Add -I to perf record to sample register values, which will be visible in perf report sample context.

> > We could add a line with:

> > Tip: Use '-s wallclock' to get wallclock/latency profiling
 
> Thanks for the review!
 
> This is an option too.
> I am not very strong about the exact option we choose, I understand
> what I proposed has downsides too.
> I am ready to go with a solution we will agree on here.
 
> > And when using it we could have another tip:
> >
> > Tip: Use 'perf config report.mode=wallclock' to change the default
> >
> > ⬢ [acme@toolbox pahole]$ perf config report.mode=wallclock
> > ⬢ [acme@toolbox pahole]$ perf config report.mode
> > report.mode=wallclock
> > ⬢ [acme@toolbox pahole]$
> >
> > Or something along these lines.

> > Or we could have a popup that would appear with a "Important notice",
> > explaining the wallclock mode and asking the user to opt-in to having
> > both, just one or keep the old mode, that would be a quick, temporary
> > "annoyance" for people used with the current mode while bringing this
> > cool new mode to the attention of users.
 
> Is there an example of doing something similar in the code?

Sure, for the config part look at "report.children", that is processed
in tools/perf/builtin-report.c report__config() function:

        if (!strcmp(var, "report.children")) {
                symbol_conf.cumulate_callchain = perf_config_bool(var, value);
                return 0;
        }

To test, capture with callchains:

root@number:~# perf record -g -a -e cpu_core/cycles/P sleep 5
[ perf record: Woken up 97 times to write data ]
[ perf record: Captured and wrote 28.084 MB perf.data (194988 samples) ]
root@number:~#
root@number:~# perf evlist
cpu_core/cycles/P
dummy:u
root@number:~#

The default ends up being:

root@number:~# perf report
Samples: 194K of event 'cpu_core/cycles/P', Event count (approx.): 242759929168
  Children      Self  Command   Shared Object       Symbol
+    9.25%     0.00%  cc1       [unknown]           [.] 0000000000000000
+    7.23%     0.05%  cc1       [kernel.kallsyms]   [k] entry_SYSCALL_64
+    7.09%     0.02%  cc1       [kernel.kallsyms]   [k] do_syscall_64
+    4.43%     0.02%  cc1       [kernel.kallsyms]   [k] asm_exc_page_fault
+    3.89%     0.06%  cc1       libc.so.6           [.] __GI___getcwd

The other mode:

root@number:~# perf report --no-children
Samples: 194K of event 'cpu_core/cycles/P', Event count (approx.): 242759929168
  Overhead  Command   Shared Object   Symbol
+    2.43%  cc1       cc1             [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
+    1.62%  cc1       cc1             [.] _cpp_lex_direct
+    1.34%  cc1       cc1             [.] iterative_hash
+    1.09%  cc1       cc1             [.] bitmap_set_bit(bitmap_head*, int)
+    0.98%  cc1       libc.so.6       [.] sysmalloc
+    0.97%  cc1       libc.so.6       [.] _int_malloc

So people not liking the default do:

root@number:~# perf config report.children=no
root@number:~# perf config report.children
report.children=no
root@number:~# 

To avoid having to use --no-children.

See tools/perf/Documentation/perf-config.txt for more of these config
options.

We don't have the popup that sets the ~/.perfconfig variable, but I
think it is easily possible using ui_browser__dialog_yesno()... ok, look
at the patch below, with it:

root@x1:~# rm -f ~/.perfconfig 
root@x1:~# perf config annotate.disassemblers=objdump,llvm
root@x1:~# cat ~/.perfconfig 
# this file is auto-generated.
[annotate]
	disassemblers = objdump,llvm
root@x1:~# perf report
root@x1:~# cat ~/.perfconfig 
# this file is auto-generated.
[annotate]
	disassemblers = objdump,llvm
[report]
	wallclock = yes
root@x1:~# 

So with it when the user doesn't have that "report.wallclock=yes" (or
"no") set, i.e. everybody, the question will be made, once answered it
will not be made again.

> > Another idea, more general, is to be able to press some hotkey that
> > would toggle available sort keys, like we have, for instance, 'k' to
> > toggle the 'Shared object' column in the default 'perf report/top'
> > TUI while filtering out non-kernel samples.
> >
> > But my initial feeling is that you're right and we want this wallclock
> > column on by default, its just a matter of finding a convenient/easy way
> > for users to opt-in.
> >
> > Being able to quickly toggle ordering by the Overhead/Wallclock columns
> > also looks useful.
> 
> I've tried to do something similar, e.g. filtering by parallelism
> level on-the-fly, thus this patch:
> https://lore.kernel.org/linux-perf-users/CACT4Y+b8X7PnUwQtuU2QXSqumNDbN8pWDm8EX+wnvgNAKbW0xw@mail.gmail.com/T/#t
> But it turned out the reload functionality is more broken than working.
> 
> I agree it would be nice to have these usability improvements.

Right, and I'm not talking about those as a requirement for your patch
to get accepted.

> But this is also a large change, so I would prefer to merge the basic
> functionality, and then we can do some improvements on top. This is
> also the first time I am touching perf code, there are too many new
> things to learn for me.

I'm happy you made the effort and will try to help you as much as I can.

I agree that, with the patch below, that I'll try to split and merge
before travelling, we'll have a good starting point.

Here is the patch, I'll split the perf_config__set_variable into a
separate patch and merge, so that you can continue from there. The
"<LONGER EXPLANATION>" part should tell the user that 'perf config
report.workload=yes|no' can be done later, to switch the mode, if one
doesn't want to continue with it as default.

Thanks,

- Arnaldo

diff --git a/tools/perf/builtin-config.c b/tools/perf/builtin-config.c
index 2e8363778935e8d5..45b5312fbe8370e8 100644
--- a/tools/perf/builtin-config.c
+++ b/tools/perf/builtin-config.c
@@ -154,6 +154,44 @@ static int parse_config_arg(char *arg, char **var, char **value)
 	return 0;
 }
 
+int perf_config__set_variable(const char *var, const char *value)
+{
+	char path[PATH_MAX];
+	char *user_config = mkpath(path, sizeof(path), "%s/.perfconfig", getenv("HOME"));
+	const char *config_filename;
+	struct perf_config_set *set;
+	int ret = -1;
+
+	if (use_system_config)
+		config_exclusive_filename = perf_etc_perfconfig();
+	else if (use_user_config)
+		config_exclusive_filename = user_config;
+
+	if (!config_exclusive_filename)
+		config_filename = user_config;
+	else
+		config_filename = config_exclusive_filename;
+
+	set = perf_config_set__new();
+	if (!set)
+		goto out_err;
+
+	if (perf_config_set__collect(set, config_filename, var, value) < 0) {
+		pr_err("Failed to add '%s=%s'\n", var, value);
+		goto out_err;
+	}
+
+	if (set_config(set, config_filename) < 0) {
+		pr_err("Failed to set the configs on %s\n", config_filename);
+		goto out_err;
+	}
+
+	ret = 0;
+out_err:
+	perf_config_set__delete(set);
+	return ret;
+}
+
 int cmd_config(int argc, const char **argv)
 {
 	int i, ret = -1;
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index f5fbd670d619a32a..ccd6eef8ece6e238 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -50,6 +50,7 @@
 #include "util/units.h"
 #include "util/util.h" // perf_tip()
 #include "ui/ui.h"
+#include "ui/util.h"
 #include "ui/progress.h"
 #include "util/block-info.h"
 
@@ -99,6 +100,8 @@ struct report {
 	bool			disable_order;
 	bool			skip_empty;
 	bool			data_type;
+	bool			wallclock_config_set;
+	bool			use_wallclock;
 	int			max_stack;
 	struct perf_read_values	show_threads_values;
 	const char		*pretty_printing_style;
@@ -151,6 +154,11 @@ static int report__config(const char *var, const char *value, void *cb)
 		}
 		return 0;
 	}
+	if (!strcmp(var, "report.wallclock")) {
+		rep->use_wallclock = perf_config_bool(var, value);
+		rep->wallclock_config_set = true;
+		return 0;
+	}
 
 	if (!strcmp(var, "report.skip-empty")) {
 		rep->skip_empty = perf_config_bool(var, value);
@@ -1089,6 +1097,15 @@ static int __cmd_report(struct report *rep)
 
 	report__warn_kptr_restrict(rep);
 
+	if (!rep->wallclock_config_set) {
+		if (ui__dialog_yesno("Do you want to use wallclock/latency mode?\n"
+				     "<LONGER EXPLANATION>\n")) {
+			rep->use_wallclock = true;
+			// do other prep to add the "wallclock" key to the sort order, etc
+			perf_config__set_variable("report.wallclock", "yes");
+		}
+	}
+
 	evlist__for_each_entry(session->evlist, pos)
 		rep->nr_entries += evsel__hists(pos)->nr_entries;
 
diff --git a/tools/perf/util/config.h b/tools/perf/util/config.h
index 9971313d61c1e5c6..a727c95cb1195e06 100644
--- a/tools/perf/util/config.h
+++ b/tools/perf/util/config.h
@@ -50,6 +50,7 @@ int perf_config_set__collect(struct perf_config_set *set, const char *file_name,
 			     const char *var, const char *value);
 void perf_config__exit(void);
 void perf_config__refresh(void);
+int perf_config__set_variable(const char *var, const char *value);
 
 /**
  * perf_config_sections__for_each - iterate thru all the sections

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14 17:52               ` Arnaldo Carvalho de Melo
@ 2025-01-14 18:16                 ` Arnaldo Carvalho de Melo
  2025-01-19 10:22                 ` Dmitry Vyukov
  1 sibling, 0 replies; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-01-14 18:16 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, irogers, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra

On Tue, Jan 14, 2025 at 02:53:03PM -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Jan 14, 2025 at 05:07:14PM +0100, Dmitry Vyukov wrote:
> > I agree it would be nice to have these usability improvements.
 
> Right, and I'm not talking about those as a requirement for your patch
> to get accepted.
 
> > But this is also a large change, so I would prefer to merge the basic
> > functionality, and then we can do some improvements on top. This is
> > also the first time I am touching perf code, there are too many new
> > things to learn for me.
 
> I'm happy you made the effort and will try to help you as much as I can.
 
> I agree that, with the patch below, that I'll try to split and merge
> before travelling, we'll have a good starting point.
> 
> Here is the patch, I'll split the perf_config__set_variable into a
> separate patch and merge, so that you can continue from there. The
> "<LONGER EXPLANATION>" part should tell the user that 'perf config
> report.workload=yes|no' can be done later, to switch the mode, if one
> doesn't want to continue with it as default.

Its now in the tmp.perf-tools-next branch at:

https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git

I'm testing it together with a batch of patches one of which I had to
bisect and remove, soon will be in the main perf-tools-next branch.

I mean this patch:

"perf config: Add a function to set one variable in .perfconfig"
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/?h=tmp.perf-tools-next

Thanks,

- Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14  8:26         ` Dmitry Vyukov
  2025-01-14 15:56           ` Arnaldo Carvalho de Melo
@ 2025-01-15  0:30           ` Namhyung Kim
  2025-01-19 10:50             ` Dmitry Vyukov
  2025-01-15  5:59           ` Ian Rogers
  2 siblings, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-01-15  0:30 UTC (permalink / raw)
  To: Dmitry Vyukov; +Cc: irogers, linux-perf-users, linux-kernel, eranian

On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > There are two notions of time: wall-clock time and CPU time.
> > > For a single-threaded program, or a program running on a single-core
> > > machine, these notions are the same. However, for a multi-threaded/
> > > multi-process program running on a multi-core machine, these notions are
> > > significantly different. Each second of wall-clock time we have
> > > number-of-cores seconds of CPU time.
> > >
> > > Currently perf only allows to profile CPU time. Perf (and all other
> > > existing profilers to the be best of my knowledge) does not allow to
> > > profile wall-clock time.
> > >
> > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > These profiles are complementary and are not interchangeable.
> > > Examples of where wall-clock profile is needed:
> > >  - optimzing build latency
> > >  - optimizing server request latency
> > >  - optimizing ML training/inference latency
> > >  - optimizing running time of any command line program
> > >
> > > CPU profile is useless for these use cases at best (if a user understands
> > > the difference), or misleading at worst (if a user tries to use a wrong
> > > profile for a job).
> > >
> > > This patch adds wall-clock and parallelization profiling.
> > > See the added documentation and flags descriptions for details.
> > >
> > > Brief outline of the implementation:
> > >  - add context switch collection during record
> > >  - calculate number of threads running on CPUs (parallelism level)
> > >    during report
> > >  - divide each sample weight by the parallelism level
> > > This effectively models that we were taking 1 sample per unit of
> > > wall-clock time.
> >
> > Thanks for working on this, very interesting!
> >
> > But I guess this implementation depends on cpu-cycles event and single
> > target process.
> 
> Not necessarily, it works the same for PMU events, e.g. again from perf make:

Of course it would work with other events.  But the thing is how to
interprete the result IMHO.

> 
> Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> Overhead  Wallclock  Command          Symbol
>    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
>    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
>    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
>    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
>    1.47%      0.09%  cc1              [k] __d_lookup_rcu
>    1.38%      0.55%  sh               [k] __percpu_counter_sum
>    1.22%      0.11%  cc1              [k] __audit_filter_op
>    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
>    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
> 
> And also it works well for multi-process targets, again most examples
> in the patch are from kernel make.

It's the single-origin case.  I'm curious about the multiple-origin
cases like profiling two or more existing processes together.  How do
you determine the initial parallelism?

Also I think the parallelism is defined in a process (group).  Does it
still make sense in system-wide profiling?

> 
> > Do you think if it'd work for system-wide profiling?
> 
> Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> instead of PERF_RECORD_SWITCH which this patch handles.
> 
> It can be made to work, I just didn't want to complicate this patch more.

Ok.

> It can work as is, if the system activity includes mostly a single
> task, e.g. you run something on command line, it's taking too long, so
> you decide to do perf record -a in parallel to see what it's doing.
> 
> For system-wide profiling we do Google (GWP), it may need some
> additional logic to make it useful. Namely, the system-wide
> parallelism is not useful in multi-task contexts. But we can extend
> report to allow us to calculate parallelism/wall-clock overhead for a
> single multi-threaded process only. Or, if we auto-matically
> post-process system-wide profile, then we can split the profile
> per-process and calculate parallelism for each of them.
> 
> I am also not sure if we use perf report for GWP, or do manual trace
> parsing. When/if this part is merged, we can think of the system-wide
> story.
> 
> TODO: don't enable context-switch recording for -a mode.
> 
> 
> > How do you define wall-clock overhead if the event counts something
> > different (like the number of L3 cache-misses)?
> 
> It's the same. We calculate CPU overhead for these. Wallclock overhead
> is the same, it's just the weights are adjusted proportionally.

I got this part but ...

> 
> Physical meaning is similar:
> CPU overhead: how much L3 misses affect throughput/cpu consumption
> Wallclock overhead: how much L3 misses affect latency

I'm not sure if it's correct.  The event would tell where you get the L3
miss mostly.  It doesn't necessarily mean latency.  IOW more L3 miss
doesn't mean more CPU time.  I guess that's why perf doesn't say it as
"CPU overhead".

Hmm.. maybe you could rename it differently?

> 
> 
> > Also I'm not sure about the impact of context switch events which could
> > generate a lot of records that may end up with losing some of them.  And
> > in that case the parallelism tracking would break.
> 
> I've counted SAMPLE/SWITCH events on perf make:
> 
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> 813829
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> 46249
> 
> And this is for:
> perf record -e L1-dcache-load-misses,dTLB-load-misses make
> 
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> 1138615
> perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> 115433
> 
> Doesn't pretend to be scientifically accurate, but gives some idea
> about proportion.
> So for the number of samples that's 5% and 10% respectively.

I think it depends on the workload.  In a very loaded system with short
timeslice or something, you could get a lot more sched switches.  In
your workload, `make` may have a long single-threaded portion.


> SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> ring size it's 3-6%.
> 
> FWIW I've also considered and started implementing a different
> approach where the kernel would count parallelism level for each
> context and write it out with samples:
> https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> Then sched_in/out would need to do atomic inc/dec on a global counter.

It seems you counted parallelism in an event (including inherited ones).
But usually perf tool opens a separate FD on each CPU for an event (like
cycles), then you need to adjust the value across all CPUS.  I'm not
sure if it's doable in this approach.


> Not sure how hard it is to make all corner cases work there, I dropped
> it half way b/c the perf record post-processing looked like a better
> approach.

Probably.

> 
> Lost SWITCH events may be a problem. Perf report already detects and
> prints a warning about lost events. Should we stop reporting
> parallelism/wall-clock if any events are lost? I just don't see it in
> practice.

It happens quite often when there are many activities.  Probably we can
consider stopping when it detects some lost records.


> Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.

It'd be nice, but given the rate of sched switch I guess it'd most
likely contain some.

> 
> 
> > > The feature is added on an equal footing with the existing CPU profiling
> > > rather than a separate mode enabled with special flags. The reasoning is
> > > that users may not understand the problem and the meaning of numbers they
> > > are seeing in the first place, so won't even realize that they may need
> > > to be looking for some different profiling mode. When they are presented
> > > with 2 sets of different numbers, they should start asking questions.
> >
> > I understand your point but I think it has some limitation so maybe it's
> > better to put in a separate mode with special flags.
> 
> I hate all options.
> 
> The 3rd option is to add a _mandary_ flag to perf record to ask for
> profiling mode. Then the old "perf record" will say "nope, you need to
> choose, here are your options and explanation of the options".
> 
> If we add it as a new separate mode, I fear it's doomed to be
> unknown/unused and people will continue to not realize what they are
> profiling, meaning of numbers, and continue to use wrong profiles for
> the job.

I think people will find it out if it really works well! :)

> 
> I consider latency/wall-clock profiling as useful and as fundamental
> as the current bandwidth profiling for parallel workloads (more or
> less all workloads today). Yet we are in 2025 and no profilers out
> there support latency profiling. Which suggests to me there is a
> fundamental lack of understanding of the matter, so new optional mode
> may not work well.

I'm open to make it default if we can make sure it works well and have a
clear interpretation of the result.

> 
> If you are interested for context, I used a similar mode (for pprof at
> the time) to optimize TensorFlow latency (people mostly care about
> latency). All existing profilers said it's 99% highly optimized and
> parallelized matrix multiplication, we are done here. Yet, the
> wallclock profile said, nope, it's 45% Python interpreter, which
> resulted in this fix:
> https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> Full story and examples of these profiles are here:
> https://github.com/dvyukov/perf-load

Great examples!  I'll take a look later.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14  8:26         ` Dmitry Vyukov
  2025-01-14 15:56           ` Arnaldo Carvalho de Melo
  2025-01-15  0:30           ` Namhyung Kim
@ 2025-01-15  5:59           ` Ian Rogers
  2025-01-15  7:11             ` Dmitry Vyukov
  2 siblings, 1 reply; 22+ messages in thread
From: Ian Rogers @ 2025-01-15  5:59 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra, chu howard

On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@google.com> wrote:
[snip]
> FWIW I've also considered and started implementing a different
> approach where the kernel would count parallelism level for each
> context and write it out with samples:
> https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> Then sched_in/out would need to do atomic inc/dec on a global counter.
> Not sure how hard it is to make all corner cases work there, I dropped
> it half way b/c the perf record post-processing looked like a better
> approach.

Nice. Just to focus on this point and go off on something of a
tangent. I worry a little about perf_event_sample_format where we've
used 25 out of the 64 bits of sample_type. Perhaps there will be a
sample_type2 in the future. For the code and data page size it seems
the same information could come from mmap events. You have a similar
issue. I was thinking of another similar issue, adding information
about the number of dirty pages in a VMA. I wonder if there is a
better way to organize these things, rather than just keep using up
bits in the perf_event_sample_format. For example, we could have a
code page size software event that when in a leader sampling group
with a hardware event with a sample IP provides the code page size
information of the leader event's sample IP. We have loads of space in
the types and config values to have an endless number of such events
and maybe the value could be generated by a BPF program for yet more
flexibility. What these events would mean without a leader sample
event I'm not sure.

Wrt wall clock time, Howard Chu has done some work advancing off-CPU
sampling. Wall clock time being off CPU plus on CPU. We need to do
something to move forward the default flags/options for perf record,
for example, we don't enable build ID mmap events by default causing
the whole perf.data file to be scanned looking to add build ID events
for the dsos with samples in them. One option that could be a default
could be off-CPU profiling, and when permissions deny the BPF approach
we can fallback on using events. If these events are there by default
then it makes sense to hook them up in perf report.

Wrt perf report, I keep trying to push the python support in perf
forward. These unmerged changes show an event being parsed, and ring
buffer based sampling in a reasonably small number of lines of code in
a way not dissimilar to a perf command line:
https://lore.kernel.org/lkml/20250109075108.7651-12-irogers@google.com/
Building a better UI on top of this in python means there are some
reasonable frameworks that can be leveraged, I particularly like the
look of textual:
https://github.com/textualize/textual-demo
which imo would move things a lot further forward than UI stuff in C
and slang/stdio.

Sorry for all this tangential stuff, I like the work and will try to
delve into specifics later.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-15  5:59           ` Ian Rogers
@ 2025-01-15  7:11             ` Dmitry Vyukov
  2025-01-16 18:55               ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-15  7:11 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Namhyung Kim, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra, chu howard

On Wed, 15 Jan 2025 at 06:59, Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> [snip]
> > FWIW I've also considered and started implementing a different
> > approach where the kernel would count parallelism level for each
> > context and write it out with samples:
> > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > Not sure how hard it is to make all corner cases work there, I dropped
> > it half way b/c the perf record post-processing looked like a better
> > approach.
>
> Nice. Just to focus on this point and go off on something of a
> tangent. I worry a little about perf_event_sample_format where we've
> used 25 out of the 64 bits of sample_type. Perhaps there will be a
> sample_type2 in the future. For the code and data page size it seems
> the same information could come from mmap events. You have a similar
> issue. I was thinking of another similar issue, adding information
> about the number of dirty pages in a VMA. I wonder if there is a
> better way to organize these things, rather than just keep using up
> bits in the perf_event_sample_format. For example, we could have a
> code page size software event that when in a leader sampling group
> with a hardware event with a sample IP provides the code page size
> information of the leader event's sample IP. We have loads of space in
> the types and config values to have an endless number of such events
> and maybe the value could be generated by a BPF program for yet more
> flexibility. What these events would mean without a leader sample
> event I'm not sure.

In the end I did not go with adding parallelism to each sample (this
is purely perf report change), so at least for this patch this is very
tangential :)

> Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> sampling. Wall clock time being off CPU plus on CPU. We need to do
> something to move forward the default flags/options for perf record,
> for example, we don't enable build ID mmap events by default causing
> the whole perf.data file to be scanned looking to add build ID events
> for the dsos with samples in them. One option that could be a default
> could be off-CPU profiling, and when permissions deny the BPF approach
> we can fallback on using events. If these events are there by default
> then it makes sense to hook them up in perf report.

Interesting. Do you mean "IO" by "off-CPU".
Yes, if a program was blocked for IO for 10 seconds (no CPU work),
then that obviously contributes to latency, but won't be in this
profile. Though, it still works well for a large number of important
cases (e.g. builds, ML inference, server request handling are
frequently not IO bound).

I was thinking how IO can be accounted for in the wall-clock profile.
Since we have SWITCH OUT events (and they already include preemption
bit), we do have info to account for blocked threads. But it gets
somewhat complex and has to make some hypotheses b/c not all blocked
threads contribute to latency (e.g. blocked watchdog thread). So I
left it out for now.

The idea was as follows.
We know a set of threads blocked at any point in time (switched out,
but not preempted).
We hypothesise that each of these could equally improve CPU load to
max if/when unblocked.
We inject synthetic samples in a leaf "IO wait" symbol with the weight
according to the hypothesis. I think these events can be  injected
before each switch in event only (which changes the set of blocked
threads).

Namely:
If CPU load is already at max (parallelism == num cpus), we don't
inject any IO wait events.
If the number of blocked threads is 0, we don't inject any IO wait events.
If there are C idle CPUs and B blocked threads, we inject IO wait
events with weight C/B for each of them.
For example, if there is a single blocked thread, then we hypothesise
that this blocked thread is the root cause of all currently idle CPUs
being idle.

This still has a problem of saying that unrelated threads contribute
to latency, but at least it's a simple/explainable model and it should
show guilty threads as well. Maybe unrelated threads can be filtered
by the user by specifying a set of symbols in stacks of unrelated
threads.

Does it make any sense? Do you have anything better?

> Wrt perf report, I keep trying to push the python support in perf
> forward. These unmerged changes show an event being parsed, and ring
> buffer based sampling in a reasonably small number of lines of code in
> a way not dissimilar to a perf command line:
> https://lore.kernel.org/lkml/20250109075108.7651-12-irogers@google.com/
> Building a better UI on top of this in python means there are some
> reasonable frameworks that can be leveraged, I particularly like the
> look of textual:
> https://github.com/textualize/textual-demo
> which imo would move things a lot further forward than UI stuff in C
> and slang/stdio.
>
> Sorry for all this tangential stuff, I like the work and will try to
> delve into specifics later.
>
> Thanks,
> Ian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-15  7:11             ` Dmitry Vyukov
@ 2025-01-16 18:55               ` Namhyung Kim
  2025-01-19 11:08                 ` Off-CPU sampling (was perf report: Add wall-clock and parallelism profiling) Dmitry Vyukov
  0 siblings, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-01-16 18:55 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Ian Rogers, linux-perf-users, linux-kernel, eranian, Ingo Molnar,
	Peter Zijlstra, chu howard

On Wed, Jan 15, 2025 at 08:11:51AM +0100, Dmitry Vyukov wrote:
> On Wed, 15 Jan 2025 at 06:59, Ian Rogers <irogers@google.com> wrote:
> >
> > On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > [snip]
> > > FWIW I've also considered and started implementing a different
> > > approach where the kernel would count parallelism level for each
> > > context and write it out with samples:
> > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > > Not sure how hard it is to make all corner cases work there, I dropped
> > > it half way b/c the perf record post-processing looked like a better
> > > approach.
> >
> > Nice. Just to focus on this point and go off on something of a
> > tangent. I worry a little about perf_event_sample_format where we've
> > used 25 out of the 64 bits of sample_type. Perhaps there will be a
> > sample_type2 in the future. For the code and data page size it seems
> > the same information could come from mmap events. You have a similar
> > issue. I was thinking of another similar issue, adding information
> > about the number of dirty pages in a VMA. I wonder if there is a
> > better way to organize these things, rather than just keep using up
> > bits in the perf_event_sample_format. For example, we could have a
> > code page size software event that when in a leader sampling group
> > with a hardware event with a sample IP provides the code page size
> > information of the leader event's sample IP. We have loads of space in
> > the types and config values to have an endless number of such events
> > and maybe the value could be generated by a BPF program for yet more
> > flexibility. What these events would mean without a leader sample
> > event I'm not sure.
> 
> In the end I did not go with adding parallelism to each sample (this
> is purely perf report change), so at least for this patch this is very
> tangential :)
> 
> > Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> > sampling. Wall clock time being off CPU plus on CPU. We need to do
> > something to move forward the default flags/options for perf record,
> > for example, we don't enable build ID mmap events by default causing
> > the whole perf.data file to be scanned looking to add build ID events
> > for the dsos with samples in them. One option that could be a default
> > could be off-CPU profiling, and when permissions deny the BPF approach
> > we can fallback on using events. If these events are there by default
> > then it makes sense to hook them up in perf report.
> 
> Interesting. Do you mean "IO" by "off-CPU".
> Yes, if a program was blocked for IO for 10 seconds (no CPU work),
> then that obviously contributes to latency, but won't be in this
> profile. Though, it still works well for a large number of important
> cases (e.g. builds, ML inference, server request handling are
> frequently not IO bound).
> 
> I was thinking how IO can be accounted for in the wall-clock profile.
> Since we have SWITCH OUT events (and they already include preemption
> bit), we do have info to account for blocked threads. But it gets
> somewhat complex and has to make some hypotheses b/c not all blocked
> threads contribute to latency (e.g. blocked watchdog thread). So I
> left it out for now.
> 
> The idea was as follows.
> We know a set of threads blocked at any point in time (switched out,
> but not preempted).
> We hypothesise that each of these could equally improve CPU load to
> max if/when unblocked.
> We inject synthetic samples in a leaf "IO wait" symbol with the weight
> according to the hypothesis. I think these events can be  injected
> before each switch in event only (which changes the set of blocked
> threads).
> 
> Namely:
> If CPU load is already at max (parallelism == num cpus), we don't
> inject any IO wait events.
> If the number of blocked threads is 0, we don't inject any IO wait events.
> If there are C idle CPUs and B blocked threads, we inject IO wait
> events with weight C/B for each of them.

To track idle CPUs, you need sched-switch of all CPUs regardless of your
workload, right?  Also I'm not sure when do you want to inject the IO
wait events - when a thread is sched-out without preemption?  And what
would be the weight?  I guess you want something like:

  blocked time * C / B

Then C and B can change before the thread is woken up.

> For example, if there is a single blocked thread, then we hypothesise
> that this blocked thread is the root cause of all currently idle CPUs
> being idle.

I think this may make sense when you target a single process group but
it also needs system-wide idle information.

> 
> This still has a problem of saying that unrelated threads contribute
> to latency, but at least it's a simple/explainable model and it should
> show guilty threads as well. Maybe unrelated threads can be filtered
> by the user by specifying a set of symbols in stacks of unrelated
> threads.

Or by task name.

> 
> Does it make any sense? Do you have anything better?

I'm not sure if it's right to use idle state which will be affected by
unrelated processes.  Maybe it's good for system-wide profiling.

For a process (group) profiliing, I think you need to consider number of
total threads, active threads, and CPUs.  And if the #active-threads is
less than min(#total-threads, #CPUs), then it could be considered as
idle from the workload's perspective.

What do you think?

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-14 17:52               ` Arnaldo Carvalho de Melo
  2025-01-14 18:16                 ` Arnaldo Carvalho de Melo
@ 2025-01-19 10:22                 ` Dmitry Vyukov
  1 sibling, 0 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-19 10:22 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Namhyung Kim, irogers, linux-perf-users, linux-kernel, eranian,
	Ingo Molnar, Peter Zijlstra

On Tue, 14 Jan 2025 at 18:53, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Tue, Jan 14, 2025 at 05:07:14PM +0100, Dmitry Vyukov wrote:
> > On Tue, 14 Jan 2025 at 16:57, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > > > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > I understand your point but I think it has some limitation so maybe it's
> > > > > better to put in a separate mode with special flags.
>
> > > > I hate all options.
>
> > > > The 3rd option is to add a _mandary_ flag to perf record to ask for
> > > > profiling mode. Then the old "perf record" will say "nope, you need to
> > > > choose, here are your options and explanation of the options".
>
> > > Well, we have a line for tips about perf usage at the botton, for
> > > instance, when I ran 'perf report' to check it it showed:
>
> > > Tip: Add -I to perf record to sample register values, which will be visible in perf report sample context.
>
> > > We could add a line with:
>
> > > Tip: Use '-s wallclock' to get wallclock/latency profiling
>
> > Thanks for the review!
>
> > This is an option too.
> > I am not very strong about the exact option we choose, I understand
> > what I proposed has downsides too.
> > I am ready to go with a solution we will agree on here.
>
> > > And when using it we could have another tip:
> > >
> > > Tip: Use 'perf config report.mode=wallclock' to change the default
> > >
> > > ⬢ [acme@toolbox pahole]$ perf config report.mode=wallclock
> > > ⬢ [acme@toolbox pahole]$ perf config report.mode
> > > report.mode=wallclock
> > > ⬢ [acme@toolbox pahole]$
> > >
> > > Or something along these lines.
>
> > > Or we could have a popup that would appear with a "Important notice",
> > > explaining the wallclock mode and asking the user to opt-in to having
> > > both, just one or keep the old mode, that would be a quick, temporary
> > > "annoyance" for people used with the current mode while bringing this
> > > cool new mode to the attention of users.
>
> > Is there an example of doing something similar in the code?
>
> Sure, for the config part look at "report.children", that is processed
> in tools/perf/builtin-report.c report__config() function:
>
>         if (!strcmp(var, "report.children")) {
>                 symbol_conf.cumulate_callchain = perf_config_bool(var, value);
>                 return 0;
>         }
>
> To test, capture with callchains:
>
> root@number:~# perf record -g -a -e cpu_core/cycles/P sleep 5
> [ perf record: Woken up 97 times to write data ]
> [ perf record: Captured and wrote 28.084 MB perf.data (194988 samples) ]
> root@number:~#
> root@number:~# perf evlist
> cpu_core/cycles/P
> dummy:u
> root@number:~#
>
> The default ends up being:
>
> root@number:~# perf report
> Samples: 194K of event 'cpu_core/cycles/P', Event count (approx.): 242759929168
>   Children      Self  Command   Shared Object       Symbol
> +    9.25%     0.00%  cc1       [unknown]           [.] 0000000000000000
> +    7.23%     0.05%  cc1       [kernel.kallsyms]   [k] entry_SYSCALL_64
> +    7.09%     0.02%  cc1       [kernel.kallsyms]   [k] do_syscall_64
> +    4.43%     0.02%  cc1       [kernel.kallsyms]   [k] asm_exc_page_fault
> +    3.89%     0.06%  cc1       libc.so.6           [.] __GI___getcwd
>
> The other mode:
>
> root@number:~# perf report --no-children
> Samples: 194K of event 'cpu_core/cycles/P', Event count (approx.): 242759929168
>   Overhead  Command   Shared Object   Symbol
> +    2.43%  cc1       cc1             [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
> +    1.62%  cc1       cc1             [.] _cpp_lex_direct
> +    1.34%  cc1       cc1             [.] iterative_hash
> +    1.09%  cc1       cc1             [.] bitmap_set_bit(bitmap_head*, int)
> +    0.98%  cc1       libc.so.6       [.] sysmalloc
> +    0.97%  cc1       libc.so.6       [.] _int_malloc
>
> So people not liking the default do:
>
> root@number:~# perf config report.children=no
> root@number:~# perf config report.children
> report.children=no
> root@number:~#
>
> To avoid having to use --no-children.
>
> See tools/perf/Documentation/perf-config.txt for more of these config
> options.

Oh, interesting. Didn't know about this.
Yes, we probably can use this for wallclock profiling.

> We don't have the popup that sets the ~/.perfconfig variable, but I
> think it is easily possible using ui_browser__dialog_yesno()... ok, look
> at the patch below, with it:
>
> root@x1:~# rm -f ~/.perfconfig
> root@x1:~# perf config annotate.disassemblers=objdump,llvm
> root@x1:~# cat ~/.perfconfig
> # this file is auto-generated.
> [annotate]
>         disassemblers = objdump,llvm
> root@x1:~# perf report
> root@x1:~# cat ~/.perfconfig
> # this file is auto-generated.
> [annotate]
>         disassemblers = objdump,llvm
> [report]
>         wallclock = yes
> root@x1:~#
>
> So with it when the user doesn't have that "report.wallclock=yes" (or
> "no") set, i.e. everybody, the question will be made, once answered it
> will not be made again.
>
> > > Another idea, more general, is to be able to press some hotkey that
> > > would toggle available sort keys, like we have, for instance, 'k' to
> > > toggle the 'Shared object' column in the default 'perf report/top'
> > > TUI while filtering out non-kernel samples.
> > >
> > > But my initial feeling is that you're right and we want this wallclock
> > > column on by default, its just a matter of finding a convenient/easy way
> > > for users to opt-in.
> > >
> > > Being able to quickly toggle ordering by the Overhead/Wallclock columns
> > > also looks useful.
> >
> > I've tried to do something similar, e.g. filtering by parallelism
> > level on-the-fly, thus this patch:
> > https://lore.kernel.org/linux-perf-users/CACT4Y+b8X7PnUwQtuU2QXSqumNDbN8pWDm8EX+wnvgNAKbW0xw@mail.gmail.com/T/#t
> > But it turned out the reload functionality is more broken than working.
> >
> > I agree it would be nice to have these usability improvements.
>
> Right, and I'm not talking about those as a requirement for your patch
> to get accepted.
>
> > But this is also a large change, so I would prefer to merge the basic
> > functionality, and then we can do some improvements on top. This is
> > also the first time I am touching perf code, there are too many new
> > things to learn for me.
>
> I'm happy you made the effort and will try to help you as much as I can.
>
> I agree that, with the patch below, that I'll try to split and merge
> before travelling, we'll have a good starting point.
>
> Here is the patch, I'll split the perf_config__set_variable into a
> separate patch and merge, so that you can continue from there. The
> "<LONGER EXPLANATION>" part should tell the user that 'perf config
> report.workload=yes|no' can be done later, to switch the mode, if one
> doesn't want to continue with it as default.
>
> Thanks,
>
> - Arnaldo
>
> diff --git a/tools/perf/builtin-config.c b/tools/perf/builtin-config.c
> index 2e8363778935e8d5..45b5312fbe8370e8 100644
> --- a/tools/perf/builtin-config.c
> +++ b/tools/perf/builtin-config.c
> @@ -154,6 +154,44 @@ static int parse_config_arg(char *arg, char **var, char **value)
>         return 0;
>  }
>
> +int perf_config__set_variable(const char *var, const char *value)
> +{
> +       char path[PATH_MAX];
> +       char *user_config = mkpath(path, sizeof(path), "%s/.perfconfig", getenv("HOME"));
> +       const char *config_filename;
> +       struct perf_config_set *set;
> +       int ret = -1;
> +
> +       if (use_system_config)
> +               config_exclusive_filename = perf_etc_perfconfig();
> +       else if (use_user_config)
> +               config_exclusive_filename = user_config;
> +
> +       if (!config_exclusive_filename)
> +               config_filename = user_config;
> +       else
> +               config_filename = config_exclusive_filename;
> +
> +       set = perf_config_set__new();
> +       if (!set)
> +               goto out_err;
> +
> +       if (perf_config_set__collect(set, config_filename, var, value) < 0) {
> +               pr_err("Failed to add '%s=%s'\n", var, value);
> +               goto out_err;
> +       }
> +
> +       if (set_config(set, config_filename) < 0) {
> +               pr_err("Failed to set the configs on %s\n", config_filename);
> +               goto out_err;
> +       }
> +
> +       ret = 0;
> +out_err:
> +       perf_config_set__delete(set);
> +       return ret;
> +}
> +
>  int cmd_config(int argc, const char **argv)
>  {
>         int i, ret = -1;
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index f5fbd670d619a32a..ccd6eef8ece6e238 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -50,6 +50,7 @@
>  #include "util/units.h"
>  #include "util/util.h" // perf_tip()
>  #include "ui/ui.h"
> +#include "ui/util.h"
>  #include "ui/progress.h"
>  #include "util/block-info.h"
>
> @@ -99,6 +100,8 @@ struct report {
>         bool                    disable_order;
>         bool                    skip_empty;
>         bool                    data_type;
> +       bool                    wallclock_config_set;
> +       bool                    use_wallclock;
>         int                     max_stack;
>         struct perf_read_values show_threads_values;
>         const char              *pretty_printing_style;
> @@ -151,6 +154,11 @@ static int report__config(const char *var, const char *value, void *cb)
>                 }
>                 return 0;
>         }
> +       if (!strcmp(var, "report.wallclock")) {
> +               rep->use_wallclock = perf_config_bool(var, value);
> +               rep->wallclock_config_set = true;
> +               return 0;
> +       }
>
>         if (!strcmp(var, "report.skip-empty")) {
>                 rep->skip_empty = perf_config_bool(var, value);
> @@ -1089,6 +1097,15 @@ static int __cmd_report(struct report *rep)
>
>         report__warn_kptr_restrict(rep);
>
> +       if (!rep->wallclock_config_set) {
> +               if (ui__dialog_yesno("Do you want to use wallclock/latency mode?\n"
> +                                    "<LONGER EXPLANATION>\n")) {
> +                       rep->use_wallclock = true;
> +                       // do other prep to add the "wallclock" key to the sort order, etc
> +                       perf_config__set_variable("report.wallclock", "yes");
> +               }
> +       }
> +
>         evlist__for_each_entry(session->evlist, pos)
>                 rep->nr_entries += evsel__hists(pos)->nr_entries;
>
> diff --git a/tools/perf/util/config.h b/tools/perf/util/config.h
> index 9971313d61c1e5c6..a727c95cb1195e06 100644
> --- a/tools/perf/util/config.h
> +++ b/tools/perf/util/config.h
> @@ -50,6 +50,7 @@ int perf_config_set__collect(struct perf_config_set *set, const char *file_name,
>                              const char *var, const char *value);
>  void perf_config__exit(void);
>  void perf_config__refresh(void);
> +int perf_config__set_variable(const char *var, const char *value);
>
>  /**
>   * perf_config_sections__for_each - iterate thru all the sections

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-15  0:30           ` Namhyung Kim
@ 2025-01-19 10:50             ` Dmitry Vyukov
  2025-01-24 10:46               ` Dmitry Vyukov
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-19 10:50 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: irogers, linux-perf-users, linux-kernel, eranian

On Wed, 15 Jan 2025 at 01:30, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > > There are two notions of time: wall-clock time and CPU time.
> > > > For a single-threaded program, or a program running on a single-core
> > > > machine, these notions are the same. However, for a multi-threaded/
> > > > multi-process program running on a multi-core machine, these notions are
> > > > significantly different. Each second of wall-clock time we have
> > > > number-of-cores seconds of CPU time.
> > > >
> > > > Currently perf only allows to profile CPU time. Perf (and all other
> > > > existing profilers to the be best of my knowledge) does not allow to
> > > > profile wall-clock time.
> > > >
> > > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > > These profiles are complementary and are not interchangeable.
> > > > Examples of where wall-clock profile is needed:
> > > >  - optimzing build latency
> > > >  - optimizing server request latency
> > > >  - optimizing ML training/inference latency
> > > >  - optimizing running time of any command line program
> > > >
> > > > CPU profile is useless for these use cases at best (if a user understands
> > > > the difference), or misleading at worst (if a user tries to use a wrong
> > > > profile for a job).
> > > >
> > > > This patch adds wall-clock and parallelization profiling.
> > > > See the added documentation and flags descriptions for details.
> > > >
> > > > Brief outline of the implementation:
> > > >  - add context switch collection during record
> > > >  - calculate number of threads running on CPUs (parallelism level)
> > > >    during report
> > > >  - divide each sample weight by the parallelism level
> > > > This effectively models that we were taking 1 sample per unit of
> > > > wall-clock time.
> > >
> > > Thanks for working on this, very interesting!
> > >
> > > But I guess this implementation depends on cpu-cycles event and single
> > > target process.
> >
> > Not necessarily, it works the same for PMU events, e.g. again from perf make:
>
> Of course it would work with other events.  But the thing is how to
> interprete the result IMHO.
>
> >
> > Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> > Overhead  Wallclock  Command          Symbol
> >    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
> >    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
> >    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
> >    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
> >    1.47%      0.09%  cc1              [k] __d_lookup_rcu
> >    1.38%      0.55%  sh               [k] __percpu_counter_sum
> >    1.22%      0.11%  cc1              [k] __audit_filter_op
> >    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
> >    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
> >
> > And also it works well for multi-process targets, again most examples
> > in the patch are from kernel make.
>
> It's the single-origin case.  I'm curious about the multiple-origin
> cases like profiling two or more existing processes together.  How do
> you determine the initial parallelism?

What is the "multiple-origin case"? How is it different from multiple processes?

> Also I think the parallelism is defined in a process (group).  Does it
> still make sense in system-wide profiling?

I've tried to answer this here:

> It can work as is, if the system activity includes mostly a single
> task, e.g. you run something on command line, it's taking too long, so
> you decide to do perf record -a in parallel to see what it's doing.
>
> For system-wide profiling we do Google (GWP), it may need some
> additional logic to make it useful. Namely, the system-wide
> parallelism is not useful in multi-task contexts. But we can extend
> report to allow us to calculate parallelism/wall-clock overhead for a
> single multi-threaded process only. Or, if we auto-matically
> post-process system-wide profile, then we can split the profile
> per-process and calculate parallelism for each of them.

Additionally we could add a process filter that will calculate
parallelism only with the given set of processes (specified by PIDs or
comm names). I.e. one profiled the whole system, but then says "I am
really interested only in processes X and Y".



> > > Do you think if it'd work for system-wide profiling?
> >
> > Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> > instead of PERF_RECORD_SWITCH which this patch handles.
> >
> > It can be made to work, I just didn't want to complicate this patch more.
>
> Ok.
>
> > It can work as is, if the system activity includes mostly a single
> > task, e.g. you run something on command line, it's taking too long, so
> > you decide to do perf record -a in parallel to see what it's doing.
> >
> > For system-wide profiling we do Google (GWP), it may need some
> > additional logic to make it useful. Namely, the system-wide
> > parallelism is not useful in multi-task contexts. But we can extend
> > report to allow us to calculate parallelism/wall-clock overhead for a
> > single multi-threaded process only. Or, if we auto-matically
> > post-process system-wide profile, then we can split the profile
> > per-process and calculate parallelism for each of them.
> >
> > I am also not sure if we use perf report for GWP, or do manual trace
> > parsing. When/if this part is merged, we can think of the system-wide
> > story.
> >
> > TODO: don't enable context-switch recording for -a mode.
> >
> >
> > > How do you define wall-clock overhead if the event counts something
> > > different (like the number of L3 cache-misses)?
> >
> > It's the same. We calculate CPU overhead for these. Wallclock overhead
> > is the same, it's just the weights are adjusted proportionally.
>
> I got this part but ...
>
> >
> > Physical meaning is similar:
> > CPU overhead: how much L3 misses affect throughput/cpu consumption
> > Wallclock overhead: how much L3 misses affect latency
>
> I'm not sure if it's correct.  The event would tell where you get the L3
> miss mostly.  It doesn't necessarily mean latency.  IOW more L3 miss
> doesn't mean more CPU time.  I guess that's why perf doesn't say it as
> "CPU overhead".

Yes, I agree, it does not directly translate to increased CPU
consumption or latency.
But the point is: whatever was the effect on CPU consumption, it's
rescaled to latency.


> Hmm.. maybe you could rename it differently?

You mean the "perf report" help page, right?

Yes, strictly speaking it's not "CPU overhead" and "Wall-clock
overhead", but rather "Overhead in CPU time" and "Overhead in
wall-clock time". Does it look better to you?


> > > Also I'm not sure about the impact of context switch events which could
> > > generate a lot of records that may end up with losing some of them.  And
> > > in that case the parallelism tracking would break.
> >
> > I've counted SAMPLE/SWITCH events on perf make:
> >
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > 813829
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > 46249
> >
> > And this is for:
> > perf record -e L1-dcache-load-misses,dTLB-load-misses make
> >
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > 1138615
> > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > 115433
> >
> > Doesn't pretend to be scientifically accurate, but gives some idea
> > about proportion.
> > So for the number of samples that's 5% and 10% respectively.
>
> I think it depends on the workload.  In a very loaded system with short
> timeslice or something, you could get a lot more sched switches.  In
> your workload, `make` may have a long single-threaded portion.

Unfortunately this is likely to be true.

> > SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> > ring size it's 3-6%.
> >
> > FWIW I've also considered and started implementing a different
> > approach where the kernel would count parallelism level for each
> > context and write it out with samples:
> > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > Then sched_in/out would need to do atomic inc/dec on a global counter.
>
> It seems you counted parallelism in an event (including inherited ones).
> But usually perf tool opens a separate FD on each CPU for an event (like
> cycles), then you need to adjust the value across all CPUS.  I'm not
> sure if it's doable in this approach.

Right, this is probably when I gave up on that approach -- when I
observed the counter has a value of 1 at most.

> > Not sure how hard it is to make all corner cases work there, I dropped
> > it half way b/c the perf record post-processing looked like a better
> > approach.
>
> Probably.
>
> >
> > Lost SWITCH events may be a problem. Perf report already detects and
> > prints a warning about lost events. Should we stop reporting
> > parallelism/wall-clock if any events are lost? I just don't see it in
> > practice.
>
> It happens quite often when there are many activities.  Probably we can
> consider stopping when it detects some lost records.
>
>
> > Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.
>
> It'd be nice, but given the rate of sched switch I guess it'd most
> likely contain some.

What's an easy way to get lost events? Normally on my machine I never see them.
Is there some parameter to make the buffer very small, or something?


> > > > The feature is added on an equal footing with the existing CPU profiling
> > > > rather than a separate mode enabled with special flags. The reasoning is
> > > > that users may not understand the problem and the meaning of numbers they
> > > > are seeing in the first place, so won't even realize that they may need
> > > > to be looking for some different profiling mode. When they are presented
> > > > with 2 sets of different numbers, they should start asking questions.
> > >
> > > I understand your point but I think it has some limitation so maybe it's
> > > better to put in a separate mode with special flags.
> >
> > I hate all options.
> >
> > The 3rd option is to add a _mandary_ flag to perf record to ask for
> > profiling mode. Then the old "perf record" will say "nope, you need to
> > choose, here are your options and explanation of the options".
> >
> > If we add it as a new separate mode, I fear it's doomed to be
> > unknown/unused and people will continue to not realize what they are
> > profiling, meaning of numbers, and continue to use wrong profiles for
> > the job.
>
> I think people will find it out if it really works well! :)

Yes, but it may take 10 years for the word to spread :)
I am afraid the main problem is that people don't know they need to be
looking for something. We had 20 years of widespread multi-CPU/core
systems and no profiler supporting latency profiling.

But as I said, I am open to options (at least I will know it's there
for my purposes :)).


> > I consider latency/wall-clock profiling as useful and as fundamental
> > as the current bandwidth profiling for parallel workloads (more or
> > less all workloads today). Yet we are in 2025 and no profilers out
> > there support latency profiling. Which suggests to me there is a
> > fundamental lack of understanding of the matter, so new optional mode
> > may not work well.
>
> I'm open to make it default if we can make sure it works well and have a
> clear interpretation of the result.
>
> >
> > If you are interested for context, I used a similar mode (for pprof at
> > the time) to optimize TensorFlow latency (people mostly care about
> > latency). All existing profilers said it's 99% highly optimized and
> > parallelized matrix multiplication, we are done here. Yet, the
> > wallclock profile said, nope, it's 45% Python interpreter, which
> > resulted in this fix:
> > https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> > Full story and examples of these profiles are here:
> > https://github.com/dvyukov/perf-load
>
> Great examples!  I'll take a look later.
>
> Thanks,
> Namhyung
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Off-CPU sampling (was perf report: Add wall-clock and parallelism profiling)
  2025-01-16 18:55               ` Namhyung Kim
@ 2025-01-19 11:08                 ` Dmitry Vyukov
  2025-01-23 23:34                   ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-19 11:08 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ian Rogers, linux-perf-users, LKML, Stephane Eranian, Ingo Molnar,
	Peter Zijlstra, chu howard

On Thu, 16 Jan 2025 at 19:55, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Jan 15, 2025 at 08:11:51AM +0100, Dmitry Vyukov wrote:
> > On Wed, 15 Jan 2025 at 06:59, Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > [snip]
> > > > FWIW I've also considered and started implementing a different
> > > > approach where the kernel would count parallelism level for each
> > > > context and write it out with samples:
> > > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > > > Not sure how hard it is to make all corner cases work there, I dropped
> > > > it half way b/c the perf record post-processing looked like a better
> > > > approach.
> > >
> > > Nice. Just to focus on this point and go off on something of a
> > > tangent. I worry a little about perf_event_sample_format where we've
> > > used 25 out of the 64 bits of sample_type. Perhaps there will be a
> > > sample_type2 in the future. For the code and data page size it seems
> > > the same information could come from mmap events. You have a similar
> > > issue. I was thinking of another similar issue, adding information
> > > about the number of dirty pages in a VMA. I wonder if there is a
> > > better way to organize these things, rather than just keep using up
> > > bits in the perf_event_sample_format. For example, we could have a
> > > code page size software event that when in a leader sampling group
> > > with a hardware event with a sample IP provides the code page size
> > > information of the leader event's sample IP. We have loads of space in
> > > the types and config values to have an endless number of such events
> > > and maybe the value could be generated by a BPF program for yet more
> > > flexibility. What these events would mean without a leader sample
> > > event I'm not sure.
> >
> > In the end I did not go with adding parallelism to each sample (this
> > is purely perf report change), so at least for this patch this is very
> > tangential :)
> >
> > > Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> > > sampling. Wall clock time being off CPU plus on CPU. We need to do
> > > something to move forward the default flags/options for perf record,
> > > for example, we don't enable build ID mmap events by default causing
> > > the whole perf.data file to be scanned looking to add build ID events
> > > for the dsos with samples in them. One option that could be a default
> > > could be off-CPU profiling, and when permissions deny the BPF approach
> > > we can fallback on using events. If these events are there by default
> > > then it makes sense to hook them up in perf report.
> >
> > Interesting. Do you mean "IO" by "off-CPU".
> > Yes, if a program was blocked for IO for 10 seconds (no CPU work),
> > then that obviously contributes to latency, but won't be in this
> > profile. Though, it still works well for a large number of important
> > cases (e.g. builds, ML inference, server request handling are
> > frequently not IO bound).
> >
> > I was thinking how IO can be accounted for in the wall-clock profile.
> > Since we have SWITCH OUT events (and they already include preemption
> > bit), we do have info to account for blocked threads. But it gets
> > somewhat complex and has to make some hypotheses b/c not all blocked
> > threads contribute to latency (e.g. blocked watchdog thread). So I
> > left it out for now.
> >
> > The idea was as follows.
> > We know a set of threads blocked at any point in time (switched out,
> > but not preempted).
> > We hypothesise that each of these could equally improve CPU load to
> > max if/when unblocked.
> > We inject synthetic samples in a leaf "IO wait" symbol with the weight
> > according to the hypothesis. I think these events can be  injected
> > before each switch in event only (which changes the set of blocked
> > threads).
> >
> > Namely:
> > If CPU load is already at max (parallelism == num cpus), we don't
> > inject any IO wait events.
> > If the number of blocked threads is 0, we don't inject any IO wait events.
> > If there are C idle CPUs and B blocked threads, we inject IO wait
> > events with weight C/B for each of them.
>
> To track idle CPUs, you need sched-switch of all CPUs regardless of your
> workload, right?  Also I'm not sure when do you want to inject the IO
> wait events - when a thread is sched-out without preemption?  And what
> would be the weight?  I guess you want something like:
>
>   blocked time * C / B
>
> Then C and B can change before the thread is woken up.

Yes, these events need to be emitted on every switch-in/out in the
trace so that a long blocked thread gets multiple events with
different weights.

> > For example, if there is a single blocked thread, then we hypothesise
> > that this blocked thread is the root cause of all currently idle CPUs
> > being idle.
>
> I think this may make sense when you target a single process group but
> it also needs system-wide idle information.

I assumed this profiling is done on a mostly idle system (generally
it's a good idea for any profiling).

Theoretically, we could look at runnable threads rather than running.
If there are NumCPU runnable threads, then creating more runnable
threads won't help.

> > This still has a problem of saying that unrelated threads contribute
> > to latency, but at least it's a simple/explainable model and it should
> > show guilty threads as well. Maybe unrelated threads can be filtered
> > by the user by specifying a set of symbols in stacks of unrelated
> > threads.
>
> Or by task name.
>
> >
> > Does it make any sense? Do you have anything better?
>
> I'm not sure if it's right to use idle state which will be affected by
> unrelated processes.  Maybe it's good for system-wide profiling.
>
> For a process (group) profiliing, I think you need to consider number of
> total threads, active threads, and CPUs.  And if the #active-threads is
> less than min(#total-threads, #CPUs), then it could be considered as
> idle from the workload's perspective.
>
> What do you think?

I don't know, hard to say. I see what you mean, but this makes the
problem even harder, and potentially breaking hypotheses we are
making.
For example, if we have 2 unrelated workloads A and B running on the
machine. Their high- and low-parallelism phases will overlap randomly,
and we will make conclusions from that, but these overlapping are
really random and may not hold next time. Or next time A may be
collocated with C.

I would solve the simpler problem of profiling a single workload on a
mostly idle system first, and only then move to the harder case. Are
you considering this for GWP-type profiling?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Off-CPU sampling (was perf report: Add wall-clock and parallelism profiling)
  2025-01-19 11:08                 ` Off-CPU sampling (was perf report: Add wall-clock and parallelism profiling) Dmitry Vyukov
@ 2025-01-23 23:34                   ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-01-23 23:34 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Ian Rogers, linux-perf-users, LKML, Stephane Eranian, Ingo Molnar,
	Peter Zijlstra, chu howard

On Sun, Jan 19, 2025 at 12:08:36PM +0100, Dmitry Vyukov wrote:
> On Thu, 16 Jan 2025 at 19:55, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Wed, Jan 15, 2025 at 08:11:51AM +0100, Dmitry Vyukov wrote:
> > > On Wed, 15 Jan 2025 at 06:59, Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > > [snip]
> > > > > FWIW I've also considered and started implementing a different
> > > > > approach where the kernel would count parallelism level for each
> > > > > context and write it out with samples:
> > > > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > > > > Not sure how hard it is to make all corner cases work there, I dropped
> > > > > it half way b/c the perf record post-processing looked like a better
> > > > > approach.
> > > >
> > > > Nice. Just to focus on this point and go off on something of a
> > > > tangent. I worry a little about perf_event_sample_format where we've
> > > > used 25 out of the 64 bits of sample_type. Perhaps there will be a
> > > > sample_type2 in the future. For the code and data page size it seems
> > > > the same information could come from mmap events. You have a similar
> > > > issue. I was thinking of another similar issue, adding information
> > > > about the number of dirty pages in a VMA. I wonder if there is a
> > > > better way to organize these things, rather than just keep using up
> > > > bits in the perf_event_sample_format. For example, we could have a
> > > > code page size software event that when in a leader sampling group
> > > > with a hardware event with a sample IP provides the code page size
> > > > information of the leader event's sample IP. We have loads of space in
> > > > the types and config values to have an endless number of such events
> > > > and maybe the value could be generated by a BPF program for yet more
> > > > flexibility. What these events would mean without a leader sample
> > > > event I'm not sure.
> > >
> > > In the end I did not go with adding parallelism to each sample (this
> > > is purely perf report change), so at least for this patch this is very
> > > tangential :)
> > >
> > > > Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> > > > sampling. Wall clock time being off CPU plus on CPU. We need to do
> > > > something to move forward the default flags/options for perf record,
> > > > for example, we don't enable build ID mmap events by default causing
> > > > the whole perf.data file to be scanned looking to add build ID events
> > > > for the dsos with samples in them. One option that could be a default
> > > > could be off-CPU profiling, and when permissions deny the BPF approach
> > > > we can fallback on using events. If these events are there by default
> > > > then it makes sense to hook them up in perf report.
> > >
> > > Interesting. Do you mean "IO" by "off-CPU".
> > > Yes, if a program was blocked for IO for 10 seconds (no CPU work),
> > > then that obviously contributes to latency, but won't be in this
> > > profile. Though, it still works well for a large number of important
> > > cases (e.g. builds, ML inference, server request handling are
> > > frequently not IO bound).
> > >
> > > I was thinking how IO can be accounted for in the wall-clock profile.
> > > Since we have SWITCH OUT events (and they already include preemption
> > > bit), we do have info to account for blocked threads. But it gets
> > > somewhat complex and has to make some hypotheses b/c not all blocked
> > > threads contribute to latency (e.g. blocked watchdog thread). So I
> > > left it out for now.
> > >
> > > The idea was as follows.
> > > We know a set of threads blocked at any point in time (switched out,
> > > but not preempted).
> > > We hypothesise that each of these could equally improve CPU load to
> > > max if/when unblocked.
> > > We inject synthetic samples in a leaf "IO wait" symbol with the weight
> > > according to the hypothesis. I think these events can be  injected
> > > before each switch in event only (which changes the set of blocked
> > > threads).
> > >
> > > Namely:
> > > If CPU load is already at max (parallelism == num cpus), we don't
> > > inject any IO wait events.
> > > If the number of blocked threads is 0, we don't inject any IO wait events.
> > > If there are C idle CPUs and B blocked threads, we inject IO wait
> > > events with weight C/B for each of them.
> >
> > To track idle CPUs, you need sched-switch of all CPUs regardless of your
> > workload, right?  Also I'm not sure when do you want to inject the IO
> > wait events - when a thread is sched-out without preemption?  And what
> > would be the weight?  I guess you want something like:
> >
> >   blocked time * C / B
> >
> > Then C and B can change before the thread is woken up.
> 
> Yes, these events need to be emitted on every switch-in/out in the
> trace so that a long blocked thread gets multiple events with
> different weights.

Ok.

> 
> > > For example, if there is a single blocked thread, then we hypothesise
> > > that this blocked thread is the root cause of all currently idle CPUs
> > > being idle.
> >
> > I think this may make sense when you target a single process group but
> > it also needs system-wide idle information.
> 
> I assumed this profiling is done on a mostly idle system (generally
> it's a good idea for any profiling).
> 
> Theoretically, we could look at runnable threads rather than running.
> If there are NumCPU runnable threads, then creating more runnable
> threads won't help.

But it'd need to look at the state of the previous (sched-out) task in
sched_switch event which is a lot bigger.

> 
> > > This still has a problem of saying that unrelated threads contribute
> > > to latency, but at least it's a simple/explainable model and it should
> > > show guilty threads as well. Maybe unrelated threads can be filtered
> > > by the user by specifying a set of symbols in stacks of unrelated
> > > threads.
> >
> > Or by task name.
> >
> > >
> > > Does it make any sense? Do you have anything better?
> >
> > I'm not sure if it's right to use idle state which will be affected by
> > unrelated processes.  Maybe it's good for system-wide profiling.
> >
> > For a process (group) profiliing, I think you need to consider number of
> > total threads, active threads, and CPUs.  And if the #active-threads is
> > less than min(#total-threads, #CPUs), then it could be considered as
> > idle from the workload's perspective.
> >
> > What do you think?
> 
> I don't know, hard to say. I see what you mean, but this makes the
> problem even harder, and potentially breaking hypotheses we are
> making.
> For example, if we have 2 unrelated workloads A and B running on the
> machine. Their high- and low-parallelism phases will overlap randomly,
> and we will make conclusions from that, but these overlapping are
> really random and may not hold next time. Or next time A may be
> collocated with C.

Hmm.. but isn't it the same when you use idle state?  CPUs can go idle
randomly because of other workload IMHO.

> 
> I would solve the simpler problem of profiling a single workload on a
> mostly idle system first, and only then move to the harder case.

I agree with you to start with the simpler one.  I need to check the
code how you checked the idle state.


> Are you considering this for GWP-type profiling?

No, I'm not (for now).

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-19 10:50             ` Dmitry Vyukov
@ 2025-01-24 10:46               ` Dmitry Vyukov
  2025-01-24 17:56                 ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-24 10:46 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: irogers, linux-perf-users, linux-kernel, eranian

What are the next steps for this patch?

I don't have any action items on my side. Am I missing any? Ready to be merged?



On Sun, 19 Jan 2025 at 11:50, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Wed, 15 Jan 2025 at 01:30, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > > > There are two notions of time: wall-clock time and CPU time.
> > > > > For a single-threaded program, or a program running on a single-core
> > > > > machine, these notions are the same. However, for a multi-threaded/
> > > > > multi-process program running on a multi-core machine, these notions are
> > > > > significantly different. Each second of wall-clock time we have
> > > > > number-of-cores seconds of CPU time.
> > > > >
> > > > > Currently perf only allows to profile CPU time. Perf (and all other
> > > > > existing profilers to the be best of my knowledge) does not allow to
> > > > > profile wall-clock time.
> > > > >
> > > > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > > > These profiles are complementary and are not interchangeable.
> > > > > Examples of where wall-clock profile is needed:
> > > > >  - optimzing build latency
> > > > >  - optimizing server request latency
> > > > >  - optimizing ML training/inference latency
> > > > >  - optimizing running time of any command line program
> > > > >
> > > > > CPU profile is useless for these use cases at best (if a user understands
> > > > > the difference), or misleading at worst (if a user tries to use a wrong
> > > > > profile for a job).
> > > > >
> > > > > This patch adds wall-clock and parallelization profiling.
> > > > > See the added documentation and flags descriptions for details.
> > > > >
> > > > > Brief outline of the implementation:
> > > > >  - add context switch collection during record
> > > > >  - calculate number of threads running on CPUs (parallelism level)
> > > > >    during report
> > > > >  - divide each sample weight by the parallelism level
> > > > > This effectively models that we were taking 1 sample per unit of
> > > > > wall-clock time.
> > > >
> > > > Thanks for working on this, very interesting!
> > > >
> > > > But I guess this implementation depends on cpu-cycles event and single
> > > > target process.
> > >
> > > Not necessarily, it works the same for PMU events, e.g. again from perf make:
> >
> > Of course it would work with other events.  But the thing is how to
> > interprete the result IMHO.
> >
> > >
> > > Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> > > Overhead  Wallclock  Command          Symbol
> > >    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
> > >    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
> > >    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
> > >    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
> > >    1.47%      0.09%  cc1              [k] __d_lookup_rcu
> > >    1.38%      0.55%  sh               [k] __percpu_counter_sum
> > >    1.22%      0.11%  cc1              [k] __audit_filter_op
> > >    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
> > >    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
> > >
> > > And also it works well for multi-process targets, again most examples
> > > in the patch are from kernel make.
> >
> > It's the single-origin case.  I'm curious about the multiple-origin
> > cases like profiling two or more existing processes together.  How do
> > you determine the initial parallelism?
>
> What is the "multiple-origin case"? How is it different from multiple processes?
>
> > Also I think the parallelism is defined in a process (group).  Does it
> > still make sense in system-wide profiling?
>
> I've tried to answer this here:
>
> > It can work as is, if the system activity includes mostly a single
> > task, e.g. you run something on command line, it's taking too long, so
> > you decide to do perf record -a in parallel to see what it's doing.
> >
> > For system-wide profiling we do Google (GWP), it may need some
> > additional logic to make it useful. Namely, the system-wide
> > parallelism is not useful in multi-task contexts. But we can extend
> > report to allow us to calculate parallelism/wall-clock overhead for a
> > single multi-threaded process only. Or, if we auto-matically
> > post-process system-wide profile, then we can split the profile
> > per-process and calculate parallelism for each of them.
>
> Additionally we could add a process filter that will calculate
> parallelism only with the given set of processes (specified by PIDs or
> comm names). I.e. one profiled the whole system, but then says "I am
> really interested only in processes X and Y".
>
>
>
> > > > Do you think if it'd work for system-wide profiling?
> > >
> > > Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> > > instead of PERF_RECORD_SWITCH which this patch handles.
> > >
> > > It can be made to work, I just didn't want to complicate this patch more.
> >
> > Ok.
> >
> > > It can work as is, if the system activity includes mostly a single
> > > task, e.g. you run something on command line, it's taking too long, so
> > > you decide to do perf record -a in parallel to see what it's doing.
> > >
> > > For system-wide profiling we do Google (GWP), it may need some
> > > additional logic to make it useful. Namely, the system-wide
> > > parallelism is not useful in multi-task contexts. But we can extend
> > > report to allow us to calculate parallelism/wall-clock overhead for a
> > > single multi-threaded process only. Or, if we auto-matically
> > > post-process system-wide profile, then we can split the profile
> > > per-process and calculate parallelism for each of them.
> > >
> > > I am also not sure if we use perf report for GWP, or do manual trace
> > > parsing. When/if this part is merged, we can think of the system-wide
> > > story.
> > >
> > > TODO: don't enable context-switch recording for -a mode.
> > >
> > >
> > > > How do you define wall-clock overhead if the event counts something
> > > > different (like the number of L3 cache-misses)?
> > >
> > > It's the same. We calculate CPU overhead for these. Wallclock overhead
> > > is the same, it's just the weights are adjusted proportionally.
> >
> > I got this part but ...
> >
> > >
> > > Physical meaning is similar:
> > > CPU overhead: how much L3 misses affect throughput/cpu consumption
> > > Wallclock overhead: how much L3 misses affect latency
> >
> > I'm not sure if it's correct.  The event would tell where you get the L3
> > miss mostly.  It doesn't necessarily mean latency.  IOW more L3 miss
> > doesn't mean more CPU time.  I guess that's why perf doesn't say it as
> > "CPU overhead".
>
> Yes, I agree, it does not directly translate to increased CPU
> consumption or latency.
> But the point is: whatever was the effect on CPU consumption, it's
> rescaled to latency.
>
>
> > Hmm.. maybe you could rename it differently?
>
> You mean the "perf report" help page, right?
>
> Yes, strictly speaking it's not "CPU overhead" and "Wall-clock
> overhead", but rather "Overhead in CPU time" and "Overhead in
> wall-clock time". Does it look better to you?
>
>
> > > > Also I'm not sure about the impact of context switch events which could
> > > > generate a lot of records that may end up with losing some of them.  And
> > > > in that case the parallelism tracking would break.
> > >
> > > I've counted SAMPLE/SWITCH events on perf make:
> > >
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > > 813829
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > > 46249
> > >
> > > And this is for:
> > > perf record -e L1-dcache-load-misses,dTLB-load-misses make
> > >
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > > 1138615
> > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > > 115433
> > >
> > > Doesn't pretend to be scientifically accurate, but gives some idea
> > > about proportion.
> > > So for the number of samples that's 5% and 10% respectively.
> >
> > I think it depends on the workload.  In a very loaded system with short
> > timeslice or something, you could get a lot more sched switches.  In
> > your workload, `make` may have a long single-threaded portion.
>
> Unfortunately this is likely to be true.
>
> > > SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> > > ring size it's 3-6%.
> > >
> > > FWIW I've also considered and started implementing a different
> > > approach where the kernel would count parallelism level for each
> > > context and write it out with samples:
> > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> >
> > It seems you counted parallelism in an event (including inherited ones).
> > But usually perf tool opens a separate FD on each CPU for an event (like
> > cycles), then you need to adjust the value across all CPUS.  I'm not
> > sure if it's doable in this approach.
>
> Right, this is probably when I gave up on that approach -- when I
> observed the counter has a value of 1 at most.
>
> > > Not sure how hard it is to make all corner cases work there, I dropped
> > > it half way b/c the perf record post-processing looked like a better
> > > approach.
> >
> > Probably.
> >
> > >
> > > Lost SWITCH events may be a problem. Perf report already detects and
> > > prints a warning about lost events. Should we stop reporting
> > > parallelism/wall-clock if any events are lost? I just don't see it in
> > > practice.
> >
> > It happens quite often when there are many activities.  Probably we can
> > consider stopping when it detects some lost records.
> >
> >
> > > Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.
> >
> > It'd be nice, but given the rate of sched switch I guess it'd most
> > likely contain some.
>
> What's an easy way to get lost events? Normally on my machine I never see them.
> Is there some parameter to make the buffer very small, or something?
>
>
> > > > > The feature is added on an equal footing with the existing CPU profiling
> > > > > rather than a separate mode enabled with special flags. The reasoning is
> > > > > that users may not understand the problem and the meaning of numbers they
> > > > > are seeing in the first place, so won't even realize that they may need
> > > > > to be looking for some different profiling mode. When they are presented
> > > > > with 2 sets of different numbers, they should start asking questions.
> > > >
> > > > I understand your point but I think it has some limitation so maybe it's
> > > > better to put in a separate mode with special flags.
> > >
> > > I hate all options.
> > >
> > > The 3rd option is to add a _mandary_ flag to perf record to ask for
> > > profiling mode. Then the old "perf record" will say "nope, you need to
> > > choose, here are your options and explanation of the options".
> > >
> > > If we add it as a new separate mode, I fear it's doomed to be
> > > unknown/unused and people will continue to not realize what they are
> > > profiling, meaning of numbers, and continue to use wrong profiles for
> > > the job.
> >
> > I think people will find it out if it really works well! :)
>
> Yes, but it may take 10 years for the word to spread :)
> I am afraid the main problem is that people don't know they need to be
> looking for something. We had 20 years of widespread multi-CPU/core
> systems and no profiler supporting latency profiling.
>
> But as I said, I am open to options (at least I will know it's there
> for my purposes :)).
>
>
> > > I consider latency/wall-clock profiling as useful and as fundamental
> > > as the current bandwidth profiling for parallel workloads (more or
> > > less all workloads today). Yet we are in 2025 and no profilers out
> > > there support latency profiling. Which suggests to me there is a
> > > fundamental lack of understanding of the matter, so new optional mode
> > > may not work well.
> >
> > I'm open to make it default if we can make sure it works well and have a
> > clear interpretation of the result.
> >
> > >
> > > If you are interested for context, I used a similar mode (for pprof at
> > > the time) to optimize TensorFlow latency (people mostly care about
> > > latency). All existing profilers said it's 99% highly optimized and
> > > parallelized matrix multiplication, we are done here. Yet, the
> > > wallclock profile said, nope, it's 45% Python interpreter, which
> > > resulted in this fix:
> > > https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> > > Full story and examples of these profiles are here:
> > > https://github.com/dvyukov/perf-load
> >
> > Great examples!  I'll take a look later.
> >
> > Thanks,
> > Namhyung
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-24 10:46               ` Dmitry Vyukov
@ 2025-01-24 17:56                 ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-01-24 17:56 UTC (permalink / raw)
  To: Dmitry Vyukov; +Cc: irogers, linux-perf-users, linux-kernel, eranian

On Fri, Jan 24, 2025 at 11:46:01AM +0100, Dmitry Vyukov wrote:
> What are the next steps for this patch?
> 
> I don't have any action items on my side. Am I missing any? Ready to be merged?

I'll review the change today.

Thanks,
Namhyung

> 
> 
> 
> On Sun, 19 Jan 2025 at 11:50, Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > On Wed, 15 Jan 2025 at 01:30, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 09:26:57AM +0100, Dmitry Vyukov wrote:
> > > > On Tue, 14 Jan 2025 at 02:51, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > > > > > There are two notions of time: wall-clock time and CPU time.
> > > > > > For a single-threaded program, or a program running on a single-core
> > > > > > machine, these notions are the same. However, for a multi-threaded/
> > > > > > multi-process program running on a multi-core machine, these notions are
> > > > > > significantly different. Each second of wall-clock time we have
> > > > > > number-of-cores seconds of CPU time.
> > > > > >
> > > > > > Currently perf only allows to profile CPU time. Perf (and all other
> > > > > > existing profilers to the be best of my knowledge) does not allow to
> > > > > > profile wall-clock time.
> > > > > >
> > > > > > Optimizing CPU overhead is useful to improve 'throughput', while
> > > > > > optimizing wall-clock overhead is useful to improve 'latency'.
> > > > > > These profiles are complementary and are not interchangeable.
> > > > > > Examples of where wall-clock profile is needed:
> > > > > >  - optimzing build latency
> > > > > >  - optimizing server request latency
> > > > > >  - optimizing ML training/inference latency
> > > > > >  - optimizing running time of any command line program
> > > > > >
> > > > > > CPU profile is useless for these use cases at best (if a user understands
> > > > > > the difference), or misleading at worst (if a user tries to use a wrong
> > > > > > profile for a job).
> > > > > >
> > > > > > This patch adds wall-clock and parallelization profiling.
> > > > > > See the added documentation and flags descriptions for details.
> > > > > >
> > > > > > Brief outline of the implementation:
> > > > > >  - add context switch collection during record
> > > > > >  - calculate number of threads running on CPUs (parallelism level)
> > > > > >    during report
> > > > > >  - divide each sample weight by the parallelism level
> > > > > > This effectively models that we were taking 1 sample per unit of
> > > > > > wall-clock time.
> > > > >
> > > > > Thanks for working on this, very interesting!
> > > > >
> > > > > But I guess this implementation depends on cpu-cycles event and single
> > > > > target process.
> > > >
> > > > Not necessarily, it works the same for PMU events, e.g. again from perf make:
> > >
> > > Of course it would work with other events.  But the thing is how to
> > > interprete the result IMHO.
> > >
> > > >
> > > > Samples: 481K of event 'dTLB-load-misses', Event count (approx.): 169413440
> > > > Overhead  Wallclock  Command          Symbol
> > > >    5.34%     16.33%  ld               [.] bfd_elf_link_add_symbols
> > > >    4.14%     14.99%  ld               [.] bfd_link_hash_traverse
> > > >    3.53%     10.87%  ld               [.] __memmove_avx_unaligned_erms
> > > >    1.61%      0.05%  cc1              [k] native_queued_spin_lock_slowpath
> > > >    1.47%      0.09%  cc1              [k] __d_lookup_rcu
> > > >    1.38%      0.55%  sh               [k] __percpu_counter_sum
> > > >    1.22%      0.11%  cc1              [k] __audit_filter_op
> > > >    0.95%      0.08%  cc1              [k] audit_filter_rules.isra.0
> > > >    0.72%      0.47%  sh               [k] pcpu_alloc_noprof
> > > >
> > > > And also it works well for multi-process targets, again most examples
> > > > in the patch are from kernel make.
> > >
> > > It's the single-origin case.  I'm curious about the multiple-origin
> > > cases like profiling two or more existing processes together.  How do
> > > you determine the initial parallelism?
> >
> > What is the "multiple-origin case"? How is it different from multiple processes?
> >
> > > Also I think the parallelism is defined in a process (group).  Does it
> > > still make sense in system-wide profiling?
> >
> > I've tried to answer this here:
> >
> > > It can work as is, if the system activity includes mostly a single
> > > task, e.g. you run something on command line, it's taking too long, so
> > > you decide to do perf record -a in parallel to see what it's doing.
> > >
> > > For system-wide profiling we do Google (GWP), it may need some
> > > additional logic to make it useful. Namely, the system-wide
> > > parallelism is not useful in multi-task contexts. But we can extend
> > > report to allow us to calculate parallelism/wall-clock overhead for a
> > > single multi-threaded process only. Or, if we auto-matically
> > > post-process system-wide profile, then we can split the profile
> > > per-process and calculate parallelism for each of them.
> >
> > Additionally we could add a process filter that will calculate
> > parallelism only with the given set of processes (specified by PIDs or
> > comm names). I.e. one profiled the whole system, but then says "I am
> > really interested only in processes X and Y".
> >
> >
> >
> > > > > Do you think if it'd work for system-wide profiling?
> > > >
> > > > Currently it doesn't b/c system-wide emits PERF_RECORD_SWITCH_CPU_WIDE
> > > > instead of PERF_RECORD_SWITCH which this patch handles.
> > > >
> > > > It can be made to work, I just didn't want to complicate this patch more.
> > >
> > > Ok.
> > >
> > > > It can work as is, if the system activity includes mostly a single
> > > > task, e.g. you run something on command line, it's taking too long, so
> > > > you decide to do perf record -a in parallel to see what it's doing.
> > > >
> > > > For system-wide profiling we do Google (GWP), it may need some
> > > > additional logic to make it useful. Namely, the system-wide
> > > > parallelism is not useful in multi-task contexts. But we can extend
> > > > report to allow us to calculate parallelism/wall-clock overhead for a
> > > > single multi-threaded process only. Or, if we auto-matically
> > > > post-process system-wide profile, then we can split the profile
> > > > per-process and calculate parallelism for each of them.
> > > >
> > > > I am also not sure if we use perf report for GWP, or do manual trace
> > > > parsing. When/if this part is merged, we can think of the system-wide
> > > > story.
> > > >
> > > > TODO: don't enable context-switch recording for -a mode.
> > > >
> > > >
> > > > > How do you define wall-clock overhead if the event counts something
> > > > > different (like the number of L3 cache-misses)?
> > > >
> > > > It's the same. We calculate CPU overhead for these. Wallclock overhead
> > > > is the same, it's just the weights are adjusted proportionally.
> > >
> > > I got this part but ...
> > >
> > > >
> > > > Physical meaning is similar:
> > > > CPU overhead: how much L3 misses affect throughput/cpu consumption
> > > > Wallclock overhead: how much L3 misses affect latency
> > >
> > > I'm not sure if it's correct.  The event would tell where you get the L3
> > > miss mostly.  It doesn't necessarily mean latency.  IOW more L3 miss
> > > doesn't mean more CPU time.  I guess that's why perf doesn't say it as
> > > "CPU overhead".
> >
> > Yes, I agree, it does not directly translate to increased CPU
> > consumption or latency.
> > But the point is: whatever was the effect on CPU consumption, it's
> > rescaled to latency.
> >
> >
> > > Hmm.. maybe you could rename it differently?
> >
> > You mean the "perf report" help page, right?
> >
> > Yes, strictly speaking it's not "CPU overhead" and "Wall-clock
> > overhead", but rather "Overhead in CPU time" and "Overhead in
> > wall-clock time". Does it look better to you?
> >
> >
> > > > > Also I'm not sure about the impact of context switch events which could
> > > > > generate a lot of records that may end up with losing some of them.  And
> > > > > in that case the parallelism tracking would break.
> > > >
> > > > I've counted SAMPLE/SWITCH events on perf make:
> > > >
> > > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > > > 813829
> > > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > > > 46249
> > > >
> > > > And this is for:
> > > > perf record -e L1-dcache-load-misses,dTLB-load-misses make
> > > >
> > > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SAMPLE | wc -l
> > > > 1138615
> > > > perf script --dump-unsorted-raw-trace | grep PERF_RECORD_SWITCH | wc -l
> > > > 115433
> > > >
> > > > Doesn't pretend to be scientifically accurate, but gives some idea
> > > > about proportion.
> > > > So for the number of samples that's 5% and 10% respectively.
> > >
> > > I think it depends on the workload.  In a very loaded system with short
> > > timeslice or something, you could get a lot more sched switches.  In
> > > your workload, `make` may have a long single-threaded portion.
> >
> > Unfortunately this is likely to be true.
> >
> > > > SWITCH events are smaller (24 bytes vs 40 bytes for SAMPLE), so for
> > > > ring size it's 3-6%.
> > > >
> > > > FWIW I've also considered and started implementing a different
> > > > approach where the kernel would count parallelism level for each
> > > > context and write it out with samples:
> > > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > >
> > > It seems you counted parallelism in an event (including inherited ones).
> > > But usually perf tool opens a separate FD on each CPU for an event (like
> > > cycles), then you need to adjust the value across all CPUS.  I'm not
> > > sure if it's doable in this approach.
> >
> > Right, this is probably when I gave up on that approach -- when I
> > observed the counter has a value of 1 at most.
> >
> > > > Not sure how hard it is to make all corner cases work there, I dropped
> > > > it half way b/c the perf record post-processing looked like a better
> > > > approach.
> > >
> > > Probably.
> > >
> > > >
> > > > Lost SWITCH events may be a problem. Perf report already detects and
> > > > prints a warning about lost events. Should we stop reporting
> > > > parallelism/wall-clock if any events are lost? I just don't see it in
> > > > practice.
> > >
> > > It happens quite often when there are many activities.  Probably we can
> > > consider stopping when it detects some lost records.
> > >
> > >
> > > > Kernel could write-out a bitmask of types of lost events in PERF_RECORD_LOST.
> > >
> > > It'd be nice, but given the rate of sched switch I guess it'd most
> > > likely contain some.
> >
> > What's an easy way to get lost events? Normally on my machine I never see them.
> > Is there some parameter to make the buffer very small, or something?
> >
> >
> > > > > > The feature is added on an equal footing with the existing CPU profiling
> > > > > > rather than a separate mode enabled with special flags. The reasoning is
> > > > > > that users may not understand the problem and the meaning of numbers they
> > > > > > are seeing in the first place, so won't even realize that they may need
> > > > > > to be looking for some different profiling mode. When they are presented
> > > > > > with 2 sets of different numbers, they should start asking questions.
> > > > >
> > > > > I understand your point but I think it has some limitation so maybe it's
> > > > > better to put in a separate mode with special flags.
> > > >
> > > > I hate all options.
> > > >
> > > > The 3rd option is to add a _mandary_ flag to perf record to ask for
> > > > profiling mode. Then the old "perf record" will say "nope, you need to
> > > > choose, here are your options and explanation of the options".
> > > >
> > > > If we add it as a new separate mode, I fear it's doomed to be
> > > > unknown/unused and people will continue to not realize what they are
> > > > profiling, meaning of numbers, and continue to use wrong profiles for
> > > > the job.
> > >
> > > I think people will find it out if it really works well! :)
> >
> > Yes, but it may take 10 years for the word to spread :)
> > I am afraid the main problem is that people don't know they need to be
> > looking for something. We had 20 years of widespread multi-CPU/core
> > systems and no profiler supporting latency profiling.
> >
> > But as I said, I am open to options (at least I will know it's there
> > for my purposes :)).
> >
> >
> > > > I consider latency/wall-clock profiling as useful and as fundamental
> > > > as the current bandwidth profiling for parallel workloads (more or
> > > > less all workloads today). Yet we are in 2025 and no profilers out
> > > > there support latency profiling. Which suggests to me there is a
> > > > fundamental lack of understanding of the matter, so new optional mode
> > > > may not work well.
> > >
> > > I'm open to make it default if we can make sure it works well and have a
> > > clear interpretation of the result.
> > >
> > > >
> > > > If you are interested for context, I used a similar mode (for pprof at
> > > > the time) to optimize TensorFlow latency (people mostly care about
> > > > latency). All existing profilers said it's 99% highly optimized and
> > > > parallelized matrix multiplication, we are done here. Yet, the
> > > > wallclock profile said, nope, it's 45% Python interpreter, which
> > > > resulted in this fix:
> > > > https://github.com/soumith/convnet-benchmarks/commit/605988f847fcf261c288abc2351597d71a2ee149
> > > > Full story and examples of these profiles are here:
> > > > https://github.com/dvyukov/perf-load
> > >
> > > Great examples!  I'll take a look later.
> > >
> > > Thanks,
> > > Namhyung
> > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-13 13:40     ` [PATCH v2] perf report: " Dmitry Vyukov
  2025-01-14  1:51       ` Namhyung Kim
@ 2025-01-24 19:02       ` Namhyung Kim
  2025-01-27 10:01         ` Dmitry Vyukov
  1 sibling, 1 reply; 22+ messages in thread
From: Namhyung Kim @ 2025-01-24 19:02 UTC (permalink / raw)
  To: Dmitry Vyukov; +Cc: irogers, linux-perf-users, linux-kernel

On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. Each second of wall-clock time we have
> number-of-cores seconds of CPU time.
> 
> Currently perf only allows to profile CPU time. Perf (and all other
> existing profilers to the be best of my knowledge) does not allow to
> profile wall-clock time.
> 
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where wall-clock profile is needed:
>  - optimzing build latency
>  - optimizing server request latency
>  - optimizing ML training/inference latency
>  - optimizing running time of any command line program
> 
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).
> 
> This patch adds wall-clock and parallelization profiling.
> See the added documentation and flags descriptions for details.
> 
> Brief outline of the implementation:
>  - add context switch collection during record
>  - calculate number of threads running on CPUs (parallelism level)
>    during report
>  - divide each sample weight by the parallelism level
> This effectively models that we were taking 1 sample per unit of
> wall-clock time.
> 
> The feature is added on an equal footing with the existing CPU profiling
> rather than a separate mode enabled with special flags. The reasoning is
> that users may not understand the problem and the meaning of numbers they
> are seeing in the first place, so won't even realize that they may need
> to be looking for some different profiling mode. When they are presented
> with 2 sets of different numbers, they should start asking questions.

Can you please split the change into pieces?  It'd be helpful to
reviewrs.  I think you can do:

* add parallelism sort key
* add parallelism filter
* add wallclock output field
* organize the feature using a proper option or config

> 
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Ian Rogers <irogers@google.com>
> Cc: linux-perf-users@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
> ---
> Changes in v2:
>  - just rebased and changed subject
> ---
>  .../callchain-overhead-calculation.txt        |  5 +-
>  .../cpu-and-wallclock-overheads.txt           | 85 +++++++++++++++++
>  tools/perf/Documentation/perf-report.txt      | 50 ++++++----
>  tools/perf/Documentation/tips.txt             |  3 +
>  tools/perf/builtin-record.c                   |  6 ++
>  tools/perf/builtin-report.c                   | 30 ++++++
>  tools/perf/ui/browsers/hists.c                | 27 ++++--
>  tools/perf/ui/hist.c                          | 61 +++++++++++--
>  tools/perf/util/addr_location.c               |  1 +
>  tools/perf/util/addr_location.h               |  6 +-
>  tools/perf/util/event.c                       |  6 ++
>  tools/perf/util/events_stats.h                |  2 +
>  tools/perf/util/hist.c                        | 91 ++++++++++++++++---
>  tools/perf/util/hist.h                        | 25 ++++-
>  tools/perf/util/machine.c                     |  7 ++
>  tools/perf/util/machine.h                     |  6 ++
>  tools/perf/util/session.c                     | 12 +++
>  tools/perf/util/session.h                     |  1 +
>  tools/perf/util/sort.c                        | 59 +++++++++++-
>  tools/perf/util/sort.h                        |  1 +
>  tools/perf/util/symbol.c                      | 34 +++++++
>  tools/perf/util/symbol_conf.h                 |  7 +-
>  22 files changed, 467 insertions(+), 58 deletions(-)
>  create mode 100644 tools/perf/Documentation/cpu-and-wallclock-overheads.txt
> 
> diff --git a/tools/perf/Documentation/callchain-overhead-calculation.txt b/tools/perf/Documentation/callchain-overhead-calculation.txt
> index 1a757927195ed..e0202bf5bd1a0 100644
> --- a/tools/perf/Documentation/callchain-overhead-calculation.txt
> +++ b/tools/perf/Documentation/callchain-overhead-calculation.txt
> @@ -1,7 +1,8 @@
>  Overhead calculation
>  --------------------
> -The overhead can be shown in two columns as 'Children' and 'Self' when
> -perf collects callchains.  The 'self' overhead is simply calculated by
> +The CPU overhead can be shown in two columns as 'Children' and 'Self'
> +when perf collects callchains (and corresponding 'Wall' columns for
> +wall-clock overhead).  The 'self' overhead is simply calculated by
>  adding all period values of the entry - usually a function (symbol).
>  This is the value that perf shows traditionally and sum of all the
>  'self' overhead values should be 100%.
> diff --git a/tools/perf/Documentation/cpu-and-wallclock-overheads.txt b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
> new file mode 100644
> index 0000000000000..4f739ff4de437
> --- /dev/null
> +++ b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
> @@ -0,0 +1,85 @@
> +CPU and wall-clock overheads
> +----------------------------
> +There are two notions of time: wall-clock time and CPU time.
> +For a single-threaded program, or a program running on a single-core machine,
> +these notions are the same. However, for a multi-threaded/multi-process program
> +running on a multi-core machine, these notions are significantly different.
> +Each second of wall-clock time we have number-of-cores seconds of CPU time.
> +Perf can measure overhead for both of these times (shown in 'overhead' and
> +'wallclock' columns for CPU and wall-clock correspondingly).
> +
> +Optimizing CPU overhead is useful to improve 'throughput', while optimizing
> +wall-clock overhead is useful to improve 'latency'. It's important to understand
> +which one is useful in a concrete situation at hand. For example, the former
> +may be useful to improve max throughput of a CI build server that runs on 100%
> +CPU utilization, while the latter may be useful to improve user-perceived
> +latency of a single interactive program build.
> +These overheads may be significantly different in some cases. For example,
> +consider a program that executes function 'foo' for 9 seconds with 1 thread,
> +and then executes function 'bar' for 1 second with 128 threads (consumes
> +128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
> +While wall-clock overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize
> +running time of the program looking at the (wrong in this case) CPU overhead,
> +we would concentrate on the function 'bar', but it can yield only 10% running
> +time improvement at best.
> +
> +By default, perf shows both overheads (for parallel programs), but sorts by
> +CPU overhead:
> +
> +-----------------------------------
> +Overhead  Wallclock  Command
> +  93.88%     25.79%  cc1
> +   1.90%     39.87%  gzip
> +   0.99%     10.16%  dpkg-deb
> +   0.57%      1.00%  as
> +   0.40%      0.46%  sh
> +-----------------------------------
> +
> +To sort by wall-clock overhead, use '--sort=wallclock' flag:

Strictly speaking, 'wallclock' is not a sort key.  It's an output field
like 'overhead'.  It can be used as a sort key but it'd be meaningless
unless it's used with other sort keys or output field (using -F option).

For example, I got this with 'overhead' only.

  $ perf report -s overhead
  ...
  #
  # Overhead
  # ........
  #
     100.00%

This is because perf uses --sort option in a different meaning.  It's
more like 'group-by'.  But, probably more confusingly, it has the
original meaning when it used with -F/--fields option.

So the correct way to do it would be:

  $ perf report -F wallclock,overhead,comm -s wallclock

But it's cumbersome.. :(

Also I'm not still happy with the name "Wallclock".  If it shows any
events contributing to latency, maybe we can call it "Latency"?

> +
> +-----------------------------------
> +Wallclock  Overhead  Command
> +   39.87%     1.90%  gzip
> +   25.79%    93.88%  cc1
> +   10.16%     0.99%  dpkg-deb
> +    4.17%     0.29%  git
> +    2.81%     0.11%  objtool
> +-----------------------------------
> +
> +To get insight into the difference between the overheads, you may check
> +parallelization histogram with '--sort=wallclock,parallelism,comm,symbol --hierarchy'
> +flags. It shows fraction of (wall-clock) time the workload utilizes different
> +numbers of cores ('parallelism' column). For example, in the following case
> +the workload utilizes only 1 core most of the time, but also has some
> +highly-parallel phases, which explains significant difference between
> +CPU and wall-clock overheads:
> +
> +-----------------------------------
> +  Wallclock  Overhead     Parallelism / Command / Symbol
> ++    56.98%     2.29%     1
> ++    16.94%     1.36%     2
> ++     4.00%    20.13%     125
> ++     3.66%    18.25%     124
> ++     3.48%    17.66%     126
> ++     3.26%     0.39%     3
> ++     2.61%    12.93%     123
> +-----------------------------------
> +
> +By expanding corresponding lines, you may see what programs/functions run
> +at the given parallelism level:
> +
> +-----------------------------------
> +  Wallclock  Overhead     Parallelism / Command / Symbol
> +-    56.98%     2.29%     1
> +        32.80%     1.32%     gzip
> +         4.46%     0.18%     cc1
> +         2.81%     0.11%     objtool
> +         2.43%     0.10%     dpkg-source
> +         2.22%     0.09%     ld
> +         2.10%     0.08%     dpkg-genchanges
> +-----------------------------------
> +
> +To see the normal function-level profile for particular parallelism levels
> +(number of threads actively running on CPUs), you may use '--parallelism'
> +filter. For example, to see the profile only for low parallelism phases
> +of a workload use '--sort=wallclock,comm,symbol --parallelism=1-2' flags.
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 87f8645194062..e79ff412af0a5 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -44,7 +44,7 @@ OPTIONS
>  --comms=::
>  	Only consider symbols in these comms. CSV that understands
>  	file://filename entries.  This option will affect the percentage of
> -	the overhead column.  See --percentage for more info.
> +	the overhead and wallclock columns.  See --percentage for more info.
>  --pid=::
>          Only show events for given process ID (comma separated list).
>  
> @@ -54,12 +54,12 @@ OPTIONS
>  --dsos=::
>  	Only consider symbols in these dsos. CSV that understands
>  	file://filename entries.  This option will affect the percentage of
> -	the overhead column.  See --percentage for more info.
> +	the overhead and wallclock columns.  See --percentage for more info.
>  -S::
>  --symbols=::
>  	Only consider these symbols. CSV that understands
>  	file://filename entries.  This option will affect the percentage of
> -	the overhead column.  See --percentage for more info.
> +	the overhead and wallclock columns.  See --percentage for more info.
>  
>  --symbol-filter=::
>  	Only show symbols that match (partially) with this filter.
> @@ -68,6 +68,16 @@ OPTIONS
>  --hide-unresolved::
>          Only display entries resolved to a symbol.
>  
> +--parallelism::
> +        Only consider these parallelism levels. Parallelism level is the number
> +        of threads that actively run on CPUs at the time of sample. The flag
> +        accepts single number, comma-separated list, and ranges (for example:
> +        "1", "7,8", "1,64-128"). This is useful in understanding what a program
> +        is doing during sequential/low-parallelism phases as compared to
> +        high-parallelism phases. This option will affect the percentage of
> +        the overhead and wallclock columns. See --percentage for more info.
> +        Also see the `CPU and wall-clock overheads' section for more details.
> +
>  -s::
>  --sort=::
>  	Sort histogram entries by given key(s) - multiple keys can be specified
> @@ -87,6 +97,7 @@ OPTIONS
>  	entries are displayed as "[other]".
>  	- cpu: cpu number the task ran at the time of sample
>  	- socket: processor socket number the task ran at the time of sample
> +	- parallelism: number of running threads at the time of sample
>  	- srcline: filename and line number executed at the time of sample.  The
>  	DWARF debugging info must be provided.
>  	- srcfile: file name of the source file of the samples. Requires dwarf
> @@ -97,12 +108,14 @@ OPTIONS
>  	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
>  	- cgroup: cgroup pathname in the cgroupfs.
>  	- transaction: Transaction abort flags.
> -	- overhead: Overhead percentage of sample
> -	- overhead_sys: Overhead percentage of sample running in system mode
> -	- overhead_us: Overhead percentage of sample running in user mode
> -	- overhead_guest_sys: Overhead percentage of sample running in system mode
> +	- overhead: CPU overhead percentage of sample
> +	- wallclock: Wall-clock (latency) overhead percentage of sample.
> +	  See the `CPU and wall-clock overheads' section for more details.
> +	- overhead_sys: CPU overhead percentage of sample running in system mode
> +	- overhead_us: CPU overhead percentage of sample running in user mode
> +	- overhead_guest_sys: CPU overhead percentage of sample running in system mode
>  	on guest machine
> -	- overhead_guest_us: Overhead percentage of sample running in user mode on
> +	- overhead_guest_us: CPU overhead percentage of sample running in user mode on
>  	guest machine
>  	- sample: Number of sample
>  	- period: Raw number of event count of sample
> @@ -125,8 +138,9 @@ OPTIONS
>  	- weight2: Average value of event specific weight (2nd field of weight_struct).
>  	- weight3: Average value of event specific weight (3rd field of weight_struct).
>  
> -	By default, comm, dso and symbol keys are used.
> -	(i.e. --sort comm,dso,symbol)
> +	By default, overhead, comm, dso and symbol keys are used.
> +	(i.e. --sort overhead,comm,dso,symbol). If only wallclock is specified,
> +	(i.e. --sort wallclock), it expands to wallclock,comm,dso,symbol.
>  
>  	If --branch-stack option is used, following sort keys are also
>  	available:
> @@ -201,9 +215,9 @@ OPTIONS
>  --fields=::
>  	Specify output field - multiple keys can be specified in CSV format.
>  	Following fields are available:
> -	overhead, overhead_sys, overhead_us, overhead_children, sample, period,
> -	weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.  The
> -	last 3 names are alias for the corresponding weights.  When the weight
> +	overhead, wallclock, overhead_sys, overhead_us, overhead_children, sample,
> +	period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
> +	The last 3 names are alias for the corresponding weights.  When the weight
>  	fields are used, they will show the average value of the weight.
>  
>  	Also it can contain any sort key(s).
> @@ -289,7 +303,7 @@ OPTIONS
>  	Accumulate callchain of children to parent entry so that then can
>  	show up in the output.  The output will have a new "Children" column
>  	and will be sorted on the data.  It requires callchains are recorded.
> -	See the `overhead calculation' section for more details. Enabled by
> +	See the `Overhead calculation' section for more details. Enabled by
>  	default, disable with --no-children.
>  
>  --max-stack::
> @@ -442,9 +456,9 @@ OPTIONS
>  	--call-graph option for details.
>  
>  --percentage::
> -	Determine how to display the overhead percentage of filtered entries.
> -	Filters can be applied by --comms, --dsos and/or --symbols options and
> -	Zoom operations on the TUI (thread, dso, etc).
> +	Determine how to display the CPU and wall-clock overhead percentage
> +	of filtered entries. Filters can be applied by --comms, --dsos, --symbols
> +	and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc).
>  
>  	"relative" means it's relative to filtered entries only so that the
>  	sum of shown entries will be always 100%.  "absolute" means it retains
> @@ -627,6 +641,8 @@ include::itrace.txt[]
>  --skip-empty::
>  	Do not print 0 results in the --stat output.
>  
> +include::cpu-and-wallclock-overheads.txt[]
> +
>  include::callchain-overhead-calculation.txt[]
>  
>  SEE ALSO
> diff --git a/tools/perf/Documentation/tips.txt b/tools/perf/Documentation/tips.txt
> index 67b326ba00407..50916c58b81c5 100644
> --- a/tools/perf/Documentation/tips.txt
> +++ b/tools/perf/Documentation/tips.txt
> @@ -62,3 +62,6 @@ To show context switches in perf report sample context add --switch-events to pe
>  To show time in nanoseconds in record/report add --ns
>  To compare hot regions in two workloads use perf record -b -o file ... ; perf diff --stream file1 file2
>  To compare scalability of two workload samples use perf diff -c ratio file1 file2
> +For latency analysis, try: perf report --sort wallclock
> +For parallelism histogram, try: perf report --hierarchy --sort wallclock,parallelism,comm,symbol
> +To analyze particular parallelism levels, try: perf report --sort=wallclock,symbol --parallelism=32-64
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index f832524729211..225e3997f03cb 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -3342,6 +3342,12 @@ static struct record record = {
>  		.ctl_fd              = -1,
>  		.ctl_fd_ack          = -1,
>  		.synth               = PERF_SYNTH_ALL,
> +		/*
> +		 * This is enabled by default because it's required to show
> +		 * wallclock overhead and parallelism profile, and does not
> +		 * add lots of overhead on top of samples data.

I'm not sure about this and don't want to enable it by default.
There's --switch-events option to enable this.  Or you can add --latency
option to make it explicit and give more hints to users.

For example, if we want to target single process only it can reject
system-wide monitoring when --latency is given.

Thanks,
Namhyung

> +		 */
> +		.record_switch_events	= true,
>  	},
>  };
>  

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
  2025-01-24 19:02       ` [PATCH v2] perf report: Add wall-clock and parallelism profiling Namhyung Kim
@ 2025-01-27 10:01         ` Dmitry Vyukov
  0 siblings, 0 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-01-27 10:01 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: irogers, linux-perf-users, linux-kernel

On Fri, 24 Jan 2025 at 20:02, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Mon, Jan 13, 2025 at 02:40:06PM +0100, Dmitry Vyukov wrote:
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
> >
> > Currently perf only allows to profile CPU time. Perf (and all other
> > existing profilers to the be best of my knowledge) does not allow to
> > profile wall-clock time.
> >
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where wall-clock profile is needed:
> >  - optimzing build latency
> >  - optimizing server request latency
> >  - optimizing ML training/inference latency
> >  - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
> >
> > This patch adds wall-clock and parallelization profiling.
> > See the added documentation and flags descriptions for details.
> >
> > Brief outline of the implementation:
> >  - add context switch collection during record
> >  - calculate number of threads running on CPUs (parallelism level)
> >    during report
> >  - divide each sample weight by the parallelism level
> > This effectively models that we were taking 1 sample per unit of
> > wall-clock time.
> >
> > The feature is added on an equal footing with the existing CPU profiling
> > rather than a separate mode enabled with special flags. The reasoning is
> > that users may not understand the problem and the meaning of numbers they
> > are seeing in the first place, so won't even realize that they may need
> > to be looking for some different profiling mode. When they are presented
> > with 2 sets of different numbers, they should start asking questions.
>
> Can you please split the change into pieces?  It'd be helpful to
> reviewrs.  I think you can do:
>
> * add parallelism sort key
> * add parallelism filter
> * add wallclock output field
> * organize the feature using a proper option or config


Thanks for the review!

I've sent v3 with all comments addressed:
https://lore.kernel.org/linux-perf-users/cover.1737971364.git.dvyukov@google.com/T/#t

Please take another look.


> > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: linux-perf-users@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > ---
> > Changes in v2:
> >  - just rebased and changed subject
> > ---
> >  .../callchain-overhead-calculation.txt        |  5 +-
> >  .../cpu-and-wallclock-overheads.txt           | 85 +++++++++++++++++
> >  tools/perf/Documentation/perf-report.txt      | 50 ++++++----
> >  tools/perf/Documentation/tips.txt             |  3 +
> >  tools/perf/builtin-record.c                   |  6 ++
> >  tools/perf/builtin-report.c                   | 30 ++++++
> >  tools/perf/ui/browsers/hists.c                | 27 ++++--
> >  tools/perf/ui/hist.c                          | 61 +++++++++++--
> >  tools/perf/util/addr_location.c               |  1 +
> >  tools/perf/util/addr_location.h               |  6 +-
> >  tools/perf/util/event.c                       |  6 ++
> >  tools/perf/util/events_stats.h                |  2 +
> >  tools/perf/util/hist.c                        | 91 ++++++++++++++++---
> >  tools/perf/util/hist.h                        | 25 ++++-
> >  tools/perf/util/machine.c                     |  7 ++
> >  tools/perf/util/machine.h                     |  6 ++
> >  tools/perf/util/session.c                     | 12 +++
> >  tools/perf/util/session.h                     |  1 +
> >  tools/perf/util/sort.c                        | 59 +++++++++++-
> >  tools/perf/util/sort.h                        |  1 +
> >  tools/perf/util/symbol.c                      | 34 +++++++
> >  tools/perf/util/symbol_conf.h                 |  7 +-
> >  22 files changed, 467 insertions(+), 58 deletions(-)
> >  create mode 100644 tools/perf/Documentation/cpu-and-wallclock-overheads.txt
> >
> > diff --git a/tools/perf/Documentation/callchain-overhead-calculation.txt b/tools/perf/Documentation/callchain-overhead-calculation.txt
> > index 1a757927195ed..e0202bf5bd1a0 100644
> > --- a/tools/perf/Documentation/callchain-overhead-calculation.txt
> > +++ b/tools/perf/Documentation/callchain-overhead-calculation.txt
> > @@ -1,7 +1,8 @@
> >  Overhead calculation
> >  --------------------
> > -The overhead can be shown in two columns as 'Children' and 'Self' when
> > -perf collects callchains.  The 'self' overhead is simply calculated by
> > +The CPU overhead can be shown in two columns as 'Children' and 'Self'
> > +when perf collects callchains (and corresponding 'Wall' columns for
> > +wall-clock overhead).  The 'self' overhead is simply calculated by
> >  adding all period values of the entry - usually a function (symbol).
> >  This is the value that perf shows traditionally and sum of all the
> >  'self' overhead values should be 100%.
> > diff --git a/tools/perf/Documentation/cpu-and-wallclock-overheads.txt b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
> > new file mode 100644
> > index 0000000000000..4f739ff4de437
> > --- /dev/null
> > +++ b/tools/perf/Documentation/cpu-and-wallclock-overheads.txt
> > @@ -0,0 +1,85 @@
> > +CPU and wall-clock overheads
> > +----------------------------
> > +There are two notions of time: wall-clock time and CPU time.
> > +For a single-threaded program, or a program running on a single-core machine,
> > +these notions are the same. However, for a multi-threaded/multi-process program
> > +running on a multi-core machine, these notions are significantly different.
> > +Each second of wall-clock time we have number-of-cores seconds of CPU time.
> > +Perf can measure overhead for both of these times (shown in 'overhead' and
> > +'wallclock' columns for CPU and wall-clock correspondingly).
> > +
> > +Optimizing CPU overhead is useful to improve 'throughput', while optimizing
> > +wall-clock overhead is useful to improve 'latency'. It's important to understand
> > +which one is useful in a concrete situation at hand. For example, the former
> > +may be useful to improve max throughput of a CI build server that runs on 100%
> > +CPU utilization, while the latter may be useful to improve user-perceived
> > +latency of a single interactive program build.
> > +These overheads may be significantly different in some cases. For example,
> > +consider a program that executes function 'foo' for 9 seconds with 1 thread,
> > +and then executes function 'bar' for 1 second with 128 threads (consumes
> > +128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
> > +While wall-clock overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize
> > +running time of the program looking at the (wrong in this case) CPU overhead,
> > +we would concentrate on the function 'bar', but it can yield only 10% running
> > +time improvement at best.
> > +
> > +By default, perf shows both overheads (for parallel programs), but sorts by
> > +CPU overhead:
> > +
> > +-----------------------------------
> > +Overhead  Wallclock  Command
> > +  93.88%     25.79%  cc1
> > +   1.90%     39.87%  gzip
> > +   0.99%     10.16%  dpkg-deb
> > +   0.57%      1.00%  as
> > +   0.40%      0.46%  sh
> > +-----------------------------------
> > +
> > +To sort by wall-clock overhead, use '--sort=wallclock' flag:
>
> Strictly speaking, 'wallclock' is not a sort key.  It's an output field
> like 'overhead'.  It can be used as a sort key but it'd be meaningless
> unless it's used with other sort keys or output field (using -F option).
>
> For example, I got this with 'overhead' only.
>
>   $ perf report -s overhead
>   ...
>   #
>   # Overhead
>   # ........
>   #
>      100.00%
>
> This is because perf uses --sort option in a different meaning.  It's
> more like 'group-by'.  But, probably more confusingly, it has the
> original meaning when it used with -F/--fields option.
>
> So the correct way to do it would be:
>
>   $ perf report -F wallclock,overhead,comm -s wallclock
>
> But it's cumbersome.. :(
>
> Also I'm not still happy with the name "Wallclock".  If it shows any
> events contributing to latency, maybe we can call it "Latency"?
>
> > +
> > +-----------------------------------
> > +Wallclock  Overhead  Command
> > +   39.87%     1.90%  gzip
> > +   25.79%    93.88%  cc1
> > +   10.16%     0.99%  dpkg-deb
> > +    4.17%     0.29%  git
> > +    2.81%     0.11%  objtool
> > +-----------------------------------
> > +
> > +To get insight into the difference between the overheads, you may check
> > +parallelization histogram with '--sort=wallclock,parallelism,comm,symbol --hierarchy'
> > +flags. It shows fraction of (wall-clock) time the workload utilizes different
> > +numbers of cores ('parallelism' column). For example, in the following case
> > +the workload utilizes only 1 core most of the time, but also has some
> > +highly-parallel phases, which explains significant difference between
> > +CPU and wall-clock overheads:
> > +
> > +-----------------------------------
> > +  Wallclock  Overhead     Parallelism / Command / Symbol
> > ++    56.98%     2.29%     1
> > ++    16.94%     1.36%     2
> > ++     4.00%    20.13%     125
> > ++     3.66%    18.25%     124
> > ++     3.48%    17.66%     126
> > ++     3.26%     0.39%     3
> > ++     2.61%    12.93%     123
> > +-----------------------------------
> > +
> > +By expanding corresponding lines, you may see what programs/functions run
> > +at the given parallelism level:
> > +
> > +-----------------------------------
> > +  Wallclock  Overhead     Parallelism / Command / Symbol
> > +-    56.98%     2.29%     1
> > +        32.80%     1.32%     gzip
> > +         4.46%     0.18%     cc1
> > +         2.81%     0.11%     objtool
> > +         2.43%     0.10%     dpkg-source
> > +         2.22%     0.09%     ld
> > +         2.10%     0.08%     dpkg-genchanges
> > +-----------------------------------
> > +
> > +To see the normal function-level profile for particular parallelism levels
> > +(number of threads actively running on CPUs), you may use '--parallelism'
> > +filter. For example, to see the profile only for low parallelism phases
> > +of a workload use '--sort=wallclock,comm,symbol --parallelism=1-2' flags.
> > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > index 87f8645194062..e79ff412af0a5 100644
> > --- a/tools/perf/Documentation/perf-report.txt
> > +++ b/tools/perf/Documentation/perf-report.txt
> > @@ -44,7 +44,7 @@ OPTIONS
> >  --comms=::
> >       Only consider symbols in these comms. CSV that understands
> >       file://filename entries.  This option will affect the percentage of
> > -     the overhead column.  See --percentage for more info.
> > +     the overhead and wallclock columns.  See --percentage for more info.
> >  --pid=::
> >          Only show events for given process ID (comma separated list).
> >
> > @@ -54,12 +54,12 @@ OPTIONS
> >  --dsos=::
> >       Only consider symbols in these dsos. CSV that understands
> >       file://filename entries.  This option will affect the percentage of
> > -     the overhead column.  See --percentage for more info.
> > +     the overhead and wallclock columns.  See --percentage for more info.
> >  -S::
> >  --symbols=::
> >       Only consider these symbols. CSV that understands
> >       file://filename entries.  This option will affect the percentage of
> > -     the overhead column.  See --percentage for more info.
> > +     the overhead and wallclock columns.  See --percentage for more info.
> >
> >  --symbol-filter=::
> >       Only show symbols that match (partially) with this filter.
> > @@ -68,6 +68,16 @@ OPTIONS
> >  --hide-unresolved::
> >          Only display entries resolved to a symbol.
> >
> > +--parallelism::
> > +        Only consider these parallelism levels. Parallelism level is the number
> > +        of threads that actively run on CPUs at the time of sample. The flag
> > +        accepts single number, comma-separated list, and ranges (for example:
> > +        "1", "7,8", "1,64-128"). This is useful in understanding what a program
> > +        is doing during sequential/low-parallelism phases as compared to
> > +        high-parallelism phases. This option will affect the percentage of
> > +        the overhead and wallclock columns. See --percentage for more info.
> > +        Also see the `CPU and wall-clock overheads' section for more details.
> > +
> >  -s::
> >  --sort=::
> >       Sort histogram entries by given key(s) - multiple keys can be specified
> > @@ -87,6 +97,7 @@ OPTIONS
> >       entries are displayed as "[other]".
> >       - cpu: cpu number the task ran at the time of sample
> >       - socket: processor socket number the task ran at the time of sample
> > +     - parallelism: number of running threads at the time of sample
> >       - srcline: filename and line number executed at the time of sample.  The
> >       DWARF debugging info must be provided.
> >       - srcfile: file name of the source file of the samples. Requires dwarf
> > @@ -97,12 +108,14 @@ OPTIONS
> >       - cgroup_id: ID derived from cgroup namespace device and inode numbers.
> >       - cgroup: cgroup pathname in the cgroupfs.
> >       - transaction: Transaction abort flags.
> > -     - overhead: Overhead percentage of sample
> > -     - overhead_sys: Overhead percentage of sample running in system mode
> > -     - overhead_us: Overhead percentage of sample running in user mode
> > -     - overhead_guest_sys: Overhead percentage of sample running in system mode
> > +     - overhead: CPU overhead percentage of sample
> > +     - wallclock: Wall-clock (latency) overhead percentage of sample.
> > +       See the `CPU and wall-clock overheads' section for more details.
> > +     - overhead_sys: CPU overhead percentage of sample running in system mode
> > +     - overhead_us: CPU overhead percentage of sample running in user mode
> > +     - overhead_guest_sys: CPU overhead percentage of sample running in system mode
> >       on guest machine
> > -     - overhead_guest_us: Overhead percentage of sample running in user mode on
> > +     - overhead_guest_us: CPU overhead percentage of sample running in user mode on
> >       guest machine
> >       - sample: Number of sample
> >       - period: Raw number of event count of sample
> > @@ -125,8 +138,9 @@ OPTIONS
> >       - weight2: Average value of event specific weight (2nd field of weight_struct).
> >       - weight3: Average value of event specific weight (3rd field of weight_struct).
> >
> > -     By default, comm, dso and symbol keys are used.
> > -     (i.e. --sort comm,dso,symbol)
> > +     By default, overhead, comm, dso and symbol keys are used.
> > +     (i.e. --sort overhead,comm,dso,symbol). If only wallclock is specified,
> > +     (i.e. --sort wallclock), it expands to wallclock,comm,dso,symbol.
> >
> >       If --branch-stack option is used, following sort keys are also
> >       available:
> > @@ -201,9 +215,9 @@ OPTIONS
> >  --fields=::
> >       Specify output field - multiple keys can be specified in CSV format.
> >       Following fields are available:
> > -     overhead, overhead_sys, overhead_us, overhead_children, sample, period,
> > -     weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.  The
> > -     last 3 names are alias for the corresponding weights.  When the weight
> > +     overhead, wallclock, overhead_sys, overhead_us, overhead_children, sample,
> > +     period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
> > +     The last 3 names are alias for the corresponding weights.  When the weight
> >       fields are used, they will show the average value of the weight.
> >
> >       Also it can contain any sort key(s).
> > @@ -289,7 +303,7 @@ OPTIONS
> >       Accumulate callchain of children to parent entry so that then can
> >       show up in the output.  The output will have a new "Children" column
> >       and will be sorted on the data.  It requires callchains are recorded.
> > -     See the `overhead calculation' section for more details. Enabled by
> > +     See the `Overhead calculation' section for more details. Enabled by
> >       default, disable with --no-children.
> >
> >  --max-stack::
> > @@ -442,9 +456,9 @@ OPTIONS
> >       --call-graph option for details.
> >
> >  --percentage::
> > -     Determine how to display the overhead percentage of filtered entries.
> > -     Filters can be applied by --comms, --dsos and/or --symbols options and
> > -     Zoom operations on the TUI (thread, dso, etc).
> > +     Determine how to display the CPU and wall-clock overhead percentage
> > +     of filtered entries. Filters can be applied by --comms, --dsos, --symbols
> > +     and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc).
> >
> >       "relative" means it's relative to filtered entries only so that the
> >       sum of shown entries will be always 100%.  "absolute" means it retains
> > @@ -627,6 +641,8 @@ include::itrace.txt[]
> >  --skip-empty::
> >       Do not print 0 results in the --stat output.
> >
> > +include::cpu-and-wallclock-overheads.txt[]
> > +
> >  include::callchain-overhead-calculation.txt[]
> >
> >  SEE ALSO
> > diff --git a/tools/perf/Documentation/tips.txt b/tools/perf/Documentation/tips.txt
> > index 67b326ba00407..50916c58b81c5 100644
> > --- a/tools/perf/Documentation/tips.txt
> > +++ b/tools/perf/Documentation/tips.txt
> > @@ -62,3 +62,6 @@ To show context switches in perf report sample context add --switch-events to pe
> >  To show time in nanoseconds in record/report add --ns
> >  To compare hot regions in two workloads use perf record -b -o file ... ; perf diff --stream file1 file2
> >  To compare scalability of two workload samples use perf diff -c ratio file1 file2
> > +For latency analysis, try: perf report --sort wallclock
> > +For parallelism histogram, try: perf report --hierarchy --sort wallclock,parallelism,comm,symbol
> > +To analyze particular parallelism levels, try: perf report --sort=wallclock,symbol --parallelism=32-64
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index f832524729211..225e3997f03cb 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -3342,6 +3342,12 @@ static struct record record = {
> >               .ctl_fd              = -1,
> >               .ctl_fd_ack          = -1,
> >               .synth               = PERF_SYNTH_ALL,
> > +             /*
> > +              * This is enabled by default because it's required to show
> > +              * wallclock overhead and parallelism profile, and does not
> > +              * add lots of overhead on top of samples data.
>
> I'm not sure about this and don't want to enable it by default.
> There's --switch-events option to enable this.  Or you can add --latency
> option to make it explicit and give more hints to users.
>
> For example, if we want to target single process only it can reject
> system-wide monitoring when --latency is given.
>
> Thanks,
> Namhyung
>
> > +              */
> > +             .record_switch_events   = true,
> >       },
> >  };
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-01-27 10:02 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-08  8:24 [PATCH] tools/perf: Add wall-clock and parallelism profiling Dmitry Vyukov
2025-01-08  8:34 ` Dmitry Vyukov
2025-01-13 12:25   ` Dmitry Vyukov
2025-01-13 13:40     ` [PATCH v2] perf report: " Dmitry Vyukov
2025-01-14  1:51       ` Namhyung Kim
2025-01-14  8:26         ` Dmitry Vyukov
2025-01-14 15:56           ` Arnaldo Carvalho de Melo
2025-01-14 16:07             ` Dmitry Vyukov
2025-01-14 17:52               ` Arnaldo Carvalho de Melo
2025-01-14 18:16                 ` Arnaldo Carvalho de Melo
2025-01-19 10:22                 ` Dmitry Vyukov
2025-01-15  0:30           ` Namhyung Kim
2025-01-19 10:50             ` Dmitry Vyukov
2025-01-24 10:46               ` Dmitry Vyukov
2025-01-24 17:56                 ` Namhyung Kim
2025-01-15  5:59           ` Ian Rogers
2025-01-15  7:11             ` Dmitry Vyukov
2025-01-16 18:55               ` Namhyung Kim
2025-01-19 11:08                 ` Off-CPU sampling (was perf report: Add wall-clock and parallelism profiling) Dmitry Vyukov
2025-01-23 23:34                   ` Namhyung Kim
2025-01-24 19:02       ` [PATCH v2] perf report: Add wall-clock and parallelism profiling Namhyung Kim
2025-01-27 10:01         ` Dmitry Vyukov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).