* [PATCH v5 0/8] perf report: Add latency and parallelism profiling
@ 2025-02-05 16:27 Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 1/8] perf report: Add machine parallelism Dmitry Vyukov
` (8 more replies)
0 siblings, 9 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
There are two notions of time: wall-clock time and CPU time.
For a single-threaded program, or a program running on a single-core
machine, these notions are the same. However, for a multi-threaded/
multi-process program running on a multi-core machine, these notions are
significantly different. For each second of wall-clock time there are
number-of-cores seconds of CPU time.
Currently perf only allows profiling CPU time. Perf (and, to the best of my
knowledge, all other existing profilers) does not allow profiling
wall-clock time.
Optimizing CPU overhead is useful to improve 'throughput', while
optimizing wall-clock overhead is useful to improve 'latency'.
These profiles are complementary and are not interchangeable.
Examples of where a latency profile is needed:
- optimizing build latency
- optimizing server request latency
- optimizing ML training/inference latency
- optimizing running time of any command line program
A CPU profile is useless for these use cases at best (if the user understands
the difference), or misleading at worst (if the user applies the wrong
profile to the job).
This series adds latency and parallelism profiling.
See the added documentation and flags descriptions for details.
Brief outline of the implementation:
- add context switch collection during record
- calculate number of threads running on CPUs (parallelism level)
during report
- divide each sample weight by the parallelism level
This effectively models taking 1 sample per unit of wall-clock time.
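A purely hypothetical example of the difference: on a 4-core machine, suppose
a program runs 4 threads for 10 seconds and then a single thread for another
10 seconds. The parallel phase accounts for 40 of the 50 CPU-seconds, so the
CPU profile attributes 80% of the overhead to it. The latency profile divides
each sample's weight by the parallelism level (40/4 = 10 vs 10/1 = 10) and
attributes 50% to each phase, matching their equal contribution to wall-clock
time.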
We still default to the CPU profile, so it's up to users to learn
about the second profiling mode and use it when appropriate.
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Changes in v5:
- fixed formatting of latency field in --stdout mode
- added description of --latency flag in Documentation flags
Changes in v4:
- added "Shrink struct hist_entry size" commit
- rebased to perf-tools-next HEAD
Changes in v3:
- rebase and split into patches
- rename 'wallclock' to 'latency' everywhere
- don't enable latency profiling by default,
instead add record/report --latency flag
Dmitry Vyukov (8):
perf report: Add machine parallelism
perf report: Add parallelism sort key
perf report: Switch filtered from u8 to u16
perf report: Add parallelism filter
perf report: Add latency output field
perf report: Add --latency flag
perf report: Add latency and parallelism profiling documentation
perf hist: Shrink struct hist_entry size
.../callchain-overhead-calculation.txt | 5 +-
.../cpu-and-latency-overheads.txt | 85 ++++++++++++++
tools/perf/Documentation/perf-record.txt | 4 +
tools/perf/Documentation/perf-report.txt | 54 ++++++---
tools/perf/Documentation/tips.txt | 3 +
tools/perf/builtin-record.c | 20 ++++
tools/perf/builtin-report.c | 39 +++++++
tools/perf/ui/browsers/hists.c | 27 +++--
tools/perf/ui/hist.c | 104 ++++++++++++------
tools/perf/util/addr_location.c | 1 +
tools/perf/util/addr_location.h | 7 +-
tools/perf/util/event.c | 11 ++
tools/perf/util/events_stats.h | 2 +
tools/perf/util/hist.c | 90 ++++++++++++---
tools/perf/util/hist.h | 32 +++++-
tools/perf/util/machine.c | 7 ++
tools/perf/util/machine.h | 6 +
tools/perf/util/sample.h | 2 +-
tools/perf/util/session.c | 12 ++
tools/perf/util/session.h | 1 +
tools/perf/util/sort.c | 69 +++++++++++-
tools/perf/util/sort.h | 3 +-
tools/perf/util/symbol.c | 34 ++++++
tools/perf/util/symbol_conf.h | 8 +-
24 files changed, 531 insertions(+), 95 deletions(-)
create mode 100644 tools/perf/Documentation/cpu-and-latency-overheads.txt
base-commit: 8ce0d2da14d3fb62844dd0e95982c194326b1a5f
--
2.48.1.362.g079036d154-goog
* [PATCH v5 1/8] perf report: Add machine parallelism
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 2/8] perf report: Add parallelism sort key Dmitry Vyukov
` (7 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
Add calculation of the current parallelism level (the number of threads
actively running on CPUs). The parallelism level can be shown in reports on
its own, and is used to calculate latency overheads.
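A minimal sketch of the bookkeeping (illustrative names only; the real code
is in the machine__process_switch_event() and machine__resolve() hunks below):

#include <stdbool.h>

static int parallelism = 1;     /* no initial switch-in event, so start at 1 */

/* Called for every context switch event in the stream. */
static void on_context_switch(bool switch_out)
{
        parallelism += switch_out ? -1 : 1;
}

/* Called when resolving a sample: clamp to [1, nr_cpus] to guard against
 * out-of-order switch events. */
static int sample_parallelism(int nr_cpus)
{
        int p = parallelism;

        if (p < 1)
                p = 1;
        if (p > nr_cpus)
                p = nr_cpus;
        return p;
}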
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
tools/perf/builtin-report.c | 1 +
tools/perf/util/addr_location.c | 1 +
tools/perf/util/addr_location.h | 2 ++
tools/perf/util/event.c | 3 +++
tools/perf/util/machine.c | 7 +++++++
tools/perf/util/machine.h | 6 ++++++
6 files changed, 20 insertions(+)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index f5fbd670d619a..0d9bd090eda71 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1568,6 +1568,7 @@ int cmd_report(int argc, const char **argv)
report.tool.cgroup = perf_event__process_cgroup;
report.tool.exit = perf_event__process_exit;
report.tool.fork = perf_event__process_fork;
+ report.tool.context_switch = perf_event__process_switch;
report.tool.lost = perf_event__process_lost;
report.tool.read = process_read_event;
report.tool.attr = process_attr;
diff --git a/tools/perf/util/addr_location.c b/tools/perf/util/addr_location.c
index 51825ef8c0ab7..007a2f5df9a6a 100644
--- a/tools/perf/util/addr_location.c
+++ b/tools/perf/util/addr_location.c
@@ -17,6 +17,7 @@ void addr_location__init(struct addr_location *al)
al->cpumode = 0;
al->cpu = 0;
al->socket = 0;
+ al->parallelism = 1;
}
/*
diff --git a/tools/perf/util/addr_location.h b/tools/perf/util/addr_location.h
index d8ac0428dff23..36aaa45445f24 100644
--- a/tools/perf/util/addr_location.h
+++ b/tools/perf/util/addr_location.h
@@ -21,6 +21,8 @@ struct addr_location {
u8 cpumode;
s32 cpu;
s32 socket;
+ /* Same as machine.parallelism but within [1, nr_cpus]. */
+ int parallelism;
};
void addr_location__init(struct addr_location *al);
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d19170..3d3ac58e1c660 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -767,6 +767,9 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
al->socket = env->cpu[al->cpu].socket_id;
}
+ /* Account for possible out-of-order switch events. */
+ al->parallelism = min(max(machine->parallelism, 1), machine__nr_cpus_avail(machine));
+
if (al->map) {
if (symbol_conf.dso_list &&
(!dso || !(strlist__has_entry(symbol_conf.dso_list,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2d51badfbf2e2..0e65c5377f983 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -94,6 +94,8 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
machine->comm_exec = false;
machine->kernel_start = 0;
machine->vmlinux_map = NULL;
+ /* There is no initial context switch in, so we start at 1. */
+ machine->parallelism = 1;
machine->root_dir = strdup(root_dir);
if (machine->root_dir == NULL)
@@ -677,8 +679,11 @@ int machine__process_aux_output_hw_id_event(struct machine *machine __maybe_unus
int machine__process_switch_event(struct machine *machine __maybe_unused,
union perf_event *event)
{
+ bool out = event->header.misc & PERF_RECORD_MISC_SWITCH_OUT;
+
if (dump_trace)
perf_event__fprintf_switch(event, stdout);
+ machine->parallelism += out ? -1 : 1;
return 0;
}
@@ -1900,6 +1905,8 @@ int machine__process_exit_event(struct machine *machine, union perf_event *event
if (dump_trace)
perf_event__fprintf_task(event, stdout);
+ /* There is no context switch out before exit, so we decrement here. */
+ machine->parallelism--;
if (thread != NULL) {
if (symbol_conf.keep_exited_threads)
thread__set_exited(thread, /*exited=*/true);
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 2e5a4cb342d82..a5f28652873ce 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -50,6 +50,12 @@ struct machine {
u64 text_start;
u64 text_end;
} sched, lock, traceiter, trace;
+ /*
+ * The current parallelism level (number of threads that run on CPUs).
+ * This value can be less than 1, or larger than the total number
+ * of CPUs, if events are poorly ordered.
+ */
+ int parallelism;
pid_t *current_tid;
size_t current_tid_sz;
union { /* Tool specific area */
--
2.48.1.362.g079036d154-goog
* [PATCH v5 2/8] perf report: Add parallelism sort key
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 1/8] perf report: Add machine parallelism Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 3/8] perf report: Switch filtered from u8 to u16 Dmitry Vyukov
` (6 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
Show the parallelism level in profiles when requested by the user.
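For example, with data recorded with context switch events (perf record
--switch-events), 'perf report -s parallelism' adds a Parallelism column and
groups samples by the parallelism level observed at sample time.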
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
tools/perf/builtin-report.c | 11 +++++++++++
tools/perf/util/hist.c | 2 ++
tools/perf/util/hist.h | 3 +++
tools/perf/util/session.c | 12 ++++++++++++
tools/perf/util/session.h | 1 +
tools/perf/util/sort.c | 23 +++++++++++++++++++++++
tools/perf/util/sort.h | 1 +
7 files changed, 53 insertions(+)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 0d9bd090eda71..14d49f0625881 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1720,6 +1720,17 @@ int cmd_report(int argc, const char **argv)
symbol_conf.annotate_data_sample = true;
}
+ if (report.disable_order || !perf_session__has_switch_events(session)) {
+ if ((sort_order && strstr(sort_order, "parallelism")) ||
+ (field_order && strstr(field_order, "parallelism"))) {
+ if (report.disable_order)
+ ui__error("Use of parallelism is incompatible with --disable-order.\n");
+ else
+ ui__error("Use of parallelism requires --switch-events during record.\n");
+ return -1;
+ }
+ }
+
if (sort_order && strstr(sort_order, "ipc")) {
parse_options_usage(report_usage, options, "s", 1);
goto error;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 0f30f843c566d..cafd693568189 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -207,6 +207,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
hists__new_col_len(hists, HISTC_CGROUP, 6);
hists__new_col_len(hists, HISTC_CGROUP_ID, 20);
+ hists__new_col_len(hists, HISTC_PARALLELISM, 11);
hists__new_col_len(hists, HISTC_CPU, 3);
hists__new_col_len(hists, HISTC_SOCKET, 6);
hists__new_col_len(hists, HISTC_MEM_LOCKED, 6);
@@ -741,6 +742,7 @@ __hists__add_entry(struct hists *hists,
.ip = al->addr,
.level = al->level,
.code_page_size = sample->code_page_size,
+ .parallelism = al->parallelism,
.stat = {
.nr_events = 1,
.period = sample->period,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 46c8373e31465..a6e662d77dc24 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -42,6 +42,7 @@ enum hist_column {
HISTC_CGROUP_ID,
HISTC_CGROUP,
HISTC_PARENT,
+ HISTC_PARALLELISM,
HISTC_CPU,
HISTC_SOCKET,
HISTC_SRCLINE,
@@ -228,6 +229,7 @@ struct hist_entry {
u64 transaction;
s32 socket;
s32 cpu;
+ int parallelism;
u64 code_page_size;
u64 weight;
u64 ins_lat;
@@ -580,6 +582,7 @@ bool perf_hpp__is_thread_entry(struct perf_hpp_fmt *fmt);
bool perf_hpp__is_comm_entry(struct perf_hpp_fmt *fmt);
bool perf_hpp__is_dso_entry(struct perf_hpp_fmt *fmt);
bool perf_hpp__is_sym_entry(struct perf_hpp_fmt *fmt);
+bool perf_hpp__is_parallelism_entry(struct perf_hpp_fmt *fmt);
struct perf_hpp_fmt *perf_hpp_fmt__dup(struct perf_hpp_fmt *fmt);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index c06e3020a9769..00fcf8d8ac255 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2403,6 +2403,18 @@ bool perf_session__has_traces(struct perf_session *session, const char *msg)
return false;
}
+bool perf_session__has_switch_events(struct perf_session *session)
+{
+ struct evsel *evsel;
+
+ evlist__for_each_entry(session->evlist, evsel) {
+ if (evsel->core.attr.context_switch)
+ return true;
+ }
+
+ return false;
+}
+
int map__set_kallsyms_ref_reloc_sym(struct map *map, const char *symbol_name, u64 addr)
{
char *bracket;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index bcf1bcf06959b..db1c120a9e67f 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -141,6 +141,7 @@ int perf_session__resolve_callchain(struct perf_session *session,
struct symbol **parent);
bool perf_session__has_traces(struct perf_session *session, const char *msg);
+bool perf_session__has_switch_events(struct perf_session *session);
void perf_event__attr_swap(struct perf_event_attr *attr);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 3dd33721823f3..7eef43f5be360 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -892,6 +892,27 @@ struct sort_entry sort_cpu = {
.se_width_idx = HISTC_CPU,
};
+/* --sort parallelism */
+
+static int64_t
+sort__parallelism_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+ return right->parallelism - left->parallelism;
+}
+
+static int hist_entry__parallelism_snprintf(struct hist_entry *he, char *bf,
+ size_t size, unsigned int width)
+{
+ return repsep_snprintf(bf, size, "%*d", width, he->parallelism);
+}
+
+struct sort_entry sort_parallelism = {
+ .se_header = "Parallelism",
+ .se_cmp = sort__parallelism_cmp,
+ .se_snprintf = hist_entry__parallelism_snprintf,
+ .se_width_idx = HISTC_PARALLELISM,
+};
+
/* --sort cgroup_id */
static int64_t _sort__cgroup_dev_cmp(u64 left_dev, u64 right_dev)
@@ -2534,6 +2555,7 @@ static struct sort_dimension common_sort_dimensions[] = {
DIM(SORT_ANNOTATE_DATA_TYPE_OFFSET, "typeoff", sort_type_offset),
DIM(SORT_SYM_OFFSET, "symoff", sort_sym_offset),
DIM(SORT_ANNOTATE_DATA_TYPE_CACHELINE, "typecln", sort_type_cacheline),
+ DIM(SORT_PARALLELISM, "parallelism", sort_parallelism),
};
#undef DIM
@@ -2735,6 +2757,7 @@ MK_SORT_ENTRY_CHK(thread)
MK_SORT_ENTRY_CHK(comm)
MK_SORT_ENTRY_CHK(dso)
MK_SORT_ENTRY_CHK(sym)
+MK_SORT_ENTRY_CHK(parallelism)
static bool __sort__hpp_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index a8572574e1686..11fb15f914093 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -72,6 +72,7 @@ enum sort_type {
SORT_ANNOTATE_DATA_TYPE_OFFSET,
SORT_SYM_OFFSET,
SORT_ANNOTATE_DATA_TYPE_CACHELINE,
+ SORT_PARALLELISM,
/* branch stack specific sort keys */
__SORT_BRANCH_STACK,
--
2.48.1.362.g079036d154-goog
* [PATCH v5 3/8] perf report: Switch filtered from u8 to u16
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 1/8] perf report: Add machine parallelism Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 2/8] perf report: Add parallelism sort key Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 4/8] perf report: Add parallelism filter Dmitry Vyukov
` (5 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
All u8 bits are already taken, so adding one more filter leads to an
unpleasant failure mode where the code compiles without warnings but the
last filters silently don't work. Add a typedef and switch to u16.
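To illustrate the failure mode with a hypothetical 9th filter bit:

        u8 filtered = 0;

        filtered |= 1 << 8;     /* 256 does not fit into u8: the value is
                                 * silently truncated to 0, the filter never
                                 * applies, and there is no warning by default */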
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
tools/perf/util/addr_location.h | 3 ++-
tools/perf/util/hist.c | 2 +-
tools/perf/util/hist.h | 4 +++-
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/tools/perf/util/addr_location.h b/tools/perf/util/addr_location.h
index 36aaa45445f24..f83d74e370b2f 100644
--- a/tools/perf/util/addr_location.h
+++ b/tools/perf/util/addr_location.h
@@ -3,6 +3,7 @@
#define __PERF_ADDR_LOCATION 1
#include <linux/types.h>
+#include "hist.h"
struct thread;
struct maps;
@@ -17,7 +18,7 @@ struct addr_location {
const char *srcline;
u64 addr;
char level;
- u8 filtered;
+ filter_mask_t filtered;
u8 cpumode;
s32 cpu;
s32 socket;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index cafd693568189..6b8f8da8d3b66 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -585,7 +585,7 @@ static struct hist_entry *hist_entry__new(struct hist_entry *template,
return he;
}
-static u8 symbol__parent_filter(const struct symbol *parent)
+static filter_mask_t symbol__parent_filter(const struct symbol *parent)
{
if (symbol_conf.exclude_other && parent == NULL)
return 1 << HIST_FILTER__PARENT;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a6e662d77dc24..4035106a74087 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -33,6 +33,8 @@ enum hist_filter {
HIST_FILTER__C2C,
};
+typedef u16 filter_mask_t;
+
enum hist_column {
HISTC_SYMBOL,
HISTC_TIME,
@@ -244,7 +246,7 @@ struct hist_entry {
bool leaf;
char level;
- u8 filtered;
+ filter_mask_t filtered;
u16 callchain_size;
union {
--
2.48.1.362.g079036d154-goog
* [PATCH v5 4/8] perf report: Add parallelism filter
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
` (2 preceding siblings ...)
2025-02-05 16:27 ` [PATCH v5 3/8] perf report: Switch filtered from u8 to u16 Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 5/8] perf report: Add latency output field Dmitry Vyukov
` (4 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
Add a parallelism filter that can be used to look only at specific parallelism
levels. The format is the same as for cpu lists. For example:
Only single-threaded samples: --parallelism=1
Low parallelism only: --parallelism=1-4
High parallelism only: --parallelism=64-128
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
tools/perf/builtin-report.c | 5 ++++-
tools/perf/util/event.c | 2 ++
tools/perf/util/hist.c | 31 +++++++++++++++++++++++++++++++
tools/perf/util/hist.h | 6 +++++-
tools/perf/util/sort.c | 11 +++++++++++
tools/perf/util/symbol.c | 34 ++++++++++++++++++++++++++++++++++
tools/perf/util/symbol_conf.h | 4 ++++
7 files changed, 91 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 14d49f0625881..2a19abdc869a1 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1390,6 +1390,8 @@ int cmd_report(int argc, const char **argv)
symbol__config_symfs),
OPT_STRING('C', "cpu", &report.cpu_list, "cpu",
"list of cpus to profile"),
+ OPT_STRING(0, "parallelism", &symbol_conf.parallelism_list_str, "parallelism",
+ "only consider these parallelism levels (cpu set format)"),
OPT_BOOLEAN('I', "show-info", &report.show_full_info,
"Display extended information about perf.data file"),
OPT_BOOLEAN(0, "source", &annotate_opts.annotate_src,
@@ -1721,7 +1723,8 @@ int cmd_report(int argc, const char **argv)
}
if (report.disable_order || !perf_session__has_switch_events(session)) {
- if ((sort_order && strstr(sort_order, "parallelism")) ||
+ if (symbol_conf.parallelism_list_str ||
+ (sort_order && strstr(sort_order, "parallelism")) ||
(field_order && strstr(field_order, "parallelism"))) {
if (report.disable_order)
ui__error("Use of parallelism is incompatible with --disable-order.\n");
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 3d3ac58e1c660..bea9698e0c6bf 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -769,6 +769,8 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
/* Account for possible out-of-order switch events. */
al->parallelism = min(max(machine->parallelism, 1), machine__nr_cpus_avail(machine));
+ if (test_bit(al->parallelism, symbol_conf.parallelism_filter))
+ al->filtered |= (1 << HIST_FILTER__PARALLELISM);
if (al->map) {
if (symbol_conf.dso_list &&
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6b8f8da8d3b66..446342246f5ee 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -43,6 +43,8 @@ static bool hists__filter_entry_by_symbol(struct hists *hists,
struct hist_entry *he);
static bool hists__filter_entry_by_socket(struct hists *hists,
struct hist_entry *he);
+static bool hists__filter_entry_by_parallelism(struct hists *hists,
+ struct hist_entry *he);
u16 hists__col_len(struct hists *hists, enum hist_column col)
{
@@ -1457,6 +1459,10 @@ static void hist_entry__check_and_remove_filter(struct hist_entry *he,
if (symbol_conf.sym_list == NULL)
return;
break;
+ case HIST_FILTER__PARALLELISM:
+ if (__bitmap_weight(symbol_conf.parallelism_filter, MAX_NR_CPUS + 1) == 0)
+ return;
+ break;
case HIST_FILTER__PARENT:
case HIST_FILTER__GUEST:
case HIST_FILTER__HOST:
@@ -1515,6 +1521,9 @@ static void hist_entry__apply_hierarchy_filters(struct hist_entry *he)
hist_entry__check_and_remove_filter(he, HIST_FILTER__SYMBOL,
perf_hpp__is_sym_entry);
+ hist_entry__check_and_remove_filter(he, HIST_FILTER__PARALLELISM,
+ perf_hpp__is_parallelism_entry);
+
hists__apply_filters(he->hists, he);
}
@@ -1711,6 +1720,7 @@ static void hists__apply_filters(struct hists *hists, struct hist_entry *he)
hists__filter_entry_by_thread(hists, he);
hists__filter_entry_by_symbol(hists, he);
hists__filter_entry_by_socket(hists, he);
+ hists__filter_entry_by_parallelism(hists, he);
}
int hists__collapse_resort(struct hists *hists, struct ui_progress *prog)
@@ -2197,6 +2207,16 @@ static bool hists__filter_entry_by_socket(struct hists *hists,
return false;
}
+static bool hists__filter_entry_by_parallelism(struct hists *hists,
+ struct hist_entry *he)
+{
+ if (test_bit(he->parallelism, hists->parallelism_filter)) {
+ he->filtered |= (1 << HIST_FILTER__PARALLELISM);
+ return true;
+ }
+ return false;
+}
+
typedef bool (*filter_fn_t)(struct hists *hists, struct hist_entry *he);
static void hists__filter_by_type(struct hists *hists, int type, filter_fn_t filter)
@@ -2366,6 +2386,16 @@ void hists__filter_by_socket(struct hists *hists)
hists__filter_entry_by_socket);
}
+void hists__filter_by_parallelism(struct hists *hists)
+{
+ if (symbol_conf.report_hierarchy)
+ hists__filter_hierarchy(hists, HIST_FILTER__PARALLELISM,
+ hists->parallelism_filter);
+ else
+ hists__filter_by_type(hists, HIST_FILTER__PARALLELISM,
+ hists__filter_entry_by_parallelism);
+}
+
void events_stats__inc(struct events_stats *stats, u32 type)
{
++stats->nr_events[0];
@@ -2872,6 +2902,7 @@ int __hists__init(struct hists *hists, struct perf_hpp_list *hpp_list)
hists->entries = RB_ROOT_CACHED;
mutex_init(&hists->lock);
hists->socket_filter = -1;
+ hists->parallelism_filter = symbol_conf.parallelism_filter;
hists->hpp_list = hpp_list;
INIT_LIST_HEAD(&hists->hpp_formats);
return 0;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 4035106a74087..c2236e0d89f2a 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -31,6 +31,7 @@ enum hist_filter {
HIST_FILTER__HOST,
HIST_FILTER__SOCKET,
HIST_FILTER__C2C,
+ HIST_FILTER__PARALLELISM,
};
typedef u16 filter_mask_t;
@@ -112,6 +113,7 @@ struct hists {
const struct dso *dso_filter;
const char *uid_filter_str;
const char *symbol_filter_str;
+ unsigned long *parallelism_filter;
struct mutex lock;
struct hists_stats stats;
u64 event_stream;
@@ -388,11 +390,13 @@ void hists__filter_by_dso(struct hists *hists);
void hists__filter_by_thread(struct hists *hists);
void hists__filter_by_symbol(struct hists *hists);
void hists__filter_by_socket(struct hists *hists);
+void hists__filter_by_parallelism(struct hists *hists);
static inline bool hists__has_filter(struct hists *hists)
{
return hists->thread_filter || hists->dso_filter ||
- hists->symbol_filter_str || (hists->socket_filter > -1);
+ hists->symbol_filter_str || (hists->socket_filter > -1) ||
+ hists->parallelism_filter;
}
u16 hists__col_len(struct hists *hists, enum hist_column col);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 7eef43f5be360..3055496358ebb 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -900,6 +900,16 @@ sort__parallelism_cmp(struct hist_entry *left, struct hist_entry *right)
return right->parallelism - left->parallelism;
}
+static int hist_entry__parallelism_filter(struct hist_entry *he, int type, const void *arg)
+{
+ const unsigned long *parallelism_filter = arg;
+
+ if (type != HIST_FILTER__PARALLELISM)
+ return -1;
+
+ return test_bit(he->parallelism, parallelism_filter);
+}
+
static int hist_entry__parallelism_snprintf(struct hist_entry *he, char *bf,
size_t size, unsigned int width)
{
@@ -909,6 +919,7 @@ static int hist_entry__parallelism_snprintf(struct hist_entry *he, char *bf,
struct sort_entry sort_parallelism = {
.se_header = "Parallelism",
.se_cmp = sort__parallelism_cmp,
+ .se_filter = hist_entry__parallelism_filter,
.se_snprintf = hist_entry__parallelism_snprintf,
.se_width_idx = HISTC_PARALLELISM,
};
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 49b08adc6ee34..315f74b5bac06 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -18,6 +18,7 @@
#include "annotate.h"
#include "build-id.h"
#include "cap.h"
+#include "cpumap.h"
#include "dso.h"
#include "util.h" // lsdir()
#include "debug.h"
@@ -2471,6 +2472,36 @@ int symbol__annotation_init(void)
return 0;
}
+static int setup_parallelism_bitmap(void)
+{
+ struct perf_cpu_map *map;
+ struct perf_cpu cpu;
+ int i, err = -1;
+
+ if (symbol_conf.parallelism_list_str == NULL)
+ return 0;
+
+ map = perf_cpu_map__new(symbol_conf.parallelism_list_str);
+ if (map == NULL) {
+ pr_err("failed to parse parallelism filter list\n");
+ return -1;
+ }
+
+ bitmap_fill(symbol_conf.parallelism_filter, MAX_NR_CPUS + 1);
+ perf_cpu_map__for_each_cpu(cpu, i, map) {
+ if (cpu.cpu <= 0 || cpu.cpu > MAX_NR_CPUS) {
+ pr_err("Requested parallelism level %d is invalid.\n", cpu.cpu);
+ goto out_delete_map;
+ }
+ __clear_bit(cpu.cpu, symbol_conf.parallelism_filter);
+ }
+
+ err = 0;
+out_delete_map:
+ perf_cpu_map__put(map);
+ return err;
+}
+
int symbol__init(struct perf_env *env)
{
const char *symfs;
@@ -2490,6 +2521,9 @@ int symbol__init(struct perf_env *env)
return -1;
}
+ if (setup_parallelism_bitmap())
+ return -1;
+
if (setup_list(&symbol_conf.dso_list,
symbol_conf.dso_list_str, "dso") < 0)
return -1;
diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
index a9c51acc722fe..c5b2e56127e22 100644
--- a/tools/perf/util/symbol_conf.h
+++ b/tools/perf/util/symbol_conf.h
@@ -3,6 +3,8 @@
#define __PERF_SYMBOL_CONF 1
#include <stdbool.h>
+#include <linux/bitmap.h>
+#include "perf.h"
struct strlist;
struct intlist;
@@ -62,6 +64,7 @@ struct symbol_conf {
*pid_list_str,
*tid_list_str,
*sym_list_str,
+ *parallelism_list_str,
*col_width_list_str,
*bt_stop_list_str;
const char *addr2line_path;
@@ -82,6 +85,7 @@ struct symbol_conf {
int pad_output_len_dso;
int group_sort_idx;
int addr_range;
+ DECLARE_BITMAP(parallelism_filter, MAX_NR_CPUS + 1);
};
extern struct symbol_conf symbol_conf;
--
2.48.1.362.g079036d154-goog
* [PATCH v5 5/8] perf report: Add latency output field
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
` (3 preceding siblings ...)
2025-02-05 16:27 ` [PATCH v5 4/8] perf report: Add parallelism filter Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 6/8] perf report: Add --latency flag Dmitry Vyukov
` (3 subsequent siblings)
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
The latency output field is similar to overhead, but represents the overhead
for latency rather than CPU consumption. It is re-scaled from overhead by
dividing the sample weight by the parallelism level at the time of the sample.
It effectively models profiling with 1 sample taken per unit of wall-clock
time rather than per unit of CPU time.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes in v5:
- fixed formatting of latency field in --stdout mode
---
tools/perf/ui/browsers/hists.c | 27 ++++++++-----
tools/perf/ui/hist.c | 69 ++++++++++++++++++---------------
tools/perf/util/addr_location.h | 2 +
tools/perf/util/event.c | 6 +++
tools/perf/util/events_stats.h | 2 +
tools/perf/util/hist.c | 55 +++++++++++++++++++-------
tools/perf/util/hist.h | 12 ++++++
tools/perf/util/sort.c | 2 +
8 files changed, 120 insertions(+), 55 deletions(-)
diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index 49ba82bf33918..35c10509b797f 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -1226,7 +1226,7 @@ int __hpp__slsmg_color_printf(struct perf_hpp *hpp, const char *fmt, ...)
return ret;
}
-#define __HPP_COLOR_PERCENT_FN(_type, _field) \
+#define __HPP_COLOR_PERCENT_FN(_type, _field, _fmttype) \
static u64 __hpp_get_##_field(struct hist_entry *he) \
{ \
return he->stat._field; \
@@ -1238,10 +1238,10 @@ hist_browser__hpp_color_##_type(struct perf_hpp_fmt *fmt, \
struct hist_entry *he) \
{ \
return hpp__fmt(fmt, hpp, he, __hpp_get_##_field, " %*.2f%%", \
- __hpp__slsmg_color_printf, true); \
+ __hpp__slsmg_color_printf, _fmttype); \
}
-#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field) \
+#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field, _fmttype) \
static u64 __hpp_get_acc_##_field(struct hist_entry *he) \
{ \
return he->stat_acc->_field; \
@@ -1262,15 +1262,18 @@ hist_browser__hpp_color_##_type(struct perf_hpp_fmt *fmt, \
return ret; \
} \
return hpp__fmt(fmt, hpp, he, __hpp_get_acc_##_field, \
- " %*.2f%%", __hpp__slsmg_color_printf, true); \
+ " %*.2f%%", __hpp__slsmg_color_printf, \
+ _fmttype); \
}
-__HPP_COLOR_PERCENT_FN(overhead, period)
-__HPP_COLOR_PERCENT_FN(overhead_sys, period_sys)
-__HPP_COLOR_PERCENT_FN(overhead_us, period_us)
-__HPP_COLOR_PERCENT_FN(overhead_guest_sys, period_guest_sys)
-__HPP_COLOR_PERCENT_FN(overhead_guest_us, period_guest_us)
-__HPP_COLOR_ACC_PERCENT_FN(overhead_acc, period)
+__HPP_COLOR_PERCENT_FN(overhead, period, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(latency, latency, PERF_HPP_FMT_TYPE__LATENCY)
+__HPP_COLOR_PERCENT_FN(overhead_sys, period_sys, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_us, period_us, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_guest_sys, period_guest_sys, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_PERCENT_FN(overhead_guest_us, period_guest_us, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_ACC_PERCENT_FN(overhead_acc, period, PERF_HPP_FMT_TYPE__PERCENT)
+__HPP_COLOR_ACC_PERCENT_FN(latency_acc, latency, PERF_HPP_FMT_TYPE__LATENCY)
#undef __HPP_COLOR_PERCENT_FN
#undef __HPP_COLOR_ACC_PERCENT_FN
@@ -1279,6 +1282,8 @@ void hist_browser__init_hpp(void)
{
perf_hpp__format[PERF_HPP__OVERHEAD].color =
hist_browser__hpp_color_overhead;
+ perf_hpp__format[PERF_HPP__LATENCY].color =
+ hist_browser__hpp_color_latency;
perf_hpp__format[PERF_HPP__OVERHEAD_SYS].color =
hist_browser__hpp_color_overhead_sys;
perf_hpp__format[PERF_HPP__OVERHEAD_US].color =
@@ -1289,6 +1294,8 @@ void hist_browser__init_hpp(void)
hist_browser__hpp_color_overhead_guest_us;
perf_hpp__format[PERF_HPP__OVERHEAD_ACC].color =
hist_browser__hpp_color_overhead_acc;
+ perf_hpp__format[PERF_HPP__LATENCY_ACC].color =
+ hist_browser__hpp_color_latency_acc;
res_sample_init();
}
diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
index 34fda1d5eccb4..6de6309595f9e 100644
--- a/tools/perf/ui/hist.c
+++ b/tools/perf/ui/hist.c
@@ -27,9 +27,10 @@ static int __hpp__fmt_print(struct perf_hpp *hpp, struct hists *hists, u64 val,
int nr_samples, const char *fmt, int len,
hpp_snprint_fn print_fn, enum perf_hpp_fmt_type fmtype)
{
- if (fmtype == PERF_HPP_FMT_TYPE__PERCENT) {
+ if (fmtype == PERF_HPP_FMT_TYPE__PERCENT || fmtype == PERF_HPP_FMT_TYPE__LATENCY) {
double percent = 0.0;
- u64 total = hists__total_period(hists);
+ u64 total = fmtype == PERF_HPP_FMT_TYPE__PERCENT ? hists__total_period(hists) :
+ hists__total_latency(hists);
if (total)
percent = 100.0 * val / total;
@@ -128,7 +129,7 @@ int hpp__fmt(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
print_fn, fmtype);
}
- if (fmtype == PERF_HPP_FMT_TYPE__PERCENT)
+ if (fmtype == PERF_HPP_FMT_TYPE__PERCENT || fmtype == PERF_HPP_FMT_TYPE__LATENCY)
len -= 2; /* 2 for a space and a % sign */
else
len -= 1;
@@ -356,7 +357,7 @@ static int hpp_entry_scnprintf(struct perf_hpp *hpp, const char *fmt, ...)
return (ret >= ssize) ? (ssize - 1) : ret;
}
-#define __HPP_COLOR_PERCENT_FN(_type, _field) \
+#define __HPP_COLOR_PERCENT_FN(_type, _field, _fmttype) \
static u64 he_get_##_field(struct hist_entry *he) \
{ \
return he->stat._field; \
@@ -366,15 +367,15 @@ static int hpp__color_##_type(struct perf_hpp_fmt *fmt, \
struct perf_hpp *hpp, struct hist_entry *he) \
{ \
return hpp__fmt(fmt, hpp, he, he_get_##_field, " %*.2f%%", \
- hpp_color_scnprintf, PERF_HPP_FMT_TYPE__PERCENT); \
+ hpp_color_scnprintf, _fmttype); \
}
-#define __HPP_ENTRY_PERCENT_FN(_type, _field) \
+#define __HPP_ENTRY_PERCENT_FN(_type, _field, _fmttype) \
static int hpp__entry_##_type(struct perf_hpp_fmt *fmt, \
struct perf_hpp *hpp, struct hist_entry *he) \
{ \
return hpp__fmt(fmt, hpp, he, he_get_##_field, " %*.2f%%", \
- hpp_entry_scnprintf, PERF_HPP_FMT_TYPE__PERCENT); \
+ hpp_entry_scnprintf, _fmttype); \
}
#define __HPP_SORT_FN(_type, _field) \
@@ -384,7 +385,7 @@ static int64_t hpp__sort_##_type(struct perf_hpp_fmt *fmt __maybe_unused, \
return __hpp__sort(a, b, he_get_##_field); \
}
-#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field) \
+#define __HPP_COLOR_ACC_PERCENT_FN(_type, _field, _fmttype) \
static u64 he_get_acc_##_field(struct hist_entry *he) \
{ \
return he->stat_acc->_field; \
@@ -394,15 +395,15 @@ static int hpp__color_##_type(struct perf_hpp_fmt *fmt, \
struct perf_hpp *hpp, struct hist_entry *he) \
{ \
return hpp__fmt_acc(fmt, hpp, he, he_get_acc_##_field, " %*.2f%%", \
- hpp_color_scnprintf, PERF_HPP_FMT_TYPE__PERCENT); \
+ hpp_color_scnprintf, _fmttype); \
}
-#define __HPP_ENTRY_ACC_PERCENT_FN(_type, _field) \
+#define __HPP_ENTRY_ACC_PERCENT_FN(_type, _field, _fmttype) \
static int hpp__entry_##_type(struct perf_hpp_fmt *fmt, \
struct perf_hpp *hpp, struct hist_entry *he) \
{ \
return hpp__fmt_acc(fmt, hpp, he, he_get_acc_##_field, " %*.2f%%", \
- hpp_entry_scnprintf, PERF_HPP_FMT_TYPE__PERCENT); \
+ hpp_entry_scnprintf, _fmttype); \
}
#define __HPP_SORT_ACC_FN(_type, _field) \
@@ -453,14 +454,14 @@ static int64_t hpp__sort_##_type(struct perf_hpp_fmt *fmt __maybe_unused, \
}
-#define HPP_PERCENT_FNS(_type, _field) \
-__HPP_COLOR_PERCENT_FN(_type, _field) \
-__HPP_ENTRY_PERCENT_FN(_type, _field) \
+#define HPP_PERCENT_FNS(_type, _field, _fmttype) \
+__HPP_COLOR_PERCENT_FN(_type, _field, _fmttype) \
+__HPP_ENTRY_PERCENT_FN(_type, _field, _fmttype) \
__HPP_SORT_FN(_type, _field)
-#define HPP_PERCENT_ACC_FNS(_type, _field) \
-__HPP_COLOR_ACC_PERCENT_FN(_type, _field) \
-__HPP_ENTRY_ACC_PERCENT_FN(_type, _field) \
+#define HPP_PERCENT_ACC_FNS(_type, _field, _fmttype) \
+__HPP_COLOR_ACC_PERCENT_FN(_type, _field, _fmttype) \
+__HPP_ENTRY_ACC_PERCENT_FN(_type, _field, _fmttype) \
__HPP_SORT_ACC_FN(_type, _field)
#define HPP_RAW_FNS(_type, _field) \
@@ -471,12 +472,14 @@ __HPP_SORT_RAW_FN(_type, _field)
__HPP_ENTRY_AVERAGE_FN(_type, _field) \
__HPP_SORT_AVERAGE_FN(_type, _field)
-HPP_PERCENT_FNS(overhead, period)
-HPP_PERCENT_FNS(overhead_sys, period_sys)
-HPP_PERCENT_FNS(overhead_us, period_us)
-HPP_PERCENT_FNS(overhead_guest_sys, period_guest_sys)
-HPP_PERCENT_FNS(overhead_guest_us, period_guest_us)
-HPP_PERCENT_ACC_FNS(overhead_acc, period)
+HPP_PERCENT_FNS(overhead, period, PERF_HPP_FMT_TYPE__PERCENT)
+HPP_PERCENT_FNS(latency, latency, PERF_HPP_FMT_TYPE__LATENCY)
+HPP_PERCENT_FNS(overhead_sys, period_sys, PERF_HPP_FMT_TYPE__PERCENT)
+HPP_PERCENT_FNS(overhead_us, period_us, PERF_HPP_FMT_TYPE__PERCENT)
+HPP_PERCENT_FNS(overhead_guest_sys, period_guest_sys, PERF_HPP_FMT_TYPE__PERCENT)
+HPP_PERCENT_FNS(overhead_guest_us, period_guest_us, PERF_HPP_FMT_TYPE__PERCENT)
+HPP_PERCENT_ACC_FNS(overhead_acc, period, PERF_HPP_FMT_TYPE__PERCENT)
+HPP_PERCENT_ACC_FNS(latency_acc, latency, PERF_HPP_FMT_TYPE__LATENCY)
HPP_RAW_FNS(samples, nr_events)
HPP_RAW_FNS(period, period)
@@ -548,11 +551,13 @@ static bool hpp__equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
struct perf_hpp_fmt perf_hpp__format[] = {
HPP__COLOR_PRINT_FNS("Overhead", overhead, OVERHEAD),
+ HPP__COLOR_PRINT_FNS("Latency", latency, LATENCY),
HPP__COLOR_PRINT_FNS("sys", overhead_sys, OVERHEAD_SYS),
HPP__COLOR_PRINT_FNS("usr", overhead_us, OVERHEAD_US),
HPP__COLOR_PRINT_FNS("guest sys", overhead_guest_sys, OVERHEAD_GUEST_SYS),
HPP__COLOR_PRINT_FNS("guest usr", overhead_guest_us, OVERHEAD_GUEST_US),
HPP__COLOR_ACC_PRINT_FNS("Children", overhead_acc, OVERHEAD_ACC),
+ HPP__COLOR_ACC_PRINT_FNS("Latency", latency_acc, LATENCY_ACC),
HPP__PRINT_FNS("Samples", samples, SAMPLES),
HPP__PRINT_FNS("Period", period, PERIOD),
HPP__PRINT_FNS("Weight1", weight1, WEIGHT1),
@@ -601,6 +606,11 @@ static void fmt_free(struct perf_hpp_fmt *fmt)
fmt->free(fmt);
}
+static bool fmt_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
+{
+ return a->equal && a->equal(a, b);
+}
+
void perf_hpp__init(void)
{
int i;
@@ -671,30 +681,26 @@ static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
void perf_hpp__cancel_cumulate(void)
{
- struct perf_hpp_fmt *fmt, *acc, *ovh, *tmp;
+ struct perf_hpp_fmt *fmt, *acc, *ovh, *acc_lat, *tmp;
if (is_strict_order(field_order))
return;
ovh = &perf_hpp__format[PERF_HPP__OVERHEAD];
acc = &perf_hpp__format[PERF_HPP__OVERHEAD_ACC];
+ acc_lat = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
- if (acc->equal(acc, fmt)) {
+ if (fmt_equal(acc, fmt) || fmt_equal(acc_lat, fmt)) {
perf_hpp__column_unregister(fmt);
continue;
}
- if (ovh->equal(ovh, fmt))
+ if (fmt_equal(ovh, fmt))
fmt->name = "Overhead";
}
}
-static bool fmt_equal(struct perf_hpp_fmt *a, struct perf_hpp_fmt *b)
-{
- return a->equal && a->equal(a, b);
-}
-
void perf_hpp__setup_output_field(struct perf_hpp_list *list)
{
struct perf_hpp_fmt *fmt;
@@ -819,6 +825,7 @@ void perf_hpp__reset_width(struct perf_hpp_fmt *fmt, struct hists *hists)
switch (fmt->idx) {
case PERF_HPP__OVERHEAD:
+ case PERF_HPP__LATENCY:
case PERF_HPP__OVERHEAD_SYS:
case PERF_HPP__OVERHEAD_US:
case PERF_HPP__OVERHEAD_ACC:
diff --git a/tools/perf/util/addr_location.h b/tools/perf/util/addr_location.h
index f83d74e370b2f..663e9a55d8ed3 100644
--- a/tools/perf/util/addr_location.h
+++ b/tools/perf/util/addr_location.h
@@ -24,6 +24,8 @@ struct addr_location {
s32 socket;
/* Same as machine.parallelism but within [1, nr_cpus]. */
int parallelism;
+ /* See he_stat.latency. */
+ u64 latency;
};
void addr_location__init(struct addr_location *al);
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index bea9698e0c6bf..36e576f141473 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -771,6 +771,12 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
al->parallelism = min(max(machine->parallelism, 1), machine__nr_cpus_avail(machine));
if (test_bit(al->parallelism, symbol_conf.parallelism_filter))
al->filtered |= (1 << HIST_FILTER__PARALLELISM);
+ /*
+ * Multiply it by some const to avoid precision loss or dealing
+ * with floats. The multiplier does not matter otherwise since
+ * we only print it as percents.
+ */
+ al->latency = sample->period * 1000 / al->parallelism;
if (al->map) {
if (symbol_conf.dso_list &&
diff --git a/tools/perf/util/events_stats.h b/tools/perf/util/events_stats.h
index eabd7913c3092..dcff697ed2529 100644
--- a/tools/perf/util/events_stats.h
+++ b/tools/perf/util/events_stats.h
@@ -57,6 +57,8 @@ struct events_stats {
struct hists_stats {
u64 total_period;
u64 total_non_filtered_period;
+ u64 total_latency;
+ u64 total_non_filtered_latency;
u32 nr_samples;
u32 nr_non_filtered_samples;
u32 nr_lost_samples;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 446342246f5ee..a29324e33ed04 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -305,9 +305,10 @@ static long hist_time(unsigned long htime)
return htime;
}
-static void he_stat__add_period(struct he_stat *he_stat, u64 period)
+static void he_stat__add_period(struct he_stat *he_stat, u64 period, u64 latency)
{
he_stat->period += period;
+ he_stat->latency += latency;
he_stat->nr_events += 1;
}
@@ -322,6 +323,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
dest->weight2 += src->weight2;
dest->weight3 += src->weight3;
dest->nr_events += src->nr_events;
+ dest->latency += src->latency;
}
static void he_stat__decay(struct he_stat *he_stat)
@@ -331,6 +333,7 @@ static void he_stat__decay(struct he_stat *he_stat)
he_stat->weight1 = (he_stat->weight1 * 7) / 8;
he_stat->weight2 = (he_stat->weight2 * 7) / 8;
he_stat->weight3 = (he_stat->weight3 * 7) / 8;
+ he_stat->latency = (he_stat->latency * 7) / 8;
}
static void hists__delete_entry(struct hists *hists, struct hist_entry *he);
@@ -338,7 +341,7 @@ static void hists__delete_entry(struct hists *hists, struct hist_entry *he);
static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
{
u64 prev_period = he->stat.period;
- u64 diff;
+ u64 prev_latency = he->stat.latency;
if (prev_period == 0)
return true;
@@ -348,12 +351,16 @@ static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
he_stat__decay(he->stat_acc);
decay_callchain(he->callchain);
- diff = prev_period - he->stat.period;
-
if (!he->depth) {
- hists->stats.total_period -= diff;
- if (!he->filtered)
- hists->stats.total_non_filtered_period -= diff;
+ u64 period_diff = prev_period - he->stat.period;
+ u64 latency_diff = prev_latency - he->stat.latency;
+
+ hists->stats.total_period -= period_diff;
+ hists->stats.total_latency -= latency_diff;
+ if (!he->filtered) {
+ hists->stats.total_non_filtered_period -= period_diff;
+ hists->stats.total_non_filtered_latency -= latency_diff;
+ }
}
if (!he->leaf) {
@@ -368,7 +375,7 @@ static bool hists__decay_entry(struct hists *hists, struct hist_entry *he)
}
}
- return he->stat.period == 0;
+ return he->stat.period == 0 && he->stat.latency == 0;
}
static void hists__delete_entry(struct hists *hists, struct hist_entry *he)
@@ -594,14 +601,17 @@ static filter_mask_t symbol__parent_filter(const struct symbol *parent)
return 0;
}
-static void hist_entry__add_callchain_period(struct hist_entry *he, u64 period)
+static void hist_entry__add_callchain_period(struct hist_entry *he, u64 period, u64 latency)
{
if (!hist_entry__has_callchains(he) || !symbol_conf.use_callchain)
return;
he->hists->callchain_period += period;
- if (!he->filtered)
+ he->hists->callchain_latency += latency;
+ if (!he->filtered) {
he->hists->callchain_non_filtered_period += period;
+ he->hists->callchain_non_filtered_latency += latency;
+ }
}
static struct hist_entry *hists__findnew_entry(struct hists *hists,
@@ -614,6 +624,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
struct hist_entry *he;
int64_t cmp;
u64 period = entry->stat.period;
+ u64 latency = entry->stat.latency;
bool leftmost = true;
p = &hists->entries_in->rb_root.rb_node;
@@ -632,10 +643,10 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
if (!cmp) {
if (sample_self) {
he_stat__add_stat(&he->stat, &entry->stat);
- hist_entry__add_callchain_period(he, period);
+ hist_entry__add_callchain_period(he, period, latency);
}
if (symbol_conf.cumulate_callchain)
- he_stat__add_period(he->stat_acc, period);
+ he_stat__add_period(he->stat_acc, period, latency);
block_info__delete(entry->block_info);
@@ -672,7 +683,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
return NULL;
if (sample_self)
- hist_entry__add_callchain_period(he, period);
+ hist_entry__add_callchain_period(he, period, latency);
hists->nr_entries++;
rb_link_node(&he->rb_node_in, parent, p);
@@ -751,6 +762,7 @@ __hists__add_entry(struct hists *hists,
.weight1 = sample->weight,
.weight2 = sample->ins_lat,
.weight3 = sample->p_stage_cyc,
+ .latency = al->latency,
},
.parent = sym_parent,
.filtered = symbol__parent_filter(sym_parent) | al->filtered,
@@ -1768,12 +1780,14 @@ static void hists__reset_filter_stats(struct hists *hists)
{
hists->nr_non_filtered_entries = 0;
hists->stats.total_non_filtered_period = 0;
+ hists->stats.total_non_filtered_latency = 0;
}
void hists__reset_stats(struct hists *hists)
{
hists->nr_entries = 0;
hists->stats.total_period = 0;
+ hists->stats.total_latency = 0;
hists__reset_filter_stats(hists);
}
@@ -1782,6 +1796,7 @@ static void hists__inc_filter_stats(struct hists *hists, struct hist_entry *h)
{
hists->nr_non_filtered_entries++;
hists->stats.total_non_filtered_period += h->stat.period;
+ hists->stats.total_non_filtered_latency += h->stat.latency;
}
void hists__inc_stats(struct hists *hists, struct hist_entry *h)
@@ -1791,6 +1806,7 @@ void hists__inc_stats(struct hists *hists, struct hist_entry *h)
hists->nr_entries++;
hists->stats.total_period += h->stat.period;
+ hists->stats.total_latency += h->stat.latency;
}
static void hierarchy_recalc_total_periods(struct hists *hists)
@@ -1802,6 +1818,8 @@ static void hierarchy_recalc_total_periods(struct hists *hists)
hists->stats.total_period = 0;
hists->stats.total_non_filtered_period = 0;
+ hists->stats.total_latency = 0;
+ hists->stats.total_non_filtered_latency = 0;
/*
* recalculate total period using top-level entries only
@@ -1813,8 +1831,11 @@ static void hierarchy_recalc_total_periods(struct hists *hists)
node = rb_next(node);
hists->stats.total_period += he->stat.period;
- if (!he->filtered)
+ hists->stats.total_latency += he->stat.latency;
+ if (!he->filtered) {
hists->stats.total_non_filtered_period += he->stat.period;
+ hists->stats.total_non_filtered_latency += he->stat.latency;
+ }
}
}
@@ -2791,6 +2812,12 @@ u64 hists__total_period(struct hists *hists)
hists->stats.total_period;
}
+u64 hists__total_latency(struct hists *hists)
+{
+ return symbol_conf.filter_relative ? hists->stats.total_non_filtered_latency :
+ hists->stats.total_latency;
+}
+
int __hists__scnprintf_title(struct hists *hists, char *bf, size_t size, bool show_freq)
{
char unit;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index c2236e0d89f2a..91159f16c60b2 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -109,6 +109,8 @@ struct hists {
u64 nr_non_filtered_entries;
u64 callchain_period;
u64 callchain_non_filtered_period;
+ u64 callchain_latency;
+ u64 callchain_non_filtered_latency;
struct thread *thread_filter;
const struct dso *dso_filter;
const char *uid_filter_str;
@@ -170,6 +172,12 @@ struct res_sample {
struct he_stat {
u64 period;
+ /*
+ * Period re-scaled from CPU time to wall-clock time (divided by the
+ * parallelism at the time of the sample). This represents effect of
+ * the event on latency rather than CPU consumption.
+ */
+ u64 latency;
u64 period_sys;
u64 period_us;
u64 period_guest_sys;
@@ -374,6 +382,7 @@ void hists__output_recalc_col_len(struct hists *hists, int max_rows);
struct hist_entry *hists__get_entry(struct hists *hists, int idx);
u64 hists__total_period(struct hists *hists);
+u64 hists__total_latency(struct hists *hists);
void hists__reset_stats(struct hists *hists);
void hists__inc_stats(struct hists *hists, struct hist_entry *h);
void hists__inc_nr_events(struct hists *hists);
@@ -555,11 +564,13 @@ extern struct perf_hpp_fmt perf_hpp__format[];
enum {
/* Matches perf_hpp__format array. */
PERF_HPP__OVERHEAD,
+ PERF_HPP__LATENCY,
PERF_HPP__OVERHEAD_SYS,
PERF_HPP__OVERHEAD_US,
PERF_HPP__OVERHEAD_GUEST_SYS,
PERF_HPP__OVERHEAD_GUEST_US,
PERF_HPP__OVERHEAD_ACC,
+ PERF_HPP__LATENCY_ACC,
PERF_HPP__SAMPLES,
PERF_HPP__PERIOD,
PERF_HPP__WEIGHT1,
@@ -615,6 +626,7 @@ void hists__reset_column_width(struct hists *hists);
enum perf_hpp_fmt_type {
PERF_HPP_FMT_TYPE__RAW,
PERF_HPP_FMT_TYPE__PERCENT,
+ PERF_HPP_FMT_TYPE__LATENCY,
PERF_HPP_FMT_TYPE__AVERAGE,
};
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 3055496358ebb..bc4c3acfe7552 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -2628,11 +2628,13 @@ struct hpp_dimension {
static struct hpp_dimension hpp_sort_dimensions[] = {
DIM(PERF_HPP__OVERHEAD, "overhead"),
+ DIM(PERF_HPP__LATENCY, "latency"),
DIM(PERF_HPP__OVERHEAD_SYS, "overhead_sys"),
DIM(PERF_HPP__OVERHEAD_US, "overhead_us"),
DIM(PERF_HPP__OVERHEAD_GUEST_SYS, "overhead_guest_sys"),
DIM(PERF_HPP__OVERHEAD_GUEST_US, "overhead_guest_us"),
DIM(PERF_HPP__OVERHEAD_ACC, "overhead_children"),
+ DIM(PERF_HPP__LATENCY_ACC, "latency_children"),
DIM(PERF_HPP__SAMPLES, "sample"),
DIM(PERF_HPP__PERIOD, "period"),
DIM(PERF_HPP__WEIGHT1, "weight1"),
--
2.48.1.362.g079036d154-goog
* [PATCH v5 6/8] perf report: Add --latency flag
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
` (4 preceding siblings ...)
2025-02-05 16:27 ` [PATCH v5 5/8] perf report: Add latency output field Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-07 3:44 ` Namhyung Kim
2025-02-07 3:53 ` Namhyung Kim
2025-02-05 16:27 ` [PATCH v5 7/8] perf report: Add latency and parallelism profiling documentation Dmitry Vyukov
` (2 subsequent siblings)
8 siblings, 2 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
Add a record/report --latency flag that allows capturing and showing
latency-centric profiles rather than the default CPU-consumption-centric
profiles. For latency profiles, record captures context switch events, and
report shows Latency as the first column.
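A typical workflow with the new flags: record with 'perf record --latency --
<workload>' and view the latency-centric profile with 'perf report --latency';
without the report-side flag the same data is still presented as the default
CPU-consumption-centric profile.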
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes in v5:
- added description of --latency flag in Documentation flags
---
tools/perf/Documentation/perf-record.txt | 4 +++
tools/perf/Documentation/perf-report.txt | 5 +++
tools/perf/builtin-record.c | 20 ++++++++++++
tools/perf/builtin-report.c | 32 +++++++++++++++---
tools/perf/ui/hist.c | 41 +++++++++++++++++++-----
tools/perf/util/hist.h | 1 +
tools/perf/util/sort.c | 33 +++++++++++++++----
tools/perf/util/sort.h | 2 +-
tools/perf/util/symbol_conf.h | 4 ++-
9 files changed, 122 insertions(+), 20 deletions(-)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 80686d590de24..c7fc1ba265e27 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -227,6 +227,10 @@ OPTIONS
'--filter' exists, the new filter expression will be combined with
them by '&&'.
+--latency::
+ Enable data collection for latency profiling.
+ Use perf report --latency for latency-centric profile.
+
-a::
--all-cpus::
System-wide collection from all CPUs (default if no target is specified).
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 87f8645194062..66794131aec48 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -68,6 +68,11 @@ OPTIONS
--hide-unresolved::
Only display entries resolved to a symbol.
+--latency::
+ Show latency-centric profile rather than the default
+ CPU-consumption-centric profile
+ (requires perf record --latency flag).
+
-s::
--sort=::
Sort histogram entries by given key(s) - multiple keys can be specified
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 5db1aedf48df9..e219639ac401b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -161,6 +161,7 @@ struct record {
struct evlist *sb_evlist;
pthread_t thread_id;
int realtime_prio;
+ bool latency;
bool switch_output_event_set;
bool no_buildid;
bool no_buildid_set;
@@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
parse_events_option),
OPT_CALLBACK(0, "filter", &record.evlist, "filter",
"event filter", parse_filter),
+ OPT_BOOLEAN(0, "latency", &record.latency,
+ "Enable data collection for latency profiling.\n"
+ "\t\t\t Use perf report --latency for latency-centric profile."),
OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
NULL, "don't record events from perf itself",
exclude_perf),
@@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
}
+ if (record.latency) {
+ /*
+ * There is no fundamental reason why latency profiling
+ * can't work for system-wide mode, but exact semantics
+ * and details are to be defined.
+ * See the following thread for details:
+ * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
+ */
+ if (record.opts.target.system_wide) {
+ pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
+ err = -EINVAL;
+ goto out_opts;
+ }
+ record.opts.record_switch_events = true;
+ }
+
if (rec->buildid_mmap) {
if (!perf_can_record_build_id()) {
pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 2a19abdc869a1..69de6dbefecfa 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -112,6 +112,8 @@ struct report {
u64 nr_entries;
u64 queue_size;
u64 total_cycles;
+ u64 total_samples;
+ u64 singlethreaded_samples;
int socket_filter;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
struct branch_type_stat brtype_stat;
@@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
&rep->total_cycles, evsel);
}
+ rep->total_samples++;
+ if (al.parallelism == 1)
+ rep->singlethreaded_samples++;
+
ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
if (ret < 0)
pr_debug("problem adding hist entry, skipping event\n");
@@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
return ret;
}
+ /* Don't show Latency column for non-parallel profiles by default. */
+ if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
+ !symbol_conf.prefer_latency)
+ perf_hpp__cancel_latency();
+
evlist__check_mem_load_aux(session->evlist);
if (rep->stats_mode)
@@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
"Disable raw trace ordering"),
OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
"Do not display empty (or dummy) events in the output"),
+ OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
+ "Show latency-centric profile rather than the default\n"
+ "\t\t\t CPU-consumption-centric profile\n"
+ "\t\t\t (requires perf record --latency flag)."),
OPT_END()
};
struct perf_data data = {
@@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
symbol_conf.annotate_data_sample = true;
}
+ symbol_conf.enable_latency = true;
if (report.disable_order || !perf_session__has_switch_events(session)) {
if (symbol_conf.parallelism_list_str ||
- (sort_order && strstr(sort_order, "parallelism")) ||
- (field_order && strstr(field_order, "parallelism"))) {
+ symbol_conf.prefer_latency ||
+ (sort_order && (strstr(sort_order, "latency") ||
+ strstr(sort_order, "parallelism"))) ||
+ (field_order && (strstr(field_order, "latency") ||
+ strstr(field_order, "parallelism")))) {
if (report.disable_order)
- ui__error("Use of parallelism is incompatible with --disable-order.\n");
+ ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
else
- ui__error("Use of parallelism requires --switch-events during record.\n");
+ ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
return -1;
}
+ /*
+ * If user did not ask for anything related to
+ * latency/parallelism explicitly, just don't show it.
+ */
+ symbol_conf.enable_latency = false;
}
if (sort_order && strstr(sort_order, "ipc")) {
diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
index 6de6309595f9e..b453f8eb579cc 100644
--- a/tools/perf/ui/hist.c
+++ b/tools/perf/ui/hist.c
@@ -632,27 +632,36 @@ void perf_hpp__init(void)
return;
if (symbol_conf.cumulate_callchain) {
- hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
+ /* Use idempotent addition to avoid more complex logic. */
+ if (symbol_conf.prefer_latency)
+ hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
+ hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
+ if (symbol_conf.enable_latency)
+ hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
}
- hpp_dimension__add_output(PERF_HPP__OVERHEAD);
+ if (symbol_conf.prefer_latency)
+ hpp_dimension__add_output(PERF_HPP__LATENCY, true);
+ hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
+ if (symbol_conf.enable_latency)
+ hpp_dimension__add_output(PERF_HPP__LATENCY, true);
if (symbol_conf.show_cpu_utilization) {
- hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
- hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
+ hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
+ hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
if (perf_guest) {
- hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
- hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
+ hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
+ hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
}
}
if (symbol_conf.show_nr_samples)
- hpp_dimension__add_output(PERF_HPP__SAMPLES);
+ hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
if (symbol_conf.show_total_period)
- hpp_dimension__add_output(PERF_HPP__PERIOD);
+ hpp_dimension__add_output(PERF_HPP__PERIOD, false);
}
void perf_hpp_list__column_register(struct perf_hpp_list *list,
@@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
}
}
+void perf_hpp__cancel_latency(void)
+{
+ struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
+
+ if (is_strict_order(field_order) || is_strict_order(sort_order))
+ return;
+
+ lat = &perf_hpp__format[PERF_HPP__LATENCY];
+ acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
+
+ perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
+ if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
+ perf_hpp__column_unregister(fmt);
+ }
+}
+
void perf_hpp__setup_output_field(struct perf_hpp_list *list)
{
struct perf_hpp_fmt *fmt;
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 91159f16c60b2..29d4c7a3d1747 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -582,6 +582,7 @@ enum {
void perf_hpp__init(void);
void perf_hpp__cancel_cumulate(void);
+void perf_hpp__cancel_latency(void);
void perf_hpp__setup_output_field(struct perf_hpp_list *list);
void perf_hpp__reset_output_field(struct perf_hpp_list *list);
void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index bc4c3acfe7552..2b6023de7a53a 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -2622,6 +2622,7 @@ struct hpp_dimension {
const char *name;
struct perf_hpp_fmt *fmt;
int taken;
+ int was_taken;
};
#define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
@@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
return -1;
hd->taken = 1;
+ hd->was_taken = 1;
perf_hpp_list__register_sort_field(list, fmt);
return 0;
}
@@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
return 0;
}
-int hpp_dimension__add_output(unsigned col)
+int hpp_dimension__add_output(unsigned col, bool implicit)
{
+ struct hpp_dimension *hd;
+
BUG_ON(col >= PERF_HPP__MAX_INDEX);
- return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
+ hd = &hpp_sort_dimensions[col];
+ if (implicit && !hd->was_taken)
+ return 0;
+ return __hpp_dimension__add_output(&perf_hpp_list, hd);
}
int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
@@ -3809,10 +3816,24 @@ static char *setup_overhead(char *keys)
if (sort__mode == SORT_MODE__DIFF)
return keys;
- keys = prefix_if_not_in("overhead", keys);
-
- if (symbol_conf.cumulate_callchain)
- keys = prefix_if_not_in("overhead_children", keys);
+ if (symbol_conf.prefer_latency) {
+ keys = prefix_if_not_in("overhead", keys);
+ keys = prefix_if_not_in("latency", keys);
+ if (symbol_conf.cumulate_callchain) {
+ keys = prefix_if_not_in("overhead_children", keys);
+ keys = prefix_if_not_in("latency_children", keys);
+ }
+ } else if (!keys || (!strstr(keys, "overhead") &&
+ !strstr(keys, "latency"))) {
+ if (symbol_conf.enable_latency)
+ keys = prefix_if_not_in("latency", keys);
+ keys = prefix_if_not_in("overhead", keys);
+ if (symbol_conf.cumulate_callchain) {
+ if (symbol_conf.enable_latency)
+ keys = prefix_if_not_in("latency_children", keys);
+ keys = prefix_if_not_in("overhead_children", keys);
+ }
+ }
return keys;
}
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 11fb15f914093..180d36a2bea35 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -141,7 +141,7 @@ int report_parse_ignore_callees_opt(const struct option *opt, const char *arg, i
bool is_strict_order(const char *order);
-int hpp_dimension__add_output(unsigned col);
+int hpp_dimension__add_output(unsigned col, bool implicit);
void reset_dimensions(void);
int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
struct evlist *evlist,
diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
index c5b2e56127e22..cd9aa82c7d5ad 100644
--- a/tools/perf/util/symbol_conf.h
+++ b/tools/perf/util/symbol_conf.h
@@ -49,7 +49,9 @@ struct symbol_conf {
keep_exited_threads,
annotate_data_member,
annotate_data_sample,
- skip_empty;
+ skip_empty,
+ enable_latency,
+ prefer_latency;
const char *vmlinux_name,
*kallsyms_name,
*source_prefix,
--
2.48.1.362.g079036d154-goog
* [PATCH v5 7/8] perf report: Add latency and parallelism profiling documentation
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
` (5 preceding siblings ...)
2025-02-05 16:27 ` [PATCH v5 6/8] perf report: Add --latency flag Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 8/8] perf hist: Shrink struct hist_entry size Dmitry Vyukov
2025-02-06 18:30 ` [PATCH v5 0/8] perf report: Add latency and parallelism profiling Andi Kleen
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
Describe latency and parallelism profiling, the related flags, and the
differences from the CPU-consumption-centric profiling that was the only
supported mode until now.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
.../callchain-overhead-calculation.txt | 5 +-
.../cpu-and-latency-overheads.txt | 85 +++++++++++++++++++
tools/perf/Documentation/perf-report.txt | 49 +++++++----
tools/perf/Documentation/tips.txt | 3 +
4 files changed, 123 insertions(+), 19 deletions(-)
diff --git a/tools/perf/Documentation/callchain-overhead-calculation.txt b/tools/perf/Documentation/callchain-overhead-calculation.txt
index 1a757927195ed..e0202bf5bd1a0 100644
--- a/tools/perf/Documentation/callchain-overhead-calculation.txt
+++ b/tools/perf/Documentation/callchain-overhead-calculation.txt
@@ -1,7 +1,8 @@
Overhead calculation
--------------------
-The overhead can be shown in two columns as 'Children' and 'Self' when
-perf collects callchains. The 'self' overhead is simply calculated by
+The CPU overhead can be shown in two columns as 'Children' and 'Self'
+when perf collects callchains (and corresponding 'Wall' columns for
+wall-clock overhead). The 'self' overhead is simply calculated by
adding all period values of the entry - usually a function (symbol).
This is the value that perf shows traditionally and sum of all the
'self' overhead values should be 100%.
diff --git a/tools/perf/Documentation/cpu-and-latency-overheads.txt b/tools/perf/Documentation/cpu-and-latency-overheads.txt
new file mode 100644
index 0000000000000..3b6d637054651
--- /dev/null
+++ b/tools/perf/Documentation/cpu-and-latency-overheads.txt
@@ -0,0 +1,85 @@
+CPU and latency overheads
+-------------------------
+There are two notions of time: wall-clock time and CPU time.
+For a single-threaded program, or a program running on a single-core machine,
+these notions are the same. However, for a multi-threaded/multi-process program
+running on a multi-core machine, these notions are significantly different.
+Each second of wall-clock time we have number-of-cores seconds of CPU time.
+Perf can measure overhead for both of these times (shown in 'overhead' and
+'latency' columns for CPU and wall-clock time correspondingly).
+
+Optimizing CPU overhead is useful to improve 'throughput', while optimizing
+latency overhead is useful to improve 'latency'. It's important to understand
+which one is useful in a concrete situation at hand. For example, the former
+may be useful to improve max throughput of a CI build server that runs on 100%
+CPU utilization, while the latter may be useful to improve user-perceived
+latency of a single interactive program build.
+These overheads may be significantly different in some cases. For example,
+consider a program that executes function 'foo' for 9 seconds with 1 thread,
+and then executes function 'bar' for 1 second with 128 threads (consumes
+128 seconds of CPU time). The CPU overhead is: 'foo' - 6.6%, 'bar' - 93.4%.
+While the latency overhead is: 'foo' - 90%, 'bar' - 10%. If we try to optimize
+running time of the program looking at the (wrong in this case) CPU overhead,
+we would concentrate on the function 'bar', but it can yield only 10% running
+time improvement at best.
+
+By default, perf shows only CPU overhead. To show latency overhead, use
+'perf record --latency' and 'perf report':
+
+-----------------------------------
+Overhead Latency Command
+ 93.88% 25.79% cc1
+ 1.90% 39.87% gzip
+ 0.99% 10.16% dpkg-deb
+ 0.57% 1.00% as
+ 0.40% 0.46% sh
+-----------------------------------
+
+To sort by latency overhead, use 'perf report --latency':
+
+-----------------------------------
+Latency Overhead Command
+ 39.87% 1.90% gzip
+ 25.79% 93.88% cc1
+ 10.16% 0.99% dpkg-deb
+ 4.17% 0.29% git
+ 2.81% 0.11% objtool
+-----------------------------------
+
+To get insight into the difference between the overheads, you may check
+parallelization histogram with '--sort=latency,parallelism,comm,symbol --hierarchy'
+flags. It shows fraction of (wall-clock) time the workload utilizes different
+numbers of cores ('Parallelism' column). For example, in the following case
+the workload utilizes only 1 core most of the time, but also has some
+highly-parallel phases, which explains significant difference between
+CPU and wall-clock overheads:
+
+-----------------------------------
+ Latency Overhead Parallelism / Command / Symbol
++ 56.98% 2.29% 1
++ 16.94% 1.36% 2
++ 4.00% 20.13% 125
++ 3.66% 18.25% 124
++ 3.48% 17.66% 126
++ 3.26% 0.39% 3
++ 2.61% 12.93% 123
+-----------------------------------
+
+By expanding corresponding lines, you may see what commands/functions run
+at the given parallelism level:
+
+-----------------------------------
+ Latency Overhead Parallelism / Command / Symbol
+- 56.98% 2.29% 1
+ 32.80% 1.32% gzip
+ 4.46% 0.18% cc1
+ 2.81% 0.11% objtool
+ 2.43% 0.10% dpkg-source
+ 2.22% 0.09% ld
+ 2.10% 0.08% dpkg-genchanges
+-----------------------------------
+
+To see the normal function-level profile for particular parallelism levels
+(number of threads actively running on CPUs), you may use '--parallelism'
+filter. For example, to see the profile only for low parallelism phases
+of a workload use '--latency --parallelism=1-2' flags.
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 66794131aec48..3376c47105750 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -44,7 +44,7 @@ OPTIONS
--comms=::
Only consider symbols in these comms. CSV that understands
file://filename entries. This option will affect the percentage of
- the overhead column. See --percentage for more info.
+ the overhead and latency columns. See --percentage for more info.
--pid=::
Only show events for given process ID (comma separated list).
@@ -54,12 +54,12 @@ OPTIONS
--dsos=::
Only consider symbols in these dsos. CSV that understands
file://filename entries. This option will affect the percentage of
- the overhead column. See --percentage for more info.
+ the overhead and latency columns. See --percentage for more info.
-S::
--symbols=::
Only consider these symbols. CSV that understands
file://filename entries. This option will affect the percentage of
- the overhead column. See --percentage for more info.
+ the overhead and latency columns. See --percentage for more info.
--symbol-filter=::
Only show symbols that match (partially) with this filter.
@@ -68,6 +68,16 @@ OPTIONS
--hide-unresolved::
Only display entries resolved to a symbol.
+--parallelism::
+ Only consider these parallelism levels. Parallelism level is the number
+ of threads that actively run on CPUs at the time of sample. The flag
+ accepts single number, comma-separated list, and ranges (for example:
+ "1", "7,8", "1,64-128"). This is useful in understanding what a program
+ is doing during sequential/low-parallelism phases as compared to
+ high-parallelism phases. This option will affect the percentage of
+ the overhead and latency columns. See --percentage for more info.
+ Also see the `CPU and latency overheads' section for more details.
+
--latency::
Show latency-centric profile rather than the default
CPU-consumption-centric profile
@@ -92,6 +102,7 @@ OPTIONS
entries are displayed as "[other]".
- cpu: cpu number the task ran at the time of sample
- socket: processor socket number the task ran at the time of sample
+ - parallelism: number of running threads at the time of sample
- srcline: filename and line number executed at the time of sample. The
DWARF debugging info must be provided.
- srcfile: file name of the source file of the samples. Requires dwarf
@@ -102,12 +113,14 @@ OPTIONS
- cgroup_id: ID derived from cgroup namespace device and inode numbers.
- cgroup: cgroup pathname in the cgroupfs.
- transaction: Transaction abort flags.
- - overhead: Overhead percentage of sample
- - overhead_sys: Overhead percentage of sample running in system mode
- - overhead_us: Overhead percentage of sample running in user mode
- - overhead_guest_sys: Overhead percentage of sample running in system mode
+ - overhead: CPU overhead percentage of sample.
+ - latency: latency (wall-clock) overhead percentage of sample.
+ See the `CPU and latency overheads' section for more details.
+ - overhead_sys: CPU overhead percentage of sample running in system mode
+ - overhead_us: CPU overhead percentage of sample running in user mode
+ - overhead_guest_sys: CPU overhead percentage of sample running in system mode
on guest machine
- - overhead_guest_us: Overhead percentage of sample running in user mode on
+ - overhead_guest_us: CPU overhead percentage of sample running in user mode on
guest machine
- sample: Number of sample
- period: Raw number of event count of sample
@@ -130,8 +143,8 @@ OPTIONS
- weight2: Average value of event specific weight (2nd field of weight_struct).
- weight3: Average value of event specific weight (3rd field of weight_struct).
- By default, comm, dso and symbol keys are used.
- (i.e. --sort comm,dso,symbol)
+ By default, overhead, comm, dso and symbol keys are used.
+ (i.e. --sort overhead,comm,dso,symbol).
If --branch-stack option is used, following sort keys are also
available:
@@ -206,9 +219,9 @@ OPTIONS
--fields=::
Specify output field - multiple keys can be specified in CSV format.
Following fields are available:
- overhead, overhead_sys, overhead_us, overhead_children, sample, period,
- weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat. The
- last 3 names are alias for the corresponding weights. When the weight
+ overhead, latency, overhead_sys, overhead_us, overhead_children, sample,
+ period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat.
+ The last 3 names are alias for the corresponding weights. When the weight
fields are used, they will show the average value of the weight.
Also it can contain any sort key(s).
@@ -294,7 +307,7 @@ OPTIONS
Accumulate callchain of children to parent entry so that then can
show up in the output. The output will have a new "Children" column
and will be sorted on the data. It requires callchains are recorded.
- See the `overhead calculation' section for more details. Enabled by
+ See the `Overhead calculation' section for more details. Enabled by
default, disable with --no-children.
--max-stack::
@@ -447,9 +460,9 @@ OPTIONS
--call-graph option for details.
--percentage::
- Determine how to display the overhead percentage of filtered entries.
- Filters can be applied by --comms, --dsos and/or --symbols options and
- Zoom operations on the TUI (thread, dso, etc).
+ Determine how to display the CPU and latency overhead percentage
+ of filtered entries. Filters can be applied by --comms, --dsos, --symbols
+ and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc).
"relative" means it's relative to filtered entries only so that the
sum of shown entries will be always 100%. "absolute" means it retains
@@ -632,6 +645,8 @@ include::itrace.txt[]
--skip-empty::
Do not print 0 results in the --stat output.
+include::cpu-and-latency-overheads.txt[]
+
include::callchain-overhead-calculation.txt[]
SEE ALSO
diff --git a/tools/perf/Documentation/tips.txt b/tools/perf/Documentation/tips.txt
index 67b326ba00407..f6f71e70ff2cb 100644
--- a/tools/perf/Documentation/tips.txt
+++ b/tools/perf/Documentation/tips.txt
@@ -62,3 +62,6 @@ To show context switches in perf report sample context add --switch-events to pe
To show time in nanoseconds in record/report add --ns
To compare hot regions in two workloads use perf record -b -o file ... ; perf diff --stream file1 file2
To compare scalability of two workload samples use perf diff -c ratio file1 file2
+For latency profiling, try: perf record/report --latency
+For parallelism histogram, try: perf report --hierarchy --sort latency,parallelism,comm,symbol
+To analyze particular parallelism levels, try: perf report --latency --parallelism=32-64
--
2.48.1.362.g079036d154-goog
* [PATCH v5 8/8] perf hist: Shrink struct hist_entry size
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
` (6 preceding siblings ...)
2025-02-05 16:27 ` [PATCH v5 7/8] perf report: Add latency and parallelism profiling documentation Dmitry Vyukov
@ 2025-02-05 16:27 ` Dmitry Vyukov
2025-02-06 18:30 ` [PATCH v5 0/8] perf report: Add latency and parallelism profiling Andi Kleen
8 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-05 16:27 UTC (permalink / raw)
To: namhyung, irogers
Cc: linux-perf-users, linux-kernel, Dmitry Vyukov,
Arnaldo Carvalho de Melo
Reorder the struct fields by size to reduce padding, and reduce
struct simd_flags size from 8 bytes to 1 byte.
This shrinks struct hist_entry by 8 bytes (592->584)
and leaves a single, more usable 6-byte padding hole.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
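The before/after layouts below are pahole output; a minimal sketch of how to
regenerate them, assuming a perf binary built with debug info at the usual
in-tree path:

$ pahole -C hist_entry tools/perf/perf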
Pahole output before:
struct hist_entry {
struct rb_node rb_node_in __attribute__((__aligned__(8))); /* 0 24 */
struct rb_node rb_node __attribute__((__aligned__(8))); /* 24 24 */
union {
struct list_head node; /* 48 16 */
struct list_head head; /* 48 16 */
} pairs; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct he_stat stat; /* 64 80 */
/* XXX last struct has 4 bytes of padding */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
struct he_stat * stat_acc; /* 144 8 */
struct map_symbol ms; /* 152 24 */
struct thread * thread; /* 176 8 */
struct comm * comm; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct namespace_id cgroup_id; /* 192 16 */
u64 cgroup; /* 208 8 */
u64 ip; /* 216 8 */
u64 transaction; /* 224 8 */
s32 socket; /* 232 4 */
s32 cpu; /* 236 4 */
int parallelism; /* 240 4 */
/* XXX 4 bytes hole, try to pack */
u64 code_page_size; /* 248 8 */
/* --- cacheline 4 boundary (256 bytes) --- */
u64 weight; /* 256 8 */
u64 ins_lat; /* 264 8 */
u64 p_stage_cyc; /* 272 8 */
u8 cpumode; /* 280 1 */
u8 depth; /* 281 1 */
/* XXX 2 bytes hole, try to pack */
int mem_type_off; /* 284 4 */
struct simd_flags simd_flags; /* 288 8 */
_Bool dummy; /* 296 1 */
_Bool leaf; /* 297 1 */
char level; /* 298 1 */
/* XXX 1 byte hole, try to pack */
filter_mask_t filtered; /* 300 2 */
u16 callchain_size; /* 302 2 */
union {
struct hist_entry_diff diff; /* 304 120 */
struct {
u16 row_offset; /* 304 2 */
u16 nr_rows; /* 306 2 */
_Bool init_have_children; /* 308 1 */
_Bool unfolded; /* 309 1 */
_Bool has_children; /* 310 1 */
_Bool has_no_entry; /* 311 1 */
}; /* 304 8 */
}; /* 304 120 */
/* --- cacheline 6 boundary (384 bytes) was 40 bytes ago --- */
char * srcline; /* 424 8 */
char * srcfile; /* 432 8 */
struct symbol * parent; /* 440 8 */
/* --- cacheline 7 boundary (448 bytes) --- */
struct branch_info * branch_info; /* 448 8 */
long int time; /* 456 8 */
struct hists * hists; /* 464 8 */
struct mem_info * mem_info; /* 472 8 */
struct block_info * block_info; /* 480 8 */
struct kvm_info * kvm_info; /* 488 8 */
void * raw_data; /* 496 8 */
u32 raw_size; /* 504 4 */
int num_res; /* 508 4 */
/* --- cacheline 8 boundary (512 bytes) --- */
struct res_sample * res_samples; /* 512 8 */
void * trace_output; /* 520 8 */
struct perf_hpp_list * hpp_list; /* 528 8 */
struct hist_entry * parent_he; /* 536 8 */
struct hist_entry_ops * ops; /* 544 8 */
struct annotated_data_type * mem_type; /* 552 8 */
union {
struct {
struct rb_root_cached hroot_in; /* 560 16 */
/* --- cacheline 9 boundary (576 bytes) --- */
struct rb_root_cached hroot_out; /* 576 16 */
}; /* 560 32 */
struct rb_root sorted_chain; /* 560 8 */
}; /* 560 32 */
/* --- cacheline 9 boundary (576 bytes) was 16 bytes ago --- */
struct callchain_root callchain[] __attribute__((__aligned__(8))); /* 592 0 */
/* size: 592, cachelines: 10, members: 49 */
/* sum members: 585, holes: 3, sum holes: 7 */
/* paddings: 1, sum paddings: 4 */
/* forced alignments: 3 */
/* last cacheline: 16 bytes */
} __attribute__((__aligned__(8)));
After:
struct hist_entry {
struct rb_node rb_node_in __attribute__((__aligned__(8))); /* 0 24 */
struct rb_node rb_node __attribute__((__aligned__(8))); /* 24 24 */
union {
struct list_head node; /* 48 16 */
struct list_head head; /* 48 16 */
} pairs; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct he_stat stat; /* 64 80 */
/* XXX last struct has 4 bytes of padding */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
struct he_stat * stat_acc; /* 144 8 */
struct map_symbol ms; /* 152 24 */
struct thread * thread; /* 176 8 */
struct comm * comm; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct namespace_id cgroup_id; /* 192 16 */
u64 cgroup; /* 208 8 */
u64 ip; /* 216 8 */
u64 transaction; /* 224 8 */
u64 code_page_size; /* 232 8 */
u64 weight; /* 240 8 */
u64 ins_lat; /* 248 8 */
/* --- cacheline 4 boundary (256 bytes) --- */
u64 p_stage_cyc; /* 256 8 */
s32 socket; /* 264 4 */
s32 cpu; /* 268 4 */
int parallelism; /* 272 4 */
int mem_type_off; /* 276 4 */
u8 cpumode; /* 280 1 */
u8 depth; /* 281 1 */
struct simd_flags simd_flags; /* 282 1 */
_Bool dummy; /* 283 1 */
_Bool leaf; /* 284 1 */
char level; /* 285 1 */
filter_mask_t filtered; /* 286 2 */
u16 callchain_size; /* 288 2 */
/* XXX 6 bytes hole, try to pack */
union {
struct hist_entry_diff diff; /* 296 120 */
struct {
u16 row_offset; /* 296 2 */
u16 nr_rows; /* 298 2 */
_Bool init_have_children; /* 300 1 */
_Bool unfolded; /* 301 1 */
_Bool has_children; /* 302 1 */
_Bool has_no_entry; /* 303 1 */
}; /* 296 8 */
}; /* 296 120 */
/* --- cacheline 6 boundary (384 bytes) was 32 bytes ago --- */
char * srcline; /* 416 8 */
char * srcfile; /* 424 8 */
struct symbol * parent; /* 432 8 */
struct branch_info * branch_info; /* 440 8 */
/* --- cacheline 7 boundary (448 bytes) --- */
long int time; /* 448 8 */
struct hists * hists; /* 456 8 */
struct mem_info * mem_info; /* 464 8 */
struct block_info * block_info; /* 472 8 */
struct kvm_info * kvm_info; /* 480 8 */
void * raw_data; /* 488 8 */
u32 raw_size; /* 496 4 */
int num_res; /* 500 4 */
struct res_sample * res_samples; /* 504 8 */
/* --- cacheline 8 boundary (512 bytes) --- */
void * trace_output; /* 512 8 */
struct perf_hpp_list * hpp_list; /* 520 8 */
struct hist_entry * parent_he; /* 528 8 */
struct hist_entry_ops * ops; /* 536 8 */
struct annotated_data_type * mem_type; /* 544 8 */
union {
struct {
struct rb_root_cached hroot_in; /* 552 16 */
struct rb_root_cached hroot_out; /* 568 16 */
}; /* 552 32 */
struct rb_root sorted_chain; /* 552 8 */
}; /* 552 32 */
/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
struct callchain_root callchain[] __attribute__((__aligned__(8))); /* 584 0 */
/* size: 584, cachelines: 10, members: 49 */
/* sum members: 578, holes: 1, sum holes: 6 */
/* paddings: 1, sum paddings: 4 */
/* forced alignments: 3 */
/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));
---
tools/perf/util/hist.h | 8 ++++----
tools/perf/util/sample.h | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 29d4c7a3d1747..317d06cca8b88 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -239,16 +239,16 @@ struct hist_entry {
u64 cgroup;
u64 ip;
u64 transaction;
- s32 socket;
- s32 cpu;
- int parallelism;
u64 code_page_size;
u64 weight;
u64 ins_lat;
u64 p_stage_cyc;
+ s32 socket;
+ s32 cpu;
+ int parallelism;
+ int mem_type_off;
u8 cpumode;
u8 depth;
- int mem_type_off;
struct simd_flags simd_flags;
/* We are added by hists__add_dummy_entry. */
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 70b2c3135555e..ab756d61cbcd6 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -67,7 +67,7 @@ struct aux_sample {
};
struct simd_flags {
- u64 arch:1, /* architecture (isa) */
+ u8 arch:1, /* architecture (isa) */
pred:2; /* predication */
};
--
2.48.1.362.g079036d154-goog
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
` (7 preceding siblings ...)
2025-02-05 16:27 ` [PATCH v5 8/8] perf hist: Shrink struct hist_entry size Dmitry Vyukov
@ 2025-02-06 18:30 ` Andi Kleen
2025-02-06 18:41 ` Dmitry Vyukov
2025-02-07 8:16 ` Dmitry Vyukov
8 siblings, 2 replies; 32+ messages in thread
From: Andi Kleen @ 2025-02-06 18:30 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
Dmitry Vyukov <dvyukov@google.com> writes:
> There are two notions of time: wall-clock time and CPU time.
> For a single-threaded program, or a program running on a single-core
> machine, these notions are the same. However, for a multi-threaded/
> multi-process program running on a multi-core machine, these notions are
> significantly different. Each second of wall-clock time we have
> number-of-cores seconds of CPU time.
I'm curious how this interacts with the time / --time-quantum sort key?
I assume it just works, but might be worth checking.
It was intended to address some of these issues too.
> Optimizing CPU overhead is useful to improve 'throughput', while
> optimizing wall-clock overhead is useful to improve 'latency'.
> These profiles are complementary and are not interchangeable.
> Examples of where latency profile is needed:
> - optimzing build latency
> - optimizing server request latency
> - optimizing ML training/inference latency
> - optimizing running time of any command line program
>
> CPU profile is useless for these use cases at best (if a user understands
> the difference), or misleading at worst (if a user tries to use a wrong
> profile for a job).
I would agree in the general case, but not if the time sort key
is chosen with a suitable quantum. You can see how the parallelism
changes over time then which is often a good enough proxy.
> We still default to the CPU profile, so it's up to users to learn
> about the second profiling mode and use it when appropriate.
You should add it to tips.txt then
> .../callchain-overhead-calculation.txt | 5 +-
> .../cpu-and-latency-overheads.txt | 85 ++++++++++++++
> tools/perf/Documentation/perf-record.txt | 4 +
> tools/perf/Documentation/perf-report.txt | 54 ++++++---
> tools/perf/Documentation/tips.txt | 3 +
> tools/perf/builtin-record.c | 20 ++++
> tools/perf/builtin-report.c | 39 +++++++
> tools/perf/ui/browsers/hists.c | 27 +++--
> tools/perf/ui/hist.c | 104 ++++++++++++------
> tools/perf/util/addr_location.c | 1 +
> tools/perf/util/addr_location.h | 7 +-
> tools/perf/util/event.c | 11 ++
> tools/perf/util/events_stats.h | 2 +
> tools/perf/util/hist.c | 90 ++++++++++++---
> tools/perf/util/hist.h | 32 +++++-
> tools/perf/util/machine.c | 7 ++
> tools/perf/util/machine.h | 6 +
> tools/perf/util/sample.h | 2 +-
> tools/perf/util/session.c | 12 ++
> tools/perf/util/session.h | 1 +
> tools/perf/util/sort.c | 69 +++++++++++-
> tools/perf/util/sort.h | 3 +-
> tools/perf/util/symbol.c | 34 ++++++
> tools/perf/util/symbol_conf.h | 8 +-
We traditionally didn't do it, but in general test coverage
of perf report is too low, so I would recommend adding a simple
test case to the perf test scripts.
-Andi
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-06 18:30 ` [PATCH v5 0/8] perf report: Add latency and parallelism profiling Andi Kleen
@ 2025-02-06 18:41 ` Dmitry Vyukov
2025-02-06 18:51 ` Ian Rogers
2025-02-06 18:57 ` Andi Kleen
2025-02-07 8:16 ` Dmitry Vyukov
1 sibling, 2 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-06 18:41 UTC (permalink / raw)
To: Andi Kleen
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
>
> Dmitry Vyukov <dvyukov@google.com> writes:
>
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
>
> > I'm curious how this interacts with the time / --time-quantum sort key?
>
> I assume it just works, but might be worth checking.
I will check later. But if you have some concrete commands to try, it
will help. I never used --time-quantum before.
> It was intended to address some of these issues too.
>
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> > - optimzing build latency
> > - optimizing server request latency
> > - optimizing ML training/inference latency
> > - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
>
> I would agree in the general case, but not if the time sort key
> is chosen with a suitable quantum. You can see how the parallelism
> changes over time then which is often a good enough proxy.
Never used it. I will look at what capabilities it provides.
> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
>
> You should add it to tips.txt then
It is done in the docs patch.
> > .../callchain-overhead-calculation.txt | 5 +-
> > .../cpu-and-latency-overheads.txt | 85 ++++++++++++++
> > tools/perf/Documentation/perf-record.txt | 4 +
> > tools/perf/Documentation/perf-report.txt | 54 ++++++---
> > tools/perf/Documentation/tips.txt | 3 +
> > tools/perf/builtin-record.c | 20 ++++
> > tools/perf/builtin-report.c | 39 +++++++
> > tools/perf/ui/browsers/hists.c | 27 +++--
> > tools/perf/ui/hist.c | 104 ++++++++++++------
> > tools/perf/util/addr_location.c | 1 +
> > tools/perf/util/addr_location.h | 7 +-
> > tools/perf/util/event.c | 11 ++
> > tools/perf/util/events_stats.h | 2 +
> > tools/perf/util/hist.c | 90 ++++++++++++---
> > tools/perf/util/hist.h | 32 +++++-
> > tools/perf/util/machine.c | 7 ++
> > tools/perf/util/machine.h | 6 +
> > tools/perf/util/sample.h | 2 +-
> > tools/perf/util/session.c | 12 ++
> > tools/perf/util/session.h | 1 +
> > tools/perf/util/sort.c | 69 +++++++++++-
> > tools/perf/util/sort.h | 3 +-
> > tools/perf/util/symbol.c | 34 ++++++
> > tools/perf/util/symbol_conf.h | 8 +-
>
> We traditionally didn't do it, but in general test coverage
> > of perf report is too low, so I would recommend adding a simple
> > test case to the perf test scripts.
What of this is testable within the current testing framework?
Also how do I run tests? I failed to figure it out.
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-06 18:41 ` Dmitry Vyukov
@ 2025-02-06 18:51 ` Ian Rogers
2025-02-07 3:57 ` Namhyung Kim
2025-02-06 18:57 ` Andi Kleen
1 sibling, 1 reply; 32+ messages in thread
From: Ian Rogers @ 2025-02-06 18:51 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Andi Kleen, namhyung, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Thu, Feb 6, 2025 at 10:41 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
[snip]
> > We traditionally didn't do it, but in general test coverage
> > of perf report is too low, so I would recommend to add some simple
> > test case in the perf test scripts.
>
> What of this is testable within the current testing framework?
> Also how do I run tests? I failed to figure it out.
Often just having a test that ensures a command doesn't segfault is
progress :-) The shell tests Andi mentions are here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
There's no explicit `perf report` test there but maybe the annotate,
diff or record tests could give you some ideas.
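A minimal sketch of what such a test could look like, assuming the --latency
record/report flags from this series and using 'perf bench sched pipe' only
as an example of a short multi-process workload:

#!/bin/bash
# perf record/report --latency smoke test (sketch)
set -e

data=$(mktemp /tmp/__perf_test_latency.data.XXXXX)
trap 'rm -f "${data}"' EXIT

# --latency additionally records context switch events;
# any short multi-process workload would do here.
perf record --latency -o "${data}" -- perf bench sched pipe -l 1000 > /dev/null

# basic sanity: report must accept --latency and must not crash
perf report --latency -i "${data}" --stdio > /dev/null

Even just checking the exit codes of the two commands is already useful
coverage.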
Thanks,
Ian
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-06 18:41 ` Dmitry Vyukov
2025-02-06 18:51 ` Ian Rogers
@ 2025-02-06 18:57 ` Andi Kleen
2025-02-06 19:07 ` Andi Kleen
1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2025-02-06 18:57 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Thu, Feb 06, 2025 at 07:41:00PM +0100, Dmitry Vyukov wrote:
> On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> >
> > Dmitry Vyukov <dvyukov@google.com> writes:
> >
> > > There are two notions of time: wall-clock time and CPU time.
> > > For a single-threaded program, or a program running on a single-core
> > > machine, these notions are the same. However, for a multi-threaded/
> > > multi-process program running on a multi-core machine, these notions are
> > > significantly different. Each second of wall-clock time we have
> > > number-of-cores seconds of CPU time.
> >
> > > I'm curious how this interacts with the time / --time-quantum sort key?
> >
> > I assume it just works, but might be worth checking.
>
> I will check later. But if you have some concrete commands to try, it
> will help. I never used --time-quantum before.
perf report --sort time,overhead,sym
It just slices perf.data into time slices so you get a time series
instead of full aggregation.
--time-quantum is optional; it sets the slice length.
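One way to check the interaction might be to combine the two on a parallel
workload, e.g. (assuming the latency/parallelism sort keys from this series;
whether the per-slice numbers come out meaningful is exactly the open
question):

$ perf record --latency -- make -j8
$ perf report --sort time,latency,parallelism,sym --time-quantum 100ms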
> > > tools/perf/util/symbol_conf.h | 8 +-
> >
> > We traditionally didn't do it, but in general test coverage
> > of perf report is too low, so I would recommend to add some simple
> > test case in the perf test scripts.
>
> What of this is testable within the current testing framework?
You can write a shell script that runs it and does
some basic sanity checking. tests/shell has a lot of examples.
If you don't do that someone will break it like it happened
to some of my features :/
> Also how do I run tests? I failed to figure it out.
Just "perf test"
-Andi
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-06 18:57 ` Andi Kleen
@ 2025-02-06 19:07 ` Andi Kleen
0 siblings, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2025-02-06 19:07 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
> > Also how do I run tests? I failed to figure it out.
>
> Just "perf test"
Or actually did you mean you can't run it in the build directory?
For that you may need my patch
https://lore.kernel.org/linux-perf-users/20240813213651.1057362-2-ak@linux.intel.com/
(sadly it's still not applied)
-Andi
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-05 16:27 ` [PATCH v5 6/8] perf report: Add --latency flag Dmitry Vyukov
@ 2025-02-07 3:44 ` Namhyung Kim
2025-02-07 7:23 ` Dmitry Vyukov
2025-02-07 3:53 ` Namhyung Kim
1 sibling, 1 reply; 32+ messages in thread
From: Namhyung Kim @ 2025-02-07 3:44 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> Add record/report --latency flag that allows to capture and show
> latency-centric profiles rather than the default CPU-consumption-centric
> profiles. For latency profiles record captures context switch events,
> and report shows Latency as the first column.
>
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Ian Rogers <irogers@google.com>
> Cc: linux-perf-users@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
> ---
> Changes in v5:
> - added description of --latency flag in Documentation flags
> ---
> tools/perf/Documentation/perf-record.txt | 4 +++
> tools/perf/Documentation/perf-report.txt | 5 +++
> tools/perf/builtin-record.c | 20 ++++++++++++
> tools/perf/builtin-report.c | 32 +++++++++++++++---
> tools/perf/ui/hist.c | 41 +++++++++++++++++++-----
> tools/perf/util/hist.h | 1 +
> tools/perf/util/sort.c | 33 +++++++++++++++----
> tools/perf/util/sort.h | 2 +-
> tools/perf/util/symbol_conf.h | 4 ++-
> 9 files changed, 122 insertions(+), 20 deletions(-)
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 80686d590de24..c7fc1ba265e27 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -227,6 +227,10 @@ OPTIONS
> '--filter' exists, the new filter expression will be combined with
> them by '&&'.
>
> +--latency::
> + Enable data collection for latency profiling.
> + Use perf report --latency for latency-centric profile.
> +
> -a::
> --all-cpus::
> System-wide collection from all CPUs (default if no target is specified).
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 87f8645194062..66794131aec48 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -68,6 +68,11 @@ OPTIONS
> --hide-unresolved::
> Only display entries resolved to a symbol.
>
> +--latency::
> + Show latency-centric profile rather than the default
> + CPU-consumption-centric profile
> + (requires perf record --latency flag).
> +
> -s::
> --sort=::
> Sort histogram entries by given key(s) - multiple keys can be specified
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 5db1aedf48df9..e219639ac401b 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -161,6 +161,7 @@ struct record {
> struct evlist *sb_evlist;
> pthread_t thread_id;
> int realtime_prio;
> + bool latency;
> bool switch_output_event_set;
> bool no_buildid;
> bool no_buildid_set;
> @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> parse_events_option),
> OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> "event filter", parse_filter),
> + OPT_BOOLEAN(0, "latency", &record.latency,
> + "Enable data collection for latency profiling.\n"
> + "\t\t\t Use perf report --latency for latency-centric profile."),
> OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> NULL, "don't record events from perf itself",
> exclude_perf),
> @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
>
> }
>
> + if (record.latency) {
> + /*
> + * There is no fundamental reason why latency profiling
> + * can't work for system-wide mode, but exact semantics
> + * and details are to be defined.
> + * See the following thread for details:
> + * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> + */
> + if (record.opts.target.system_wide) {
> + pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> + err = -EINVAL;
> + goto out_opts;
> + }
> + record.opts.record_switch_events = true;
> + }
> +
> if (rec->buildid_mmap) {
> if (!perf_can_record_build_id()) {
> pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 2a19abdc869a1..69de6dbefecfa 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -112,6 +112,8 @@ struct report {
> u64 nr_entries;
> u64 queue_size;
> u64 total_cycles;
> + u64 total_samples;
> + u64 singlethreaded_samples;
> int socket_filter;
> DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> struct branch_type_stat brtype_stat;
> @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> &rep->total_cycles, evsel);
> }
>
> + rep->total_samples++;
> + if (al.parallelism == 1)
> + rep->singlethreaded_samples++;
> +
> ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> if (ret < 0)
> pr_debug("problem adding hist entry, skipping event\n");
> @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> return ret;
> }
>
> + /* Don't show Latency column for non-parallel profiles by default. */
> + if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> + !symbol_conf.prefer_latency)
> + perf_hpp__cancel_latency();
> +
> evlist__check_mem_load_aux(session->evlist);
>
> if (rep->stats_mode)
> @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> "Disable raw trace ordering"),
> OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> "Do not display empty (or dummy) events in the output"),
> + OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> + "Show latency-centric profile rather than the default\n"
> + "\t\t\t CPU-consumption-centric profile\n"
> + "\t\t\t (requires perf record --latency flag)."),
> OPT_END()
> };
> struct perf_data data = {
> @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> symbol_conf.annotate_data_sample = true;
> }
>
> + symbol_conf.enable_latency = true;
> if (report.disable_order || !perf_session__has_switch_events(session)) {
> if (symbol_conf.parallelism_list_str ||
> - (sort_order && strstr(sort_order, "parallelism")) ||
> - (field_order && strstr(field_order, "parallelism"))) {
> + symbol_conf.prefer_latency ||
> + (sort_order && (strstr(sort_order, "latency") ||
> + strstr(sort_order, "parallelism"))) ||
> + (field_order && (strstr(field_order, "latency") ||
> + strstr(field_order, "parallelism")))) {
> if (report.disable_order)
> - ui__error("Use of parallelism is incompatible with --disable-order.\n");
> + ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> else
> - ui__error("Use of parallelism requires --switch-events during record.\n");
> + ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> return -1;
> }
> + /*
> + * If user did not ask for anything related to
> + * latency/parallelism explicitly, just don't show it.
> + */
> + symbol_conf.enable_latency = false;
> }
>
> if (sort_order && strstr(sort_order, "ipc")) {
> diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> index 6de6309595f9e..b453f8eb579cc 100644
> --- a/tools/perf/ui/hist.c
> +++ b/tools/perf/ui/hist.c
> @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> return;
>
> if (symbol_conf.cumulate_callchain) {
> - hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> + /* Use idempotent addition to avoid more complex logic. */
> + if (symbol_conf.prefer_latency)
> + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> + hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> + if (symbol_conf.enable_latency)
> + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> }
>
> - hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> + if (symbol_conf.prefer_latency)
> + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> + hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> + if (symbol_conf.enable_latency)
> + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
>
> if (symbol_conf.show_cpu_utilization) {
> - hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> - hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> + hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> + hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
>
> if (perf_guest) {
> - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> }
> }
>
> if (symbol_conf.show_nr_samples)
> - hpp_dimension__add_output(PERF_HPP__SAMPLES);
> + hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
>
> if (symbol_conf.show_total_period)
> - hpp_dimension__add_output(PERF_HPP__PERIOD);
> + hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> }
>
> void perf_hpp_list__column_register(struct perf_hpp_list *list,
> @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> }
> }
>
> +void perf_hpp__cancel_latency(void)
> +{
> + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> +
> + if (is_strict_order(field_order) || is_strict_order(sort_order))
> + return;
> +
> + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> +
> + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> + perf_hpp__column_unregister(fmt);
> + }
> +}
> +
> void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> {
> struct perf_hpp_fmt *fmt;
> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> index 91159f16c60b2..29d4c7a3d1747 100644
> --- a/tools/perf/util/hist.h
> +++ b/tools/perf/util/hist.h
> @@ -582,6 +582,7 @@ enum {
>
> void perf_hpp__init(void);
> void perf_hpp__cancel_cumulate(void);
> +void perf_hpp__cancel_latency(void);
> void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index bc4c3acfe7552..2b6023de7a53a 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> const char *name;
> struct perf_hpp_fmt *fmt;
> int taken;
> + int was_taken;
> };
>
> #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> return -1;
>
> hd->taken = 1;
> + hd->was_taken = 1;
> perf_hpp_list__register_sort_field(list, fmt);
> return 0;
> }
> @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> return 0;
> }
>
> -int hpp_dimension__add_output(unsigned col)
> +int hpp_dimension__add_output(unsigned col, bool implicit)
> {
> + struct hpp_dimension *hd;
> +
> BUG_ON(col >= PERF_HPP__MAX_INDEX);
> - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> + hd = &hpp_sort_dimensions[col];
> + if (implicit && !hd->was_taken)
> + return 0;
I don't think you need these implicit and was_taken things.
Just removing it from the sort list when it's unregistered seems to work.
---8<---
@@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
{
list_del_init(&format->list);
+ list_del_init(&format->sort_list);
fmt_free(format);
}
---8<---
Thanks,
Namhyung
> + return __hpp_dimension__add_output(&perf_hpp_list, hd);
> }
>
> int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> @@ -3809,10 +3816,24 @@ static char *setup_overhead(char *keys)
> if (sort__mode == SORT_MODE__DIFF)
> return keys;
>
> - keys = prefix_if_not_in("overhead", keys);
> -
> - if (symbol_conf.cumulate_callchain)
> - keys = prefix_if_not_in("overhead_children", keys);
> + if (symbol_conf.prefer_latency) {
> + keys = prefix_if_not_in("overhead", keys);
> + keys = prefix_if_not_in("latency", keys);
> + if (symbol_conf.cumulate_callchain) {
> + keys = prefix_if_not_in("overhead_children", keys);
> + keys = prefix_if_not_in("latency_children", keys);
> + }
> + } else if (!keys || (!strstr(keys, "overhead") &&
> + !strstr(keys, "latency"))) {
> + if (symbol_conf.enable_latency)
> + keys = prefix_if_not_in("latency", keys);
> + keys = prefix_if_not_in("overhead", keys);
> + if (symbol_conf.cumulate_callchain) {
> + if (symbol_conf.enable_latency)
> + keys = prefix_if_not_in("latency_children", keys);
> + keys = prefix_if_not_in("overhead_children", keys);
> + }
> + }
>
> return keys;
> }
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 11fb15f914093..180d36a2bea35 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -141,7 +141,7 @@ int report_parse_ignore_callees_opt(const struct option *opt, const char *arg, i
>
> bool is_strict_order(const char *order);
>
> -int hpp_dimension__add_output(unsigned col);
> +int hpp_dimension__add_output(unsigned col, bool implicit);
> void reset_dimensions(void);
> int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> struct evlist *evlist,
> diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
> index c5b2e56127e22..cd9aa82c7d5ad 100644
> --- a/tools/perf/util/symbol_conf.h
> +++ b/tools/perf/util/symbol_conf.h
> @@ -49,7 +49,9 @@ struct symbol_conf {
> keep_exited_threads,
> annotate_data_member,
> annotate_data_sample,
> - skip_empty;
> + skip_empty,
> + enable_latency,
> + prefer_latency;
> const char *vmlinux_name,
> *kallsyms_name,
> *source_prefix,
> --
> 2.48.1.362.g079036d154-goog
>
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-05 16:27 ` [PATCH v5 6/8] perf report: Add --latency flag Dmitry Vyukov
2025-02-07 3:44 ` Namhyung Kim
@ 2025-02-07 3:53 ` Namhyung Kim
2025-02-07 11:42 ` Dmitry Vyukov
1 sibling, 1 reply; 32+ messages in thread
From: Namhyung Kim @ 2025-02-07 3:53 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> Add record/report --latency flag that allows to capture and show
> latency-centric profiles rather than the default CPU-consumption-centric
> profiles. For latency profiles record captures context switch events,
> and report shows Latency as the first column.
>
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Ian Rogers <irogers@google.com>
> Cc: linux-perf-users@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
> ---
[SNIP]
> +void perf_hpp__cancel_latency(void)
> +{
> + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> +
> + if (is_strict_order(field_order) || is_strict_order(sort_order))
> + return;
This also needs to be changed since you call setup_overhead even if you
have a strict sort_order.
For example, it's strange these two are different.
With default sort order:
$ perf record --latency -- ls /
$ perf report --stdio
...
# Overhead Command Shared Object Symbol
# ........ ....... .................... ........................
#
64.50% ls ld-linux-x86-64.so.2 [.] do_lookup_x
33.41% ls [kernel.kallsyms] [k] chacha_block_generic
2.00% perf-ex [kernel.kallsyms] [k] put_ctx
0.09% perf-ex [kernel.kallsyms] [k] native_write_msr
Same sort order, but explicitly:
$ perf report --stdio -s comm,dso,sym
...
# Overhead Latency Command Shared Object Symbol
# ........ ........ ....... .................... ........................
#
64.50% 64.50% ls ld-linux-x86-64.so.2 [.] do_lookup_x
33.41% 33.41% ls [kernel.kallsyms] [k] chacha_block_generic
2.00% 2.00% perf-ex [kernel.kallsyms] [k] put_ctx
0.09% 0.09% perf-ex [kernel.kallsyms] [k] native_write_msr
Maybe you want to cancel the latency field even if a sort key is given
(unless it has 'latency').
Something like this?
---8<---
@@ -714,7 +715,9 @@ void perf_hpp__cancel_latency(void)
{
struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
- if (is_strict_order(field_order) || is_strict_order(sort_order))
+ if (is_strict_order(field_order))
+ return;
+ if (is_strict_order(sort_order) && strstr(sort_order, "latency"))
return;
lat = &perf_hpp__format[PERF_HPP__LATENCY];
---8<---
Thanks,
Namhyung
> +
> + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> +
> + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> + perf_hpp__column_unregister(fmt);
> + }
> +}
> +
> void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> {
> struct perf_hpp_fmt *fmt;
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-06 18:51 ` Ian Rogers
@ 2025-02-07 3:57 ` Namhyung Kim
2025-02-07 11:44 ` Dmitry Vyukov
0 siblings, 1 reply; 32+ messages in thread
From: Namhyung Kim @ 2025-02-07 3:57 UTC (permalink / raw)
To: Ian Rogers
Cc: Dmitry Vyukov, Andi Kleen, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Thu, Feb 06, 2025 at 10:51:16AM -0800, Ian Rogers wrote:
> On Thu, Feb 6, 2025 at 10:41 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> [snip]
> > > We traditionally didn't do it, but in general test coverage
> > > of perf report is too low, so I would recommend adding some simple
> > > test cases in the perf test scripts.
> >
> > What of this is testable within the current testing framework?
> > Also how do I run tests? I failed to figure it out.
>
> Often just having a test that ensures a command doesn't segfault is
> progress :-) The shell tests Andi mentions are here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
> There's no explicit `perf report` test there but maybe the annotate,
> diff or record tests could give you some ideas.
Well, we have. :)
$ ./perf test -vv report
116: perftool-testsuite_report:
--- start ---
test child forked, pid 109653
-- [ PASS ] -- perf_report :: setup :: prepare the perf.data file
## [ PASS ] ## perf_report :: setup SUMMARY
-- [ SKIP ] -- perf_report :: test_basic :: help message :: testcase skipped
-- [ PASS ] -- perf_report :: test_basic :: basic execution
-- [ PASS ] -- perf_report :: test_basic :: number of samples
-- [ PASS ] -- perf_report :: test_basic :: header
-- [ PASS ] -- perf_report :: test_basic :: header timestamp
-- [ PASS ] -- perf_report :: test_basic :: show CPU utilization
-- [ PASS ] -- perf_report :: test_basic :: pid
-- [ PASS ] -- perf_report :: test_basic :: non-existing symbol
-- [ PASS ] -- perf_report :: test_basic :: symbol filter
## [ PASS ] ## perf_report :: test_basic SUMMARY
---- end(0) ----
116: perftool-testsuite_report : Ok
Thanks,
Namhyung
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-07 3:44 ` Namhyung Kim
@ 2025-02-07 7:23 ` Dmitry Vyukov
2025-02-11 1:02 ` Namhyung Kim
0 siblings, 1 reply; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-07 7:23 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > Add a record/report --latency flag that allows capturing and showing
> > latency-centric profiles rather than the default CPU-consumption-centric
> > profiles. For latency profiles record captures context switch events,
> > and report shows Latency as the first column.
> >
> > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: linux-perf-users@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > ---
> > Changes in v5:
> > - added description of --latency flag in Documentation flags
> > ---
> > tools/perf/Documentation/perf-record.txt | 4 +++
> > tools/perf/Documentation/perf-report.txt | 5 +++
> > tools/perf/builtin-record.c | 20 ++++++++++++
> > tools/perf/builtin-report.c | 32 +++++++++++++++---
> > tools/perf/ui/hist.c | 41 +++++++++++++++++++-----
> > tools/perf/util/hist.h | 1 +
> > tools/perf/util/sort.c | 33 +++++++++++++++----
> > tools/perf/util/sort.h | 2 +-
> > tools/perf/util/symbol_conf.h | 4 ++-
> > 9 files changed, 122 insertions(+), 20 deletions(-)
> >
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index 80686d590de24..c7fc1ba265e27 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -227,6 +227,10 @@ OPTIONS
> > '--filter' exists, the new filter expression will be combined with
> > them by '&&'.
> >
> > +--latency::
> > + Enable data collection for latency profiling.
> > + Use perf report --latency for latency-centric profile.
> > +
> > -a::
> > --all-cpus::
> > System-wide collection from all CPUs (default if no target is specified).
> > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > index 87f8645194062..66794131aec48 100644
> > --- a/tools/perf/Documentation/perf-report.txt
> > +++ b/tools/perf/Documentation/perf-report.txt
> > @@ -68,6 +68,11 @@ OPTIONS
> > --hide-unresolved::
> > Only display entries resolved to a symbol.
> >
> > +--latency::
> > + Show latency-centric profile rather than the default
> > + CPU-consumption-centric profile
> > + (requires perf record --latency flag).
> > +
> > -s::
> > --sort=::
> > Sort histogram entries by given key(s) - multiple keys can be specified
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index 5db1aedf48df9..e219639ac401b 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -161,6 +161,7 @@ struct record {
> > struct evlist *sb_evlist;
> > pthread_t thread_id;
> > int realtime_prio;
> > + bool latency;
> > bool switch_output_event_set;
> > bool no_buildid;
> > bool no_buildid_set;
> > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > parse_events_option),
> > OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > "event filter", parse_filter),
> > + OPT_BOOLEAN(0, "latency", &record.latency,
> > + "Enable data collection for latency profiling.\n"
> > + "\t\t\t Use perf report --latency for latency-centric profile."),
> > OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > NULL, "don't record events from perf itself",
> > exclude_perf),
> > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> >
> > }
> >
> > + if (record.latency) {
> > + /*
> > + * There is no fundamental reason why latency profiling
> > + * can't work for system-wide mode, but exact semantics
> > + * and details are to be defined.
> > + * See the following thread for details:
> > + * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > + */
> > + if (record.opts.target.system_wide) {
> > + pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > + err = -EINVAL;
> > + goto out_opts;
> > + }
> > + record.opts.record_switch_events = true;
> > + }
> > +
> > if (rec->buildid_mmap) {
> > if (!perf_can_record_build_id()) {
> > pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > index 2a19abdc869a1..69de6dbefecfa 100644
> > --- a/tools/perf/builtin-report.c
> > +++ b/tools/perf/builtin-report.c
> > @@ -112,6 +112,8 @@ struct report {
> > u64 nr_entries;
> > u64 queue_size;
> > u64 total_cycles;
> > + u64 total_samples;
> > + u64 singlethreaded_samples;
> > int socket_filter;
> > DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > struct branch_type_stat brtype_stat;
> > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > &rep->total_cycles, evsel);
> > }
> >
> > + rep->total_samples++;
> > + if (al.parallelism == 1)
> > + rep->singlethreaded_samples++;
> > +
> > ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > if (ret < 0)
> > pr_debug("problem adding hist entry, skipping event\n");
> > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > return ret;
> > }
> >
> > + /* Don't show Latency column for non-parallel profiles by default. */
> > + if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > + !symbol_conf.prefer_latency)
> > + perf_hpp__cancel_latency();
> > +
> > evlist__check_mem_load_aux(session->evlist);
> >
> > if (rep->stats_mode)
> > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > "Disable raw trace ordering"),
> > OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > "Do not display empty (or dummy) events in the output"),
> > + OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > + "Show latency-centric profile rather than the default\n"
> > + "\t\t\t CPU-consumption-centric profile\n"
> > + "\t\t\t (requires perf record --latency flag)."),
> > OPT_END()
> > };
> > struct perf_data data = {
> > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > symbol_conf.annotate_data_sample = true;
> > }
> >
> > + symbol_conf.enable_latency = true;
> > if (report.disable_order || !perf_session__has_switch_events(session)) {
> > if (symbol_conf.parallelism_list_str ||
> > - (sort_order && strstr(sort_order, "parallelism")) ||
> > - (field_order && strstr(field_order, "parallelism"))) {
> > + symbol_conf.prefer_latency ||
> > + (sort_order && (strstr(sort_order, "latency") ||
> > + strstr(sort_order, "parallelism"))) ||
> > + (field_order && (strstr(field_order, "latency") ||
> > + strstr(field_order, "parallelism")))) {
> > if (report.disable_order)
> > - ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > + ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > else
> > - ui__error("Use of parallelism requires --switch-events during record.\n");
> > + ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > return -1;
> > }
> > + /*
> > + * If user did not ask for anything related to
> > + * latency/parallelism explicitly, just don't show it.
> > + */
> > + symbol_conf.enable_latency = false;
> > }
> >
> > if (sort_order && strstr(sort_order, "ipc")) {
> > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > index 6de6309595f9e..b453f8eb579cc 100644
> > --- a/tools/perf/ui/hist.c
> > +++ b/tools/perf/ui/hist.c
> > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > return;
> >
> > if (symbol_conf.cumulate_callchain) {
> > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > + /* Use idempotent addition to avoid more complex logic. */
> > + if (symbol_conf.prefer_latency)
> > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > + if (symbol_conf.enable_latency)
> > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > }
> >
> > - hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > + if (symbol_conf.prefer_latency)
> > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > + hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > + if (symbol_conf.enable_latency)
> > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> >
> > if (symbol_conf.show_cpu_utilization) {
> > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> >
> > if (perf_guest) {
> > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > }
> > }
> >
> > if (symbol_conf.show_nr_samples)
> > - hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > + hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> >
> > if (symbol_conf.show_total_period)
> > - hpp_dimension__add_output(PERF_HPP__PERIOD);
> > + hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > }
> >
> > void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > }
> > }
> >
> > +void perf_hpp__cancel_latency(void)
> > +{
> > + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > +
> > + if (is_strict_order(field_order) || is_strict_order(sort_order))
> > + return;
> > +
> > + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > +
> > + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > + perf_hpp__column_unregister(fmt);
> > + }
> > +}
> > +
> > void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > {
> > struct perf_hpp_fmt *fmt;
> > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > index 91159f16c60b2..29d4c7a3d1747 100644
> > --- a/tools/perf/util/hist.h
> > +++ b/tools/perf/util/hist.h
> > @@ -582,6 +582,7 @@ enum {
> >
> > void perf_hpp__init(void);
> > void perf_hpp__cancel_cumulate(void);
> > +void perf_hpp__cancel_latency(void);
> > void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index bc4c3acfe7552..2b6023de7a53a 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > const char *name;
> > struct perf_hpp_fmt *fmt;
> > int taken;
> > + int was_taken;
> > };
> >
> > #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > return -1;
> >
> > hd->taken = 1;
> > + hd->was_taken = 1;
> > perf_hpp_list__register_sort_field(list, fmt);
> > return 0;
> > }
> > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > return 0;
> > }
> >
> > -int hpp_dimension__add_output(unsigned col)
> > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > {
> > + struct hpp_dimension *hd;
> > +
> > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > + hd = &hpp_sort_dimensions[col];
> > + if (implicit && !hd->was_taken)
> > + return 0;
>
> I don't think you need these implicit and was_taken things.
> Just removing from the sort list when it's unregistered seems to work.
>
> ---8<---
> @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> {
> list_del_init(&format->list);
> + list_del_init(&format->sort_list);
> fmt_free(format);
> }
>
> ---8<---
It merely suppresses the warning, but does not work the same way. See
this for details:
https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> Thanks,
> Namhyung
>
>
> > + return __hpp_dimension__add_output(&perf_hpp_list, hd);
> > }
> >
> > int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> > @@ -3809,10 +3816,24 @@ static char *setup_overhead(char *keys)
> > if (sort__mode == SORT_MODE__DIFF)
> > return keys;
> >
> > - keys = prefix_if_not_in("overhead", keys);
> > -
> > - if (symbol_conf.cumulate_callchain)
> > - keys = prefix_if_not_in("overhead_children", keys);
> > + if (symbol_conf.prefer_latency) {
> > + keys = prefix_if_not_in("overhead", keys);
> > + keys = prefix_if_not_in("latency", keys);
> > + if (symbol_conf.cumulate_callchain) {
> > + keys = prefix_if_not_in("overhead_children", keys);
> > + keys = prefix_if_not_in("latency_children", keys);
> > + }
> > + } else if (!keys || (!strstr(keys, "overhead") &&
> > + !strstr(keys, "latency"))) {
> > + if (symbol_conf.enable_latency)
> > + keys = prefix_if_not_in("latency", keys);
> > + keys = prefix_if_not_in("overhead", keys);
> > + if (symbol_conf.cumulate_callchain) {
> > + if (symbol_conf.enable_latency)
> > + keys = prefix_if_not_in("latency_children", keys);
> > + keys = prefix_if_not_in("overhead_children", keys);
> > + }
> > + }
> >
> > return keys;
> > }
> > diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> > index 11fb15f914093..180d36a2bea35 100644
> > --- a/tools/perf/util/sort.h
> > +++ b/tools/perf/util/sort.h
> > @@ -141,7 +141,7 @@ int report_parse_ignore_callees_opt(const struct option *opt, const char *arg, i
> >
> > bool is_strict_order(const char *order);
> >
> > -int hpp_dimension__add_output(unsigned col);
> > +int hpp_dimension__add_output(unsigned col, bool implicit);
> > void reset_dimensions(void);
> > int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
> > struct evlist *evlist,
> > diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
> > index c5b2e56127e22..cd9aa82c7d5ad 100644
> > --- a/tools/perf/util/symbol_conf.h
> > +++ b/tools/perf/util/symbol_conf.h
> > @@ -49,7 +49,9 @@ struct symbol_conf {
> > keep_exited_threads,
> > annotate_data_member,
> > annotate_data_sample,
> > - skip_empty;
> > + skip_empty,
> > + enable_latency,
> > + prefer_latency;
> > const char *vmlinux_name,
> > *kallsyms_name,
> > *source_prefix,
> > --
> > 2.48.1.362.g079036d154-goog
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-06 18:30 ` [PATCH v5 0/8] perf report: Add latency and parallelism profiling Andi Kleen
2025-02-06 18:41 ` Dmitry Vyukov
@ 2025-02-07 8:16 ` Dmitry Vyukov
2025-02-07 18:30 ` Andi Kleen
1 sibling, 1 reply; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-07 8:16 UTC (permalink / raw)
To: Andi Kleen
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
>
> Dmitry Vyukov <dvyukov@google.com> writes:
>
> > There are two notions of time: wall-clock time and CPU time.
> > For a single-threaded program, or a program running on a single-core
> > machine, these notions are the same. However, for a multi-threaded/
> > multi-process program running on a multi-core machine, these notions are
> > significantly different. Each second of wall-clock time we have
> > number-of-cores seconds of CPU time.
>
> I'm curious how does this interact with the time / --time-quantum sort key?
>
> I assume it just works, but might be worth checking.
Yes, it seems to just work as one would assume. Things just combine as intended.
> It was intended to address some of these issues too.
What issue? Latency profiling? I wonder what approach you had in mind?
> > Optimizing CPU overhead is useful to improve 'throughput', while
> > optimizing wall-clock overhead is useful to improve 'latency'.
> > These profiles are complementary and are not interchangeable.
> > Examples of where latency profile is needed:
> > - optimizing build latency
> > - optimizing server request latency
> > - optimizing ML training/inference latency
> > - optimizing running time of any command line program
> >
> > CPU profile is useless for these use cases at best (if a user understands
> > the difference), or misleading at worst (if a user tries to use a wrong
> > profile for a job).
>
> I would agree in the general case, but not if the time sort key
> is chosen with a suitable quantum. You can see how the parallelism
> changes over time then which is often a good enough proxy.
That's an interesting feature, but I don't see how it helps with latency.
How do you infer parallelism for slices? It looks like it just gives
the same wrong CPU profile, but multiple times (for each slice).
Also (1) user still needs to understand the default profile is wrong,
(2) be proficient with perf features, (3) manually aggregate lots of
data (time slicing increases amount of data in the profile X times),
(4) deal with inaccuracy caused by edge effects (e.g. slice is 1s, but
program phase changed mid-second).
But it does open some interesting capabilities in combination with a
latency profile, e.g. the following shows how parallelism was changing
over time.
for perf make profile:
perf report -F time,latency,parallelism --time-quantum=1s
# Time Latency Parallelism
# ............ ........ ...........
#
1795957.0000 1.42% 1
1795957.0000 0.07% 2
1795957.0000 0.01% 3
1795957.0000 0.00% 4
1795958.0000 4.82% 1
1795958.0000 0.11% 2
1795958.0000 0.00% 3
...
1795964.0000 1.76% 2
1795964.0000 0.58% 4
1795964.0000 0.45% 1
1795964.0000 0.23% 10
1795964.0000 0.21% 6
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
Here it finally started running on more than 1 CPU.
> > We still default to the CPU profile, so it's up to users to learn
> > about the second profiling mode and use it when appropriate.
>
> You should add it to tips.txt then
>
> > .../callchain-overhead-calculation.txt | 5 +-
> > .../cpu-and-latency-overheads.txt | 85 ++++++++++++++
> > tools/perf/Documentation/perf-record.txt | 4 +
> > tools/perf/Documentation/perf-report.txt | 54 ++++++---
> > tools/perf/Documentation/tips.txt | 3 +
> > tools/perf/builtin-record.c | 20 ++++
> > tools/perf/builtin-report.c | 39 +++++++
> > tools/perf/ui/browsers/hists.c | 27 +++--
> > tools/perf/ui/hist.c | 104 ++++++++++++------
> > tools/perf/util/addr_location.c | 1 +
> > tools/perf/util/addr_location.h | 7 +-
> > tools/perf/util/event.c | 11 ++
> > tools/perf/util/events_stats.h | 2 +
> > tools/perf/util/hist.c | 90 ++++++++++++---
> > tools/perf/util/hist.h | 32 +++++-
> > tools/perf/util/machine.c | 7 ++
> > tools/perf/util/machine.h | 6 +
> > tools/perf/util/sample.h | 2 +-
> > tools/perf/util/session.c | 12 ++
> > tools/perf/util/session.h | 1 +
> > tools/perf/util/sort.c | 69 +++++++++++-
> > tools/perf/util/sort.h | 3 +-
> > tools/perf/util/symbol.c | 34 ++++++
> > tools/perf/util/symbol_conf.h | 8 +-
>
> We traditionally didn't do it, but in general test coverage
> of perf report is too low, so I would recommend adding some simple
> test cases in the perf test scripts.
>
> -Andi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-07 3:53 ` Namhyung Kim
@ 2025-02-07 11:42 ` Dmitry Vyukov
0 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-07 11:42 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Fri, 7 Feb 2025 at 04:53, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > Add a record/report --latency flag that allows capturing and showing
> > latency-centric profiles rather than the default CPU-consumption-centric
> > profiles. For latency profiles record captures context switch events,
> > and report shows Latency as the first column.
> >
> > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: linux-perf-users@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > ---
> [SNIP]
> > +void perf_hpp__cancel_latency(void)
> > +{
> > + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > +
> > + if (is_strict_order(field_order) || is_strict_order(sort_order))
> > + return;
>
> This also needs to be changed since you call setup_overhead even if you
> have a strict sort_order.
>
> For example, it's strange these two are different.
>
> With default sort order:
>
> $ perf record --latency -- ls /
>
> $ perf report --stdio
> ...
> # Overhead Command Shared Object Symbol
> # ........ ....... .................... ........................
> #
> 64.50% ls ld-linux-x86-64.so.2 [.] do_lookup_x
> 33.41% ls [kernel.kallsyms] [k] chacha_block_generic
> 2.00% perf-ex [kernel.kallsyms] [k] put_ctx
> 0.09% perf-ex [kernel.kallsyms] [k] native_write_msr
>
> Same sort order, but explicitly:
>
> $ perf report --stdio -s comm,dso,sym
> ...
> # Overhead Latency Command Shared Object Symbol
> # ........ ........ ....... .................... ........................
> #
> 64.50% 64.50% ls ld-linux-x86-64.so.2 [.] do_lookup_x
> 33.41% 33.41% ls [kernel.kallsyms] [k] chacha_block_generic
> 2.00% 2.00% perf-ex [kernel.kallsyms] [k] put_ctx
> 0.09% 0.09% perf-ex [kernel.kallsyms] [k] native_write_msr
>
> Maybe you want to cancel the latency field even if a sort key is given
> (unless it has 'latency').
>
> Something like this?
Makes sense! Done in v6.
> ---8<---
> @@ -714,7 +715,9 @@ void perf_hpp__cancel_latency(void)
> {
> struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
>
> - if (is_strict_order(field_order) || is_strict_order(sort_order))
> + if (is_strict_order(field_order))
> + return;
> + if (is_strict_order(sort_order) && strstr(sort_order, "latency"))
> return;
>
> lat = &perf_hpp__format[PERF_HPP__LATENCY];
> ---8<---
>
> Thanks,
> Namhyung
>
> > +
> > + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > +
> > + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > + perf_hpp__column_unregister(fmt);
> > + }
> > +}
> > +
> > void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > {
> > struct perf_hpp_fmt *fmt;
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-07 3:57 ` Namhyung Kim
@ 2025-02-07 11:44 ` Dmitry Vyukov
0 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-07 11:44 UTC (permalink / raw)
To: Namhyung Kim
Cc: Ian Rogers, Andi Kleen, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Fri, 7 Feb 2025 at 04:57, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Thu, Feb 06, 2025 at 10:51:16AM -0800, Ian Rogers wrote:
> > On Thu, Feb 6, 2025 at 10:41 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > On Thu, 6 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> > [snip]
> > > > We traditionally didn't do it, but in general test coverage
> > > > of perf report is too low, so I would recommend adding some simple
> > > > test cases in the perf test scripts.
> > >
> > > What of this is testable within the current testing framework?
> > > Also how do I run tests? I failed to figure it out.
> >
> > Often just having a test that ensures a command doesn't segfault is
> > progress :-) The shell tests Andi mentions are here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
> > There's no explicit `perf report` test there but maybe the annotate,
> > diff or record tests could give you some ideas.
>
> Well, we have. :)
>
> $ ./perf test -vv report
> 116: perftool-testsuite_report:
> --- start ---
> test child forked, pid 109653
> -- [ PASS ] -- perf_report :: setup :: prepare the perf.data file
> ## [ PASS ] ## perf_report :: setup SUMMARY
> -- [ SKIP ] -- perf_report :: test_basic :: help message :: testcase skipped
> -- [ PASS ] -- perf_report :: test_basic :: basic execution
> -- [ PASS ] -- perf_report :: test_basic :: number of samples
> -- [ PASS ] -- perf_report :: test_basic :: header
> -- [ PASS ] -- perf_report :: test_basic :: header timestamp
> -- [ PASS ] -- perf_report :: test_basic :: show CPU utilization
> -- [ PASS ] -- perf_report :: test_basic :: pid
> -- [ PASS ] -- perf_report :: test_basic :: non-existing symbol
> -- [ PASS ] -- perf_report :: test_basic :: symbol filter
> ## [ PASS ] ## perf_report :: test_basic SUMMARY
> ---- end(0) ----
> 116: perftool-testsuite_report : Ok
>
> Thanks,
> Namhyung
Thanks, Namhyung, Andi, Ian, this is useful.
Added tests in v6:
https://lore.kernel.org/linux-perf-users/eb8506dfa5998da3b891ba3d36f7ed4a7db4ca2b.1738928210.git.dvyukov@google.com/T/#u
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-07 8:16 ` Dmitry Vyukov
@ 2025-02-07 18:30 ` Andi Kleen
2025-02-10 7:17 ` Dmitry Vyukov
0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2025-02-07 18:30 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
> > I assume it just works, but might be worth checking.
>
> Yes, it seems to just work as one would assume. Things just combine as intended.
Great.
>
> > It was intended to address some of these issues too.
>
> What issue? Latency profiling? I wonder what approach you had in mind?
The problem with gaps in parallelism is usually how things change
over time. If you have e.g. idle periods they tend to look different
in the profile. With the full aggregation you don't see it, but with
a time series it tends to stand out.
But yes, that approach usually only works for large gaps. I guess
yours is better for more fine-grained issues.
And I guess it might not be the most intuitive for less experienced
users.
This is BTW actually a case for using a perf data GUI like hotspot or
vtune which can visualize this better and you can zoom arbitrarily.
Standard perf has only timechart for it, but it's a bit clunky to use
and only shows the reschedules.
> Also (1) user still needs to understand the default profile is wrong,
> (2) be proficient with perf features, (3) manually aggregate lots of
> data (time slicing increases amount of data in the profile X times),
> (4) deal with inaccuracy caused by edge effects (e.g. slice is 1s, but
> program phase changed mid-second).
If you're lucky and the problem is not long-tail, you can use a high
percentage cutoff (--percent-limit) to eliminate most of the data.
Then you just have "topN functions over time" which tends to be quite
readable. One drawback of that approach is that it doesn't show
the "other", but perhaps we'll fix that one day.
But yes, the fact that perf has too many options and is not intuitive,
and that most people miss most of the features, is an inherent problem.
I don't have a good solution for that unfortunately, other than perhaps
better documentation.
>
> But it does open some interesting capabilities in combination with a
> latency profile, e.g. the following shows how parallelism was changing
> over time.
>
> for perf make profile:
Very nice! Looks useful.
Perhaps add that variant to tips.txt too.
>
> perf report -F time,latency,parallelism --time-quantum=1s
>
> # Time Latency Parallelism
> # ............ ........ ...........
> #
> 1795957.0000 1.42% 1
> 1795957.0000 0.07% 2
> 1795957.0000 0.01% 3
> 1795957.0000 0.00% 4
>
> 1795958.0000 4.82% 1
> 1795958.0000 0.11% 2
> 1795958.0000 0.00% 3
> ...
> 1795964.0000 1.76% 2
> 1795964.0000 0.58% 4
> 1795964.0000 0.45% 1
> 1795964.0000 0.23% 10
> 1795964.0000 0.21% 6
>
> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>
> Here it finally started running on more than 1 CPU.
-Andi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-07 18:30 ` Andi Kleen
@ 2025-02-10 7:17 ` Dmitry Vyukov
2025-02-10 17:11 ` Andi Kleen
2025-02-13 9:09 ` Dmitry Vyukov
0 siblings, 2 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-10 7:17 UTC (permalink / raw)
To: Andi Kleen
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Fri, 7 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
>
> > > I assume it just works, but might be worth checking.
> >
> > Yes, it seems to just work as one would assume. Things just combine as intended.
>
> Great.
>
> >
> > > It was intended to address some of these issues too.
> >
> > What issue? Latency profiling? I wonder what approach you had in mind?
>
> The problem with gaps in parallelism is usually how things change
> over time. If you have e.g. idle periods they tend to look different
> in the profile. With the full aggregation you don't see it, but with
> a time series it tends to stand out.
>
> But yes, that approach usually only works for large gaps. I guess
> yours is better for more fine-grained issues.
>
> And I guess it might not be the most intuitive for less experienced
> users.
>
> This is BTW actually a case for using a perf data GUI like hotspot or
> vtune which can visualize this better and you can zoom arbitrarily.
> Standard perf has only timechart for it, but it's a bit clunky to use
> and only shows the reschedules.
>
> > Also (1) user still needs to understand the default profile is wrong,
> > (2) be proficient with perf features, (3) manually aggregate lots of
> > data (time slicing increases amount of data in the profile X times),
> > (4) deal with inaccuracy caused by edge effects (e.g. slice is 1s, but
> > program phase changed mid-second).
>
> If you're lucky and the problem is not long-tail, you can use a high
> percentage cutoff (--percent-limit) to eliminate most of the data.
>
> Then you just have "topN functions over time" which tends to be quite
> readable. One drawback of that approach is that it doesn't show
> the "other", but perhaps we'll fix that one day.
>
> But yes that perf has too many options and is not intuitive and most
> people miss most of the features is an inherent problem. I don't have
> a good solution for that unfortunately, other than perhaps better
> documentation.
I don't think this is a solution :(
I provided lots of rationale in this patch series for making latency
profiling enabled by default for this reason. If we just capture
context switches, then we can show both overhead and latency; even if
we sort by overhead by default, people would still see the latency
column and may start thinking/asking questions.
But this is not happening, so mostly the people on this thread will know about it :)
> > But it does open some interesting capabilities in combination with a
> > latency profile, e.g. the following shows how parallelism was changing
> > over time.
> >
> > for perf make profile:
>
> Very nice! Looks useful.
>
> Perhaps add that variant to tips.txt too.
That's a good idea.
I am waiting for other feedback to not resend the series just because of this.
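Something along these lines for tips.txt, presumably (exact wording to
be polished):

Show how parallelism changes over time:
perf report -F time,latency,parallelism --time-quantum=1s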
> > perf report -F time,latency,parallelism --time-quantum=1s
> >
> > # Time Latency Parallelism
> > # ............ ........ ...........
> > #
> > 1795957.0000 1.42% 1
> > 1795957.0000 0.07% 2
> > 1795957.0000 0.01% 3
> > 1795957.0000 0.00% 4
> >
> > 1795958.0000 4.82% 1
> > 1795958.0000 0.11% 2
> > 1795958.0000 0.00% 3
> > ...
> > 1795964.0000 1.76% 2
> > 1795964.0000 0.58% 4
> > 1795964.0000 0.45% 1
> > 1795964.0000 0.23% 10
> > 1795964.0000 0.21% 6
> >
> > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> >
> > Here it finally started running on more than 1 CPU.
>
>
> -Andi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-10 7:17 ` Dmitry Vyukov
@ 2025-02-10 17:11 ` Andi Kleen
2025-02-13 9:09 ` Dmitry Vyukov
1 sibling, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2025-02-10 17:11 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
> > But yes, the fact that perf has too many options and is not intuitive,
> > and that most people miss most of the features, is an inherent problem.
> > I don't have a good solution for that unfortunately, other than perhaps
> > better documentation.
>
> I don't think this is a solution :(
>
> I provided lots of rationale for making this latency profiling enabled
> by default in this patch series for this reason. If we just capture
> context switches, then we can show both overhead and latency, even if
> we sort by overhead by default, people would still see the latency
> column and may start thinking/asking questions.
> But this is not happening, so mostly people on this thread will know about it :)
Maybe something that could be done is to have some higher-level configurations
for perf report that are easier to understand.
This kind of already exists, e.g. in perf mem report, which is just a
wrapper around perf report with some magic flags.
In this case you could have perf latency report (or maybe some other
syntax like perf report --mode latency).
There are a few others that would make sense, like basic time series
or disabling children aggregation.
Another possibility would be to find a heuristic where perf report
can detect that a latency problem might be there (e.g. varying
usage of CPUs) and suggest it automatically.
-Andi
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-07 7:23 ` Dmitry Vyukov
@ 2025-02-11 1:02 ` Namhyung Kim
2025-02-11 8:30 ` Dmitry Vyukov
2025-02-11 8:42 ` Dmitry Vyukov
0 siblings, 2 replies; 32+ messages in thread
From: Namhyung Kim @ 2025-02-11 1:02 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > > Add a record/report --latency flag that allows capturing and showing
> > > latency-centric profiles rather than the default CPU-consumption-centric
> > > profiles. For latency profiles record captures context switch events,
> > > and report shows Latency as the first column.
> > >
> > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > Cc: Ian Rogers <irogers@google.com>
> > > Cc: linux-perf-users@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > >
> > > ---
> > > Changes in v5:
> > > - added description of --latency flag in Documentation flags
> > > ---
> > > tools/perf/Documentation/perf-record.txt | 4 +++
> > > tools/perf/Documentation/perf-report.txt | 5 +++
> > > tools/perf/builtin-record.c | 20 ++++++++++++
> > > tools/perf/builtin-report.c | 32 +++++++++++++++---
> > > tools/perf/ui/hist.c | 41 +++++++++++++++++++-----
> > > tools/perf/util/hist.h | 1 +
> > > tools/perf/util/sort.c | 33 +++++++++++++++----
> > > tools/perf/util/sort.h | 2 +-
> > > tools/perf/util/symbol_conf.h | 4 ++-
> > > 9 files changed, 122 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > > index 80686d590de24..c7fc1ba265e27 100644
> > > --- a/tools/perf/Documentation/perf-record.txt
> > > +++ b/tools/perf/Documentation/perf-record.txt
> > > @@ -227,6 +227,10 @@ OPTIONS
> > > '--filter' exists, the new filter expression will be combined with
> > > them by '&&'.
> > >
> > > +--latency::
> > > + Enable data collection for latency profiling.
> > > + Use perf report --latency for latency-centric profile.
> > > +
> > > -a::
> > > --all-cpus::
> > > System-wide collection from all CPUs (default if no target is specified).
> > > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > > index 87f8645194062..66794131aec48 100644
> > > --- a/tools/perf/Documentation/perf-report.txt
> > > +++ b/tools/perf/Documentation/perf-report.txt
> > > @@ -68,6 +68,11 @@ OPTIONS
> > > --hide-unresolved::
> > > Only display entries resolved to a symbol.
> > >
> > > +--latency::
> > > + Show latency-centric profile rather than the default
> > > + CPU-consumption-centric profile
> > > + (requires perf record --latency flag).
> > > +
> > > -s::
> > > --sort=::
> > > Sort histogram entries by given key(s) - multiple keys can be specified
> > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > index 5db1aedf48df9..e219639ac401b 100644
> > > --- a/tools/perf/builtin-record.c
> > > +++ b/tools/perf/builtin-record.c
> > > @@ -161,6 +161,7 @@ struct record {
> > > struct evlist *sb_evlist;
> > > pthread_t thread_id;
> > > int realtime_prio;
> > > + bool latency;
> > > bool switch_output_event_set;
> > > bool no_buildid;
> > > bool no_buildid_set;
> > > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > > parse_events_option),
> > > OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > > "event filter", parse_filter),
> > > + OPT_BOOLEAN(0, "latency", &record.latency,
> > > + "Enable data collection for latency profiling.\n"
> > > + "\t\t\t Use perf report --latency for latency-centric profile."),
> > > OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > > NULL, "don't record events from perf itself",
> > > exclude_perf),
> > > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> > >
> > > }
> > >
> > > + if (record.latency) {
> > > + /*
> > > + * There is no fundamental reason why latency profiling
> > > + * can't work for system-wide mode, but exact semantics
> > > + * and details are to be defined.
> > > + * See the following thread for details:
> > > + * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > > + */
> > > + if (record.opts.target.system_wide) {
> > > + pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > > + err = -EINVAL;
> > > + goto out_opts;
> > > + }
> > > + record.opts.record_switch_events = true;
> > > + }
> > > +
> > > if (rec->buildid_mmap) {
> > > if (!perf_can_record_build_id()) {
> > > pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > index 2a19abdc869a1..69de6dbefecfa 100644
> > > --- a/tools/perf/builtin-report.c
> > > +++ b/tools/perf/builtin-report.c
> > > @@ -112,6 +112,8 @@ struct report {
> > > u64 nr_entries;
> > > u64 queue_size;
> > > u64 total_cycles;
> > > + u64 total_samples;
> > > + u64 singlethreaded_samples;
> > > int socket_filter;
> > > DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > > struct branch_type_stat brtype_stat;
> > > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > > &rep->total_cycles, evsel);
> > > }
> > >
> > > + rep->total_samples++;
> > > + if (al.parallelism == 1)
> > > + rep->singlethreaded_samples++;
> > > +
> > > ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > > if (ret < 0)
> > > pr_debug("problem adding hist entry, skipping event\n");
> > > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > > return ret;
> > > }
> > >
> > > + /* Don't show Latency column for non-parallel profiles by default. */
> > > + if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > > + !symbol_conf.prefer_latency)
> > > + perf_hpp__cancel_latency();
> > > +
> > > evlist__check_mem_load_aux(session->evlist);
> > >
> > > if (rep->stats_mode)
> > > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > > "Disable raw trace ordering"),
> > > OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > > "Do not display empty (or dummy) events in the output"),
> > > + OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > > + "Show latency-centric profile rather than the default\n"
> > > + "\t\t\t CPU-consumption-centric profile\n"
> > > + "\t\t\t (requires perf record --latency flag)."),
> > > OPT_END()
> > > };
> > > struct perf_data data = {
> > > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > > symbol_conf.annotate_data_sample = true;
> > > }
> > >
> > > + symbol_conf.enable_latency = true;
> > > if (report.disable_order || !perf_session__has_switch_events(session)) {
> > > if (symbol_conf.parallelism_list_str ||
> > > - (sort_order && strstr(sort_order, "parallelism")) ||
> > > - (field_order && strstr(field_order, "parallelism"))) {
> > > + symbol_conf.prefer_latency ||
> > > + (sort_order && (strstr(sort_order, "latency") ||
> > > + strstr(sort_order, "parallelism"))) ||
> > > + (field_order && (strstr(field_order, "latency") ||
> > > + strstr(field_order, "parallelism")))) {
> > > if (report.disable_order)
> > > - ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > > + ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > > else
> > > - ui__error("Use of parallelism requires --switch-events during record.\n");
> > > + ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > > return -1;
> > > }
> > > + /*
> > > + * If user did not ask for anything related to
> > > + * latency/parallelism explicitly, just don't show it.
> > > + */
> > > + symbol_conf.enable_latency = false;
> > > }
> > >
> > > if (sort_order && strstr(sort_order, "ipc")) {
> > > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > > index 6de6309595f9e..b453f8eb579cc 100644
> > > --- a/tools/perf/ui/hist.c
> > > +++ b/tools/perf/ui/hist.c
> > > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > > return;
> > >
> > > if (symbol_conf.cumulate_callchain) {
> > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > > + /* Use idempotent addition to avoid more complex logic. */
> > > + if (symbol_conf.prefer_latency)
> > > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > > + if (symbol_conf.enable_latency)
> > > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > > }
> > >
> > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > > + if (symbol_conf.prefer_latency)
> > > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > > + if (symbol_conf.enable_latency)
> > > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > >
> > > if (symbol_conf.show_cpu_utilization) {
> > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> > >
> > > if (perf_guest) {
> > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > > }
> > > }
> > >
> > > if (symbol_conf.show_nr_samples)
> > > - hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > > + hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> > >
> > > if (symbol_conf.show_total_period)
> > > - hpp_dimension__add_output(PERF_HPP__PERIOD);
> > > + hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > > }
> > >
> > > void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > > }
> > > }
> > >
> > > +void perf_hpp__cancel_latency(void)
> > > +{
> > > + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > > +
> > > + if (is_strict_order(field_order) || is_strict_order(sort_order))
> > > + return;
> > > +
> > > + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > > + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > > +
> > > + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > > + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > > + perf_hpp__column_unregister(fmt);
> > > + }
> > > +}
> > > +
> > > void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > > {
> > > struct perf_hpp_fmt *fmt;
> > > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > > index 91159f16c60b2..29d4c7a3d1747 100644
> > > --- a/tools/perf/util/hist.h
> > > +++ b/tools/perf/util/hist.h
> > > @@ -582,6 +582,7 @@ enum {
> > >
> > > void perf_hpp__init(void);
> > > void perf_hpp__cancel_cumulate(void);
> > > +void perf_hpp__cancel_latency(void);
> > > void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > > void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > > void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > index bc4c3acfe7552..2b6023de7a53a 100644
> > > --- a/tools/perf/util/sort.c
> > > +++ b/tools/perf/util/sort.c
> > > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > > const char *name;
> > > struct perf_hpp_fmt *fmt;
> > > int taken;
> > > + int was_taken;
> > > };
> > >
> > > #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > > return -1;
> > >
> > > hd->taken = 1;
> > > + hd->was_taken = 1;
> > > perf_hpp_list__register_sort_field(list, fmt);
> > > return 0;
> > > }
> > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > return 0;
> > > }
> > >
> > > -int hpp_dimension__add_output(unsigned col)
> > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > {
> > > + struct hpp_dimension *hd;
> > > +
> > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > + hd = &hpp_sort_dimensions[col];
> > > + if (implicit && !hd->was_taken)
> > > + return 0;
> >
> > I don't think you need these implicit and was_taken things.
> > Just removing from the sort list when it's unregistered seems to work.
> >
> > ---8<---
> > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > {
> > list_del_init(&format->list);
> > + list_del_init(&format->sort_list);
> > fmt_free(format);
> > }
> >
> > ---8<---
>
> It merely suppresses the warning, but does not work the same way. See
> this for details:
> https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
But I think it's better to pass the --latency option rather than adding it
to the -s option. If you really want to have specific output fields, then
please use -F latency,sym instead.
Also I've realized that it should add one sort key in setup_overhead()
to support hierarchy mode properly. Something like this?
Thanks,
Namhyung
---8<---
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
return keys;
if (symbol_conf.prefer_latency) {
- keys = prefix_if_not_in("overhead", keys);
- keys = prefix_if_not_in("latency", keys);
- if (symbol_conf.cumulate_callchain) {
- keys = prefix_if_not_in("overhead_children", keys);
+ if (symbol_conf.cumulate_callchain)
keys = prefix_if_not_in("latency_children", keys);
- }
- } else if (!keys || (!strstr(keys, "overhead") &&
- !strstr(keys, "latency"))) {
- if (symbol_conf.enable_latency)
+ else
keys = prefix_if_not_in("latency", keys);
- keys = prefix_if_not_in("overhead", keys);
- if (symbol_conf.cumulate_callchain) {
- if (symbol_conf.enable_latency)
- keys = prefix_if_not_in("latency_children", keys);
+ } else {
+ if (symbol_conf.cumulate_callchain)
keys = prefix_if_not_in("overhead_children", keys);
- }
+ else
+ keys = prefix_if_not_in("overhead", keys);
}
return keys;
---8<---
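With that change, the hierarchy output should be easy to eyeball with
something like:

$ perf report --latency --hierarchy --stdio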
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-11 1:02 ` Namhyung Kim
@ 2025-02-11 8:30 ` Dmitry Vyukov
2025-02-11 8:42 ` Dmitry Vyukov
1 sibling, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-11 8:30 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > > > Add a record/report --latency flag that allows capturing and showing
> > > > latency-centric profiles rather than the default CPU-consumption-centric
> > > > profiles. For latency profiles record captures context switch events,
> > > > and report shows Latency as the first column.
> > > >
> > > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > > Cc: Ian Rogers <irogers@google.com>
> > > > Cc: linux-perf-users@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > >
> > > > ---
> > > > Changes in v5:
> > > > - added description of --latency flag in Documentation flags
> > > > ---
> > > > tools/perf/Documentation/perf-record.txt | 4 +++
> > > > tools/perf/Documentation/perf-report.txt | 5 +++
> > > > tools/perf/builtin-record.c | 20 ++++++++++++
> > > > tools/perf/builtin-report.c | 32 +++++++++++++++---
> > > > tools/perf/ui/hist.c | 41 +++++++++++++++++++-----
> > > > tools/perf/util/hist.h | 1 +
> > > > tools/perf/util/sort.c | 33 +++++++++++++++----
> > > > tools/perf/util/sort.h | 2 +-
> > > > tools/perf/util/symbol_conf.h | 4 ++-
> > > > 9 files changed, 122 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > > > index 80686d590de24..c7fc1ba265e27 100644
> > > > --- a/tools/perf/Documentation/perf-record.txt
> > > > +++ b/tools/perf/Documentation/perf-record.txt
> > > > @@ -227,6 +227,10 @@ OPTIONS
> > > > '--filter' exists, the new filter expression will be combined with
> > > > them by '&&'.
> > > >
> > > > +--latency::
> > > > + Enable data collection for latency profiling.
> > > > + Use perf report --latency for latency-centric profile.
> > > > +
> > > > -a::
> > > > --all-cpus::
> > > > System-wide collection from all CPUs (default if no target is specified).
> > > > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > > > index 87f8645194062..66794131aec48 100644
> > > > --- a/tools/perf/Documentation/perf-report.txt
> > > > +++ b/tools/perf/Documentation/perf-report.txt
> > > > @@ -68,6 +68,11 @@ OPTIONS
> > > > --hide-unresolved::
> > > > Only display entries resolved to a symbol.
> > > >
> > > > +--latency::
> > > > + Show latency-centric profile rather than the default
> > > > + CPU-consumption-centric profile
> > > > + (requires perf record --latency flag).
> > > > +
> > > > -s::
> > > > --sort=::
> > > > Sort histogram entries by given key(s) - multiple keys can be specified
> > > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > > index 5db1aedf48df9..e219639ac401b 100644
> > > > --- a/tools/perf/builtin-record.c
> > > > +++ b/tools/perf/builtin-record.c
> > > > @@ -161,6 +161,7 @@ struct record {
> > > > struct evlist *sb_evlist;
> > > > pthread_t thread_id;
> > > > int realtime_prio;
> > > > + bool latency;
> > > > bool switch_output_event_set;
> > > > bool no_buildid;
> > > > bool no_buildid_set;
> > > > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > > > parse_events_option),
> > > > OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > > > "event filter", parse_filter),
> > > > + OPT_BOOLEAN(0, "latency", &record.latency,
> > > > + "Enable data collection for latency profiling.\n"
> > > > + "\t\t\t Use perf report --latency for latency-centric profile."),
> > > > OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > > > NULL, "don't record events from perf itself",
> > > > exclude_perf),
> > > > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> > > >
> > > > }
> > > >
> > > > + if (record.latency) {
> > > > + /*
> > > > + * There is no fundamental reason why latency profiling
> > > > + * can't work for system-wide mode, but exact semantics
> > > > + * and details are to be defined.
> > > > + * See the following thread for details:
> > > > + * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > > > + */
> > > > + if (record.opts.target.system_wide) {
> > > > + pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > > > + err = -EINVAL;
> > > > + goto out_opts;
> > > > + }
> > > > + record.opts.record_switch_events = true;
> > > > + }
> > > > +
> > > > if (rec->buildid_mmap) {
> > > > if (!perf_can_record_build_id()) {
> > > > pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > > index 2a19abdc869a1..69de6dbefecfa 100644
> > > > --- a/tools/perf/builtin-report.c
> > > > +++ b/tools/perf/builtin-report.c
> > > > @@ -112,6 +112,8 @@ struct report {
> > > > u64 nr_entries;
> > > > u64 queue_size;
> > > > u64 total_cycles;
> > > > + u64 total_samples;
> > > > + u64 singlethreaded_samples;
> > > > int socket_filter;
> > > > DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > > > struct branch_type_stat brtype_stat;
> > > > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > > > &rep->total_cycles, evsel);
> > > > }
> > > >
> > > > + rep->total_samples++;
> > > > + if (al.parallelism == 1)
> > > > + rep->singlethreaded_samples++;
> > > > +
> > > > ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > > > if (ret < 0)
> > > > pr_debug("problem adding hist entry, skipping event\n");
> > > > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > > > return ret;
> > > > }
> > > >
> > > > + /* Don't show Latency column for non-parallel profiles by default. */
> > > > + if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > > > + !symbol_conf.prefer_latency)
> > > > + perf_hpp__cancel_latency();
> > > > +
> > > > evlist__check_mem_load_aux(session->evlist);
> > > >
> > > > if (rep->stats_mode)
> > > > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > > > "Disable raw trace ordering"),
> > > > OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > > > "Do not display empty (or dummy) events in the output"),
> > > > + OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > > > + "Show latency-centric profile rather than the default\n"
> > > > + "\t\t\t CPU-consumption-centric profile\n"
> > > > + "\t\t\t (requires perf record --latency flag)."),
> > > > OPT_END()
> > > > };
> > > > struct perf_data data = {
> > > > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > > > symbol_conf.annotate_data_sample = true;
> > > > }
> > > >
> > > > + symbol_conf.enable_latency = true;
> > > > if (report.disable_order || !perf_session__has_switch_events(session)) {
> > > > if (symbol_conf.parallelism_list_str ||
> > > > - (sort_order && strstr(sort_order, "parallelism")) ||
> > > > - (field_order && strstr(field_order, "parallelism"))) {
> > > > + symbol_conf.prefer_latency ||
> > > > + (sort_order && (strstr(sort_order, "latency") ||
> > > > + strstr(sort_order, "parallelism"))) ||
> > > > + (field_order && (strstr(field_order, "latency") ||
> > > > + strstr(field_order, "parallelism")))) {
> > > > if (report.disable_order)
> > > > - ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > > > + ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > > > else
> > > > - ui__error("Use of parallelism requires --switch-events during record.\n");
> > > > + ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > > > return -1;
> > > > }
> > > > + /*
> > > > + * If user did not ask for anything related to
> > > > + * latency/parallelism explicitly, just don't show it.
> > > > + */
> > > > + symbol_conf.enable_latency = false;
> > > > }
> > > >
> > > > if (sort_order && strstr(sort_order, "ipc")) {
> > > > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > > > index 6de6309595f9e..b453f8eb579cc 100644
> > > > --- a/tools/perf/ui/hist.c
> > > > +++ b/tools/perf/ui/hist.c
> > > > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > > > return;
> > > >
> > > > if (symbol_conf.cumulate_callchain) {
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > > > + /* Use idempotent addition to avoid more complex logic. */
> > > > + if (symbol_conf.prefer_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > > > + if (symbol_conf.enable_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > > perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > > > }
> > > >
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > > > + if (symbol_conf.prefer_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > > > + if (symbol_conf.enable_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > >
> > > > if (symbol_conf.show_cpu_utilization) {
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> > > >
> > > > if (perf_guest) {
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > > > }
> > > > }
> > > >
> > > > if (symbol_conf.show_nr_samples)
> > > > - hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > > > + hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> > > >
> > > > if (symbol_conf.show_total_period)
> > > > - hpp_dimension__add_output(PERF_HPP__PERIOD);
> > > > + hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > > > }
> > > >
> > > > void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > > > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > > > }
> > > > }
> > > >
> > > > +void perf_hpp__cancel_latency(void)
> > > > +{
> > > > + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > > > +
> > > > + if (is_strict_order(field_order) || is_strict_order(sort_order))
> > > > + return;
> > > > +
> > > > + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > > > + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > > > +
> > > > + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > > > + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > > > + perf_hpp__column_unregister(fmt);
> > > > + }
> > > > +}
> > > > +
> > > > void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > > > {
> > > > struct perf_hpp_fmt *fmt;
> > > > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > > > index 91159f16c60b2..29d4c7a3d1747 100644
> > > > --- a/tools/perf/util/hist.h
> > > > +++ b/tools/perf/util/hist.h
> > > > @@ -582,6 +582,7 @@ enum {
> > > >
> > > > void perf_hpp__init(void);
> > > > void perf_hpp__cancel_cumulate(void);
> > > > +void perf_hpp__cancel_latency(void);
> > > > void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > > > void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > > > void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > index bc4c3acfe7552..2b6023de7a53a 100644
> > > > --- a/tools/perf/util/sort.c
> > > > +++ b/tools/perf/util/sort.c
> > > > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > > > const char *name;
> > > > struct perf_hpp_fmt *fmt;
> > > > int taken;
> > > > + int was_taken;
> > > > };
> > > >
> > > > #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > > > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > > > return -1;
> > > >
> > > > hd->taken = 1;
> > > > + hd->was_taken = 1;
> > > > perf_hpp_list__register_sort_field(list, fmt);
> > > > return 0;
> > > > }
> > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > return 0;
> > > > }
> > > >
> > > > -int hpp_dimension__add_output(unsigned col)
> > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > {
> > > > + struct hpp_dimension *hd;
> > > > +
> > > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > + hd = &hpp_sort_dimensions[col];
> > > > + if (implicit && !hd->was_taken)
> > > > + return 0;
> > >
> > > I don't think you need these implicit and was_taken things.
> > > Just removing from the sort list when it's unregistered seems to work.
> > >
> > > ---8<---
> > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > {
> > > list_del_init(&format->list);
> > > + list_del_init(&format->sort_list);
> > > fmt_free(format);
> > > }
> > >
> > > ---8<---
> >
> > It merely suppresses the warning, but does not work the same way. See
> > this for details:
> > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
>
> But I think it's better to pass --latency option rather than adding it
> to -s option. If you really want to have specific output fields, then
> please use -F latency,sym instead.
-F does not work in all cases, e.g. the command from tips.txt with -F
instead of --sort:
perf report --hierarchy -F latency,parallelism,comm,symbol
Error: --hierarchy and --fields options cannot be used together
I also don't understand how you see it working. Please elaborate.
--latency is no more than a handy preset for the default case, so that
a user does not need to learn other flags, field names, etc. It's
just:
perf report --latency
Then there are gazillions of custom permutations of various flags with
sort/field orders. It's impossible to second guess what a user means
in all of these cases and simplify with a magic --latency flag. For
example, possible commands may be different just by field order:
"latency", or "latency,overhead", or "overhead,latency", all of the
are "latency" in some sense, but which one of these 3 a user wants in
a particular case is impossible to infer.
Also, I don't see much value in using the --latency flag instead of the
latency sort field. It's just a slightly longer command and nothing more.
Am I missing something? If a user is proficient with perf and wants a
specific custom configuration, I would let them specify exactly what they
want rather than try to second-guess their intentions.
> Also I've realized that it should add one sort key in setup_overhead()
> to support hierarchy mode properly. Something like this?
>
> Thanks,
> Namhyung
>
>
> ---8<---
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> return keys;
>
> if (symbol_conf.prefer_latency) {
> - keys = prefix_if_not_in("overhead", keys);
> - keys = prefix_if_not_in("latency", keys);
> - if (symbol_conf.cumulate_callchain) {
> - keys = prefix_if_not_in("overhead_children", keys);
> + if (symbol_conf.cumulate_callchain)
> keys = prefix_if_not_in("latency_children", keys);
> - }
> - } else if (!keys || (!strstr(keys, "overhead") &&
> - !strstr(keys, "latency"))) {
> - if (symbol_conf.enable_latency)
> + else
> keys = prefix_if_not_in("latency", keys);
> - keys = prefix_if_not_in("overhead", keys);
> - if (symbol_conf.cumulate_callchain) {
> - if (symbol_conf.enable_latency)
> - keys = prefix_if_not_in("latency_children", keys);
> + } else {
> + if (symbol_conf.cumulate_callchain)
> keys = prefix_if_not_in("overhead_children", keys);
> - }
> + else
> + keys = prefix_if_not_in("overhead", keys);
> }
>
> return keys;
> ---8<---
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-11 1:02 ` Namhyung Kim
2025-02-11 8:30 ` Dmitry Vyukov
@ 2025-02-11 8:42 ` Dmitry Vyukov
2025-02-11 17:42 ` Namhyung Kim
1 sibling, 1 reply; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-11 8:42 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Wed, Feb 05, 2025 at 05:27:45PM +0100, Dmitry Vyukov wrote:
> > > > Add record/report --latency flag that allows to capture and show
> > > > latency-centric profiles rather than the default CPU-consumption-centric
> > > > profiles. For latency profiles record captures context switch events,
> > > > and report shows Latency as the first column.
> > > >
> > > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > > Cc: Ian Rogers <irogers@google.com>
> > > > Cc: linux-perf-users@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > >
> > > > ---
> > > > Changes in v5:
> > > > - added description of --latency flag in Documentation flags
> > > > ---
> > > > tools/perf/Documentation/perf-record.txt | 4 +++
> > > > tools/perf/Documentation/perf-report.txt | 5 +++
> > > > tools/perf/builtin-record.c | 20 ++++++++++++
> > > > tools/perf/builtin-report.c | 32 +++++++++++++++---
> > > > tools/perf/ui/hist.c | 41 +++++++++++++++++++-----
> > > > tools/perf/util/hist.h | 1 +
> > > > tools/perf/util/sort.c | 33 +++++++++++++++----
> > > > tools/perf/util/sort.h | 2 +-
> > > > tools/perf/util/symbol_conf.h | 4 ++-
> > > > 9 files changed, 122 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > > > index 80686d590de24..c7fc1ba265e27 100644
> > > > --- a/tools/perf/Documentation/perf-record.txt
> > > > +++ b/tools/perf/Documentation/perf-record.txt
> > > > @@ -227,6 +227,10 @@ OPTIONS
> > > > '--filter' exists, the new filter expression will be combined with
> > > > them by '&&'.
> > > >
> > > > +--latency::
> > > > + Enable data collection for latency profiling.
> > > > + Use perf report --latency for latency-centric profile.
> > > > +
> > > > -a::
> > > > --all-cpus::
> > > > System-wide collection from all CPUs (default if no target is specified).
> > > > diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> > > > index 87f8645194062..66794131aec48 100644
> > > > --- a/tools/perf/Documentation/perf-report.txt
> > > > +++ b/tools/perf/Documentation/perf-report.txt
> > > > @@ -68,6 +68,11 @@ OPTIONS
> > > > --hide-unresolved::
> > > > Only display entries resolved to a symbol.
> > > >
> > > > +--latency::
> > > > + Show latency-centric profile rather than the default
> > > > + CPU-consumption-centric profile
> > > > + (requires perf record --latency flag).
> > > > +
> > > > -s::
> > > > --sort=::
> > > > Sort histogram entries by given key(s) - multiple keys can be specified
> > > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > > index 5db1aedf48df9..e219639ac401b 100644
> > > > --- a/tools/perf/builtin-record.c
> > > > +++ b/tools/perf/builtin-record.c
> > > > @@ -161,6 +161,7 @@ struct record {
> > > > struct evlist *sb_evlist;
> > > > pthread_t thread_id;
> > > > int realtime_prio;
> > > > + bool latency;
> > > > bool switch_output_event_set;
> > > > bool no_buildid;
> > > > bool no_buildid_set;
> > > > @@ -3371,6 +3372,9 @@ static struct option __record_options[] = {
> > > > parse_events_option),
> > > > OPT_CALLBACK(0, "filter", &record.evlist, "filter",
> > > > "event filter", parse_filter),
> > > > + OPT_BOOLEAN(0, "latency", &record.latency,
> > > > + "Enable data collection for latency profiling.\n"
> > > > + "\t\t\t Use perf report --latency for latency-centric profile."),
> > > > OPT_CALLBACK_NOOPT(0, "exclude-perf", &record.evlist,
> > > > NULL, "don't record events from perf itself",
> > > > exclude_perf),
> > > > @@ -4017,6 +4021,22 @@ int cmd_record(int argc, const char **argv)
> > > >
> > > > }
> > > >
> > > > + if (record.latency) {
> > > > + /*
> > > > + * There is no fundamental reason why latency profiling
> > > > + * can't work for system-wide mode, but exact semantics
> > > > + * and details are to be defined.
> > > > + * See the following thread for details:
> > > > + * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
> > > > + */
> > > > + if (record.opts.target.system_wide) {
> > > > + pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
> > > > + err = -EINVAL;
> > > > + goto out_opts;
> > > > + }
> > > > + record.opts.record_switch_events = true;
> > > > + }
> > > > +
> > > > if (rec->buildid_mmap) {
> > > > if (!perf_can_record_build_id()) {
> > > > pr_err("Failed: no support to record build id in mmap events, update your kernel.\n");
> > > > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > > > index 2a19abdc869a1..69de6dbefecfa 100644
> > > > --- a/tools/perf/builtin-report.c
> > > > +++ b/tools/perf/builtin-report.c
> > > > @@ -112,6 +112,8 @@ struct report {
> > > > u64 nr_entries;
> > > > u64 queue_size;
> > > > u64 total_cycles;
> > > > + u64 total_samples;
> > > > + u64 singlethreaded_samples;
> > > > int socket_filter;
> > > > DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
> > > > struct branch_type_stat brtype_stat;
> > > > @@ -331,6 +333,10 @@ static int process_sample_event(const struct perf_tool *tool,
> > > > &rep->total_cycles, evsel);
> > > > }
> > > >
> > > > + rep->total_samples++;
> > > > + if (al.parallelism == 1)
> > > > + rep->singlethreaded_samples++;
> > > > +
> > > > ret = hist_entry_iter__add(&iter, &al, rep->max_stack, rep);
> > > > if (ret < 0)
> > > > pr_debug("problem adding hist entry, skipping event\n");
> > > > @@ -1079,6 +1085,11 @@ static int __cmd_report(struct report *rep)
> > > > return ret;
> > > > }
> > > >
> > > > + /* Don't show Latency column for non-parallel profiles by default. */
> > > > + if (rep->singlethreaded_samples * 100 / rep->total_samples >= 99 &&
> > > > + !symbol_conf.prefer_latency)
> > > > + perf_hpp__cancel_latency();
> > > > +
> > > > evlist__check_mem_load_aux(session->evlist);
> > > >
> > > > if (rep->stats_mode)
> > > > @@ -1468,6 +1479,10 @@ int cmd_report(int argc, const char **argv)
> > > > "Disable raw trace ordering"),
> > > > OPT_BOOLEAN(0, "skip-empty", &report.skip_empty,
> > > > "Do not display empty (or dummy) events in the output"),
> > > > + OPT_BOOLEAN(0, "latency", &symbol_conf.prefer_latency,
> > > > + "Show latency-centric profile rather than the default\n"
> > > > + "\t\t\t CPU-consumption-centric profile\n"
> > > > + "\t\t\t (requires perf record --latency flag)."),
> > > > OPT_END()
> > > > };
> > > > struct perf_data data = {
> > > > @@ -1722,16 +1737,25 @@ int cmd_report(int argc, const char **argv)
> > > > symbol_conf.annotate_data_sample = true;
> > > > }
> > > >
> > > > + symbol_conf.enable_latency = true;
> > > > if (report.disable_order || !perf_session__has_switch_events(session)) {
> > > > if (symbol_conf.parallelism_list_str ||
> > > > - (sort_order && strstr(sort_order, "parallelism")) ||
> > > > - (field_order && strstr(field_order, "parallelism"))) {
> > > > + symbol_conf.prefer_latency ||
> > > > + (sort_order && (strstr(sort_order, "latency") ||
> > > > + strstr(sort_order, "parallelism"))) ||
> > > > + (field_order && (strstr(field_order, "latency") ||
> > > > + strstr(field_order, "parallelism")))) {
> > > > if (report.disable_order)
> > > > - ui__error("Use of parallelism is incompatible with --disable-order.\n");
> > > > + ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
> > > > else
> > > > - ui__error("Use of parallelism requires --switch-events during record.\n");
> > > > + ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
> > > > return -1;
> > > > }
> > > > + /*
> > > > + * If user did not ask for anything related to
> > > > + * latency/parallelism explicitly, just don't show it.
> > > > + */
> > > > + symbol_conf.enable_latency = false;
> > > > }
> > > >
> > > > if (sort_order && strstr(sort_order, "ipc")) {
> > > > diff --git a/tools/perf/ui/hist.c b/tools/perf/ui/hist.c
> > > > index 6de6309595f9e..b453f8eb579cc 100644
> > > > --- a/tools/perf/ui/hist.c
> > > > +++ b/tools/perf/ui/hist.c
> > > > @@ -632,27 +632,36 @@ void perf_hpp__init(void)
> > > > return;
> > > >
> > > > if (symbol_conf.cumulate_callchain) {
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC);
> > > > + /* Use idempotent addition to avoid more complex logic. */
> > > > + if (symbol_conf.prefer_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_ACC, true);
> > > > + if (symbol_conf.enable_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY_ACC, true);
> > > > perf_hpp__format[PERF_HPP__OVERHEAD].name = "Self";
> > > > }
> > > >
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD);
> > > > + if (symbol_conf.prefer_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD, true);
> > > > + if (symbol_conf.enable_latency)
> > > > + hpp_dimension__add_output(PERF_HPP__LATENCY, true);
> > > >
> > > > if (symbol_conf.show_cpu_utilization) {
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS);
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_US);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_SYS, false);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_US, false);
> > > >
> > > > if (perf_guest) {
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS);
> > > > - hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_SYS, false);
> > > > + hpp_dimension__add_output(PERF_HPP__OVERHEAD_GUEST_US, false);
> > > > }
> > > > }
> > > >
> > > > if (symbol_conf.show_nr_samples)
> > > > - hpp_dimension__add_output(PERF_HPP__SAMPLES);
> > > > + hpp_dimension__add_output(PERF_HPP__SAMPLES, false);
> > > >
> > > > if (symbol_conf.show_total_period)
> > > > - hpp_dimension__add_output(PERF_HPP__PERIOD);
> > > > + hpp_dimension__add_output(PERF_HPP__PERIOD, false);
> > > > }
> > > >
> > > > void perf_hpp_list__column_register(struct perf_hpp_list *list,
> > > > @@ -701,6 +710,22 @@ void perf_hpp__cancel_cumulate(void)
> > > > }
> > > > }
> > > >
> > > > +void perf_hpp__cancel_latency(void)
> > > > +{
> > > > + struct perf_hpp_fmt *fmt, *lat, *acc, *tmp;
> > > > +
> > > > + if (is_strict_order(field_order) || is_strict_order(sort_order))
> > > > + return;
> > > > +
> > > > + lat = &perf_hpp__format[PERF_HPP__LATENCY];
> > > > + acc = &perf_hpp__format[PERF_HPP__LATENCY_ACC];
> > > > +
> > > > + perf_hpp_list__for_each_format_safe(&perf_hpp_list, fmt, tmp) {
> > > > + if (fmt_equal(lat, fmt) || fmt_equal(acc, fmt))
> > > > + perf_hpp__column_unregister(fmt);
> > > > + }
> > > > +}
> > > > +
> > > > void perf_hpp__setup_output_field(struct perf_hpp_list *list)
> > > > {
> > > > struct perf_hpp_fmt *fmt;
> > > > diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> > > > index 91159f16c60b2..29d4c7a3d1747 100644
> > > > --- a/tools/perf/util/hist.h
> > > > +++ b/tools/perf/util/hist.h
> > > > @@ -582,6 +582,7 @@ enum {
> > > >
> > > > void perf_hpp__init(void);
> > > > void perf_hpp__cancel_cumulate(void);
> > > > +void perf_hpp__cancel_latency(void);
> > > > void perf_hpp__setup_output_field(struct perf_hpp_list *list);
> > > > void perf_hpp__reset_output_field(struct perf_hpp_list *list);
> > > > void perf_hpp__append_sort_keys(struct perf_hpp_list *list);
> > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > index bc4c3acfe7552..2b6023de7a53a 100644
> > > > --- a/tools/perf/util/sort.c
> > > > +++ b/tools/perf/util/sort.c
> > > > @@ -2622,6 +2622,7 @@ struct hpp_dimension {
> > > > const char *name;
> > > > struct perf_hpp_fmt *fmt;
> > > > int taken;
> > > > + int was_taken;
> > > > };
> > > >
> > > > #define DIM(d, n) { .name = n, .fmt = &perf_hpp__format[d], }
> > > > @@ -3513,6 +3514,7 @@ static int __hpp_dimension__add(struct hpp_dimension *hd,
> > > > return -1;
> > > >
> > > > hd->taken = 1;
> > > > + hd->was_taken = 1;
> > > > perf_hpp_list__register_sort_field(list, fmt);
> > > > return 0;
> > > > }
> > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > return 0;
> > > > }
> > > >
> > > > -int hpp_dimension__add_output(unsigned col)
> > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > {
> > > > + struct hpp_dimension *hd;
> > > > +
> > > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > + hd = &hpp_sort_dimensions[col];
> > > > + if (implicit && !hd->was_taken)
> > > > + return 0;
> > >
> > > I don't think you need these implicit and was_taken things.
> > > Just removing from the sort list when it's unregistered seems to work.
> > >
> > > ---8<---
> > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > {
> > > list_del_init(&format->list);
> > > + list_del_init(&format->sort_list);
> > > fmt_free(format);
> > > }
> > >
> > > ---8<---
> >
> > It merely suppresses the warning, but does not work the same way. See
> > this for details:
> > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
>
> But I think it's better to pass --latency option rather than adding it
> to -s option. If you really want to have specific output fields, then
> please use -F latency,sym instead.
>
> Also I've realized that it should add one sort key in setup_overhead()
> to support hierarchy mode properly. Something like this?
>
> Thanks,
> Namhyung
>
>
> ---8<---
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> return keys;
>
> if (symbol_conf.prefer_latency) {
> - keys = prefix_if_not_in("overhead", keys);
> - keys = prefix_if_not_in("latency", keys);
> - if (symbol_conf.cumulate_callchain) {
> - keys = prefix_if_not_in("overhead_children", keys);
> + if (symbol_conf.cumulate_callchain)
> keys = prefix_if_not_in("latency_children", keys);
> - }
> - } else if (!keys || (!strstr(keys, "overhead") &&
> - !strstr(keys, "latency"))) {
> - if (symbol_conf.enable_latency)
> + else
> keys = prefix_if_not_in("latency", keys);
> - keys = prefix_if_not_in("overhead", keys);
> - if (symbol_conf.cumulate_callchain) {
> - if (symbol_conf.enable_latency)
> - keys = prefix_if_not_in("latency_children", keys);
> + } else {
> + if (symbol_conf.cumulate_callchain)
> keys = prefix_if_not_in("overhead_children", keys);
> - }
> + else
> + keys = prefix_if_not_in("overhead", keys);
> }
>
> return keys;
Have I decoded the patch correctly?
if (symbol_conf.prefer_latency) {
if (symbol_conf.cumulate_callchain)
keys = prefix_if_not_in("latency_children", keys);
else
keys = prefix_if_not_in("latency", keys);
} else {
if (symbol_conf.cumulate_callchain)
keys = prefix_if_not_in("overhead_children", keys);
else
keys = prefix_if_not_in("overhead", keys);
}
If I decoded the patch correctly, it's not what we want.
For the default prefer_latency case we also want to add overhead, that
was intentional for the --latency preset. It does not harm, and allows
to see/compare differences in latency and overhead.
Again, if a user wants something custom, there is no way to second
guess all possible intentions. For non-default cases, we just let the
user say what exactly they want, and we will follow that.
"latency" should be added even if cumulate_callchain.
For the !prefer_latency case, we don't want to mess with
overhead/latency fields if the user specified any of them explicitly.
Otherwise this convenience part gets in the user's way and does not
allow them to do what they want. User says "I want X" and perf says
"screw you, I will give you Y instead, and won't allow you to possibly
do X".
And see above: -F does not work with --hierarchy, so this part is unskippable.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-11 8:42 ` Dmitry Vyukov
@ 2025-02-11 17:42 ` Namhyung Kim
2025-02-11 20:23 ` Dmitry Vyukov
0 siblings, 1 reply; 32+ messages in thread
From: Namhyung Kim @ 2025-02-11 17:42 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
[SNIP]
> > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > {
> > > > > + struct hpp_dimension *hd;
> > > > > +
> > > > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > + hd = &hpp_sort_dimensions[col];
> > > > > + if (implicit && !hd->was_taken)
> > > > > + return 0;
> > > >
> > > > I don't think you need these implicit and was_taken things.
> > > > Just removing from the sort list when it's unregistered seems to work.
> > > >
> > > > ---8<---
> > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > {
> > > > list_del_init(&format->list);
> > > > + list_del_init(&format->sort_list);
> > > > fmt_free(format);
> > > > }
> > > >
> > > > ---8<---
> > >
> > > It merely suppresses the warning, but does not work the same way. See
> > > this for details:
> > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> >
> > But I think it's better to pass --latency option rather than adding it
> > to -s option. If you really want to have specific output fields, then
> > please use -F latency,sym instead.
> >
> > Also I've realized that it should add one sort key in setup_overhead()
> > to support hierarchy mode properly. Something like this?
> >
> > Thanks,
> > Namhyung
> >
> >
> > ---8<---
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > return keys;
> >
> > if (symbol_conf.prefer_latency) {
> > - keys = prefix_if_not_in("overhead", keys);
> > - keys = prefix_if_not_in("latency", keys);
> > - if (symbol_conf.cumulate_callchain) {
> > - keys = prefix_if_not_in("overhead_children", keys);
> > + if (symbol_conf.cumulate_callchain)
> > keys = prefix_if_not_in("latency_children", keys);
> > - }
> > - } else if (!keys || (!strstr(keys, "overhead") &&
> > - !strstr(keys, "latency"))) {
> > - if (symbol_conf.enable_latency)
> > + else
> > keys = prefix_if_not_in("latency", keys);
> > - keys = prefix_if_not_in("overhead", keys);
> > - if (symbol_conf.cumulate_callchain) {
> > - if (symbol_conf.enable_latency)
> > - keys = prefix_if_not_in("latency_children", keys);
> > + } else {
> > + if (symbol_conf.cumulate_callchain)
> > keys = prefix_if_not_in("overhead_children", keys);
> > - }
> > + else
> > + keys = prefix_if_not_in("overhead", keys);
> > }
> >
> > return keys;
>
>
> Have I decoded the patch correctly?
>
> if (symbol_conf.prefer_latency) {
> if (symbol_conf.cumulate_callchain)
> keys = prefix_if_not_in("latency_children", keys);
> else
> keys = prefix_if_not_in("latency", keys);
> } else {
> if (symbol_conf.cumulate_callchain)
> keys = prefix_if_not_in("overhead_children", keys);
> else
> keys = prefix_if_not_in("overhead", keys);
> }
>
Yep, that's correct.
> If I decoded the patch correctly, it's not what we want.
>
> For the default prefer_latency case we also want to add overhead, that
> was intentional for the --latency preset. It does not harm, and allows
> to see/compare differences in latency and overhead.
> Again, if a user wants something custom, there is no way to second
> guess all possible intentions. For non-default cases, we just let the
> user say what exactly they want, and we will follow that.
>
> "latency" should be added even if cumulate_callchain.
Please note that it just sets the sort key - which column you want to
order the result. The output fields for overhead and children will be
added in perf_hpp__init() if you remove the 'was_taken' logic. So I
think this change will have the same output with that.
>
> For the !prefer_latency case, we don't want to mess with
> overhead/latency fields if the user specified any of them explicitly.
> Otherwise this convenience part gets in the user's way and does not
> allow them to do what they want. User says "I want X" and perf says
> "screw you, I will give you Y instead, and won't allow you to possibly
> do X".
That's what -F option does. The -s option used to specify how to group
the histogram entries and it will add 'overhead' (and/or 'latency') if
it's not even requested. So I think it's ok to add more output column
when -s option is used.
But unfortunately, using -F and -s together is confusing and change the
meaning of -s option - it now says how it sort the result.
>
> And see above: -F does not work with --hierarchy, so this part is unskippable.
Yep, but I mean it fixes --hierarchy and --latency. I'm thinking of a
way to support -F and --hierarchy in general.
Thanks,
Namhyung
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-11 17:42 ` Namhyung Kim
@ 2025-02-11 20:23 ` Dmitry Vyukov
2025-02-12 19:47 ` Namhyung Kim
0 siblings, 1 reply; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-11 20:23 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Tue, 11 Feb 2025 at 18:42, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> > On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> [SNIP]
> > > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > > {
> > > > > > + struct hpp_dimension *hd;
> > > > > > +
> > > > > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > > + hd = &hpp_sort_dimensions[col];
> > > > > > + if (implicit && !hd->was_taken)
> > > > > > + return 0;
> > > > >
> > > > > I don't think you need these implicit and was_taken things.
> > > > > Just removing from the sort list when it's unregistered seems to work.
> > > > >
> > > > > ---8<---
> > > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > > {
> > > > > list_del_init(&format->list);
> > > > > + list_del_init(&format->sort_list);
> > > > > fmt_free(format);
> > > > > }
> > > > >
> > > > > ---8<---
> > > >
> > > > It merely suppresses the warning, but does not work the same way. See
> > > > this for details:
> > > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> > >
> > > But I think it's better to pass --latency option rather than adding it
> > > to -s option. If you really want to have specific output fields, then
> > > please use -F latency,sym instead.
> > >
> > > Also I've realized that it should add one sort key in setup_overhead()
> > > to support hierarchy mode properly. Something like this?
> > >
> > > Thanks,
> > > Namhyung
> > >
> > >
> > > ---8<---
> > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > > --- a/tools/perf/util/sort.c
> > > +++ b/tools/perf/util/sort.c
> > > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > > return keys;
> > >
> > > if (symbol_conf.prefer_latency) {
> > > - keys = prefix_if_not_in("overhead", keys);
> > > - keys = prefix_if_not_in("latency", keys);
> > > - if (symbol_conf.cumulate_callchain) {
> > > - keys = prefix_if_not_in("overhead_children", keys);
> > > + if (symbol_conf.cumulate_callchain)
> > > keys = prefix_if_not_in("latency_children", keys);
> > > - }
> > > - } else if (!keys || (!strstr(keys, "overhead") &&
> > > - !strstr(keys, "latency"))) {
> > > - if (symbol_conf.enable_latency)
> > > + else
> > > keys = prefix_if_not_in("latency", keys);
> > > - keys = prefix_if_not_in("overhead", keys);
> > > - if (symbol_conf.cumulate_callchain) {
> > > - if (symbol_conf.enable_latency)
> > > - keys = prefix_if_not_in("latency_children", keys);
> > > + } else {
> > > + if (symbol_conf.cumulate_callchain)
> > > keys = prefix_if_not_in("overhead_children", keys);
> > > - }
> > > + else
> > > + keys = prefix_if_not_in("overhead", keys);
> > > }
> > >
> > > return keys;
> >
> >
> > Have I decoded the patch correctly?
> >
> > if (symbol_conf.prefer_latency) {
> > if (symbol_conf.cumulate_callchain)
> > keys = prefix_if_not_in("latency_children", keys);
> > else
> > keys = prefix_if_not_in("latency", keys);
> > } else {
> > if (symbol_conf.cumulate_callchain)
> > keys = prefix_if_not_in("overhead_children", keys);
> > else
> > keys = prefix_if_not_in("overhead", keys);
> > }
> >
>
> Yep, that's correct.
>
>
> > If I decoded the patch correctly, it's not what we want.
> >
> > For the default prefer_latency case we also want to add overhead, that
> > was intentional for the --latency preset. It does not harm, and allows
> > to see/compare differences in latency and overhead.
> > Again, if a user wants something custom, there is no way to second
> > guess all possible intentions. For non-default cases, we just let the
> > user say what exactly they want, and we will follow that.
> >
> > "latency" should be added even if cumulate_callchain.
>
> Please note that it just sets the sort key - which column you want to
> order the result. The output fields for overhead and children will be
> added in perf_hpp__init() if you remove the 'was_taken' logic. So I
> think this change will have the same output with that.
Yes, but perf_hpp__init() does not have the logic that's currently
contained in setup_overhead().
If the user specified a "latency" field, and we don't want to add
"overhead" in that case, then _both_ setup_overhead() and
perf_hpp__init() must not add "overhead".
If we do what you proposed, then perf_hpp__init() will still add
"overhead" and we go back to square 0.
I used was_taken to not duplicate this non-trivial logic in both
functions. As I mentioned in the previous replies, I tried that but it
was messier/more complex. was_taken is a simple way to not duplicate
logic and keep these functions consistent.
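
A minimal standalone sketch (again, not the perf sources) of how the
two pieces are meant to interact - the sort-order setup marks a
dimension as was_taken, and the implicit additions in perf_hpp__init()
then only take effect for dimensions that were already taken:

---8<---
#include <stdbool.h>
#include <stdio.h>

struct dim {
	const char *name;
	bool taken;	/* currently registered as a sort key */
	bool was_taken;	/* ever registered as a sort key */
};

static struct dim overhead = { .name = "overhead" };
static struct dim latency  = { .name = "latency" };

/* Models __hpp_dimension__add(): called from the sort-order setup. */
static void add_sort_key(struct dim *d)
{
	d->taken = true;
	d->was_taken = true;
	printf("sort key:      %s\n", d->name);
}

/*
 * Models hpp_dimension__add_output(col, implicit): called from
 * perf_hpp__init(); implicit additions are skipped unless the
 * dimension was already picked up as a sort key.
 */
static void add_output(struct dim *d, bool implicit)
{
	if (implicit && !d->was_taken)
		return;
	printf("output column: %s\n", d->name);
}

int main(void)
{
	/* e.g. the user asked for plain "-s overhead,symbol" */
	add_sort_key(&overhead);

	add_output(&latency, true);	/* skipped: never taken as a sort key */
	add_output(&overhead, true);	/* shown */
	return 0;
}
---8<---
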
> > For the !prefer_latency case, we don't want to mess with
> > overhead/latency fields if the user specified any of them explicitly.
> > Otherwise this convenience part gets in the user's way and does not
> > allow them to do what they want. User says "I want X" and perf says
> > "screw you, I will give you Y instead, and won't allow you to possibly
> > do X".
>
> That's what -F option does. The -s option used to specify how to group
> the histogram entries and it will add 'overhead' (and/or 'latency') if
> it's not even requested. So I think it's ok to add more output column
> when -s option is used.
>
> But unfortunately, using -F and -s together is confusing and change the
> meaning of -s option - it now says how it sort the result.
>
> >
> > And see above: -F does not work with --hierarchy, so this part is unskippable.
>
> Yep, but I mean it fixes --hierarchy and --latency. I'm thinking of a
> way to support -F and --hierarchy in general.
I don't know why it was disabled. There are likely other things to
improve, but please let's not tie that to this change.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-11 20:23 ` Dmitry Vyukov
@ 2025-02-12 19:47 ` Namhyung Kim
2025-02-13 9:09 ` Dmitry Vyukov
0 siblings, 1 reply; 32+ messages in thread
From: Namhyung Kim @ 2025-02-12 19:47 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Tue, Feb 11, 2025 at 09:23:30PM +0100, Dmitry Vyukov wrote:
> On Tue, 11 Feb 2025 at 18:42, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> > > On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > [SNIP]
> > > > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > > > {
> > > > > > > + struct hpp_dimension *hd;
> > > > > > > +
> > > > > > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > > > + hd = &hpp_sort_dimensions[col];
> > > > > > > + if (implicit && !hd->was_taken)
> > > > > > > + return 0;
> > > > > >
> > > > > > I don't think you need these implicit and was_taken things.
> > > > > > Just removing from the sort list when it's unregistered seems to work.
> > > > > >
> > > > > > ---8<---
> > > > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > > > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > > > {
> > > > > > list_del_init(&format->list);
> > > > > > + list_del_init(&format->sort_list);
> > > > > > fmt_free(format);
> > > > > > }
> > > > > >
> > > > > > ---8<---
> > > > >
> > > > > It merely suppresses the warning, but does not work the same way. See
> > > > > this for details:
> > > > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> > > >
> > > > But I think it's better to pass --latency option rather than adding it
> > > > to -s option. If you really want to have specific output fields, then
> > > > please use -F latency,sym instead.
> > > >
> > > > Also I've realized that it should add one sort key in setup_overhead()
> > > > to support hierarchy mode properly. Something like this?
> > > >
> > > > Thanks,
> > > > Namhyung
> > > >
> > > >
> > > > ---8<---
> > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > > > --- a/tools/perf/util/sort.c
> > > > +++ b/tools/perf/util/sort.c
> > > > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > > > return keys;
> > > >
> > > > if (symbol_conf.prefer_latency) {
> > > > - keys = prefix_if_not_in("overhead", keys);
> > > > - keys = prefix_if_not_in("latency", keys);
> > > > - if (symbol_conf.cumulate_callchain) {
> > > > - keys = prefix_if_not_in("overhead_children", keys);
> > > > + if (symbol_conf.cumulate_callchain)
> > > > keys = prefix_if_not_in("latency_children", keys);
> > > > - }
> > > > - } else if (!keys || (!strstr(keys, "overhead") &&
> > > > - !strstr(keys, "latency"))) {
> > > > - if (symbol_conf.enable_latency)
> > > > + else
> > > > keys = prefix_if_not_in("latency", keys);
> > > > - keys = prefix_if_not_in("overhead", keys);
> > > > - if (symbol_conf.cumulate_callchain) {
> > > > - if (symbol_conf.enable_latency)
> > > > - keys = prefix_if_not_in("latency_children", keys);
> > > > + } else {
> > > > + if (symbol_conf.cumulate_callchain)
> > > > keys = prefix_if_not_in("overhead_children", keys);
> > > > - }
> > > > + else
> > > > + keys = prefix_if_not_in("overhead", keys);
> > > > }
> > > >
> > > > return keys;
> > >
> > >
> > > Have I decoded the patch correctly?
> > >
> > > if (symbol_conf.prefer_latency) {
> > > if (symbol_conf.cumulate_callchain)
> > > keys = prefix_if_not_in("latency_children", keys);
> > > else
> > > keys = prefix_if_not_in("latency", keys);
> > > } else {
> > > if (symbol_conf.cumulate_callchain)
> > > keys = prefix_if_not_in("overhead_children", keys);
> > > else
> > > keys = prefix_if_not_in("overhead", keys);
> > > }
> > >
> >
> > Yep, that's correct.
> >
> >
> > > If I decoded the patch correctly, it's not what we want.
> > >
> > > For the default prefer_latency case we also want to add overhead, that
> > > was intentional for the --latency preset. It does not harm, and allows
> > > to see/compare differences in latency and overhead.
> > > Again, if a user wants something custom, there is no way to second
> > > guess all possible intentions. For non-default cases, we just let the
> > > user say what exactly they want, and we will follow that.
> > >
> > > "latency" should be added even if cumulate_callchain.
> >
> > Please note that it just sets the sort key - which column you want to
> > order the result. The output fields for overhead and children will be
> > added in perf_hpp__init() if you remove the 'was_taken' logic. So I
> > think this change will have the same output with that.
>
> Yes, but perf_hpp__init() does not have the logic that's currently
> contained in setup_overhead().
>
> If the user specified a "latency" field, and we don't want to add
> "overhead" in that case, then _both_ setup_overhead() and
> perf_hpp__init() must not add "overhead".
Ok, I see your point. But I think it'd be much easier if you allow the
'overhead' column in that case too.
>
> If we do what you proposed, then perf_hpp__init() will still add
> "overhead" and we go back to square 0.
Right, but currently both the default perf report and the --latency
option will show both the overhead and latency columns. That's why I
thought you wanted to display them together.
Actually I don't want to use -s option to describe output fields (like
overhead and latency) but I cannot prevent people from doing that. :(
Maybe you can skip the setup_overhead() if user gives either 'overhead'
or 'latency' explicitly - oh, you have that in the !prefer_latency case.
>
> I used was_taken to not duplicate this non-trivial logic in both
> functions. As I mentioned in the previous replies, I tried that but it
> was messier/more complex. was_taken is a simple way to not duplicate
> logic and keep these functions consistent.
Hmm.. ok. Maybe we can update this part later. Can you please add a
comment in perf_hpp__init() that says the overhead and latency columns
are added to the sort list in setup_overhead(), so they are added
implicitly here only if they were already taken?
>
>
> > > For the !prefer_latency case, we don't want to mess with
> > > overhead/latency fields if the user specified any of them explicitly.
> > > Otherwise this convenience part gets in the user's way and does not
> > > allow them to do what they want. User says "I want X" and perf says
> > > "screw you, I will give you Y instead, and won't allow you to possibly
> > > do X".
> >
> > That's what -F option does. The -s option used to specify how to group
> > the histogram entries and it will add 'overhead' (and/or 'latency') if
> > it's not even requested. So I think it's ok to add more output column
> > when -s option is used.
> >
> > But unfortunately, using -F and -s together is confusing and change the
> > meaning of -s option - it now says how it sort the result.
> >
> > >
> > > And see above: -F does not work with --hierarchy, so this part is unskippable.
> >
> > Yep, but I mean it fixes --hierarchy and --latency. I'm thinking of a
> > way to support -F and --hierarchy in general.
>
> I don't know why it was disabled. There are likely other things to
> improve, but please let's not tie that to this change.
Right, it's a separate issue. I was afraid of mixing output fields and
sort keys in an unexpected order. But maybe we can support that if
that's what the user wants.
Thanks,
Namhyung
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/8] perf report: Add --latency flag
2025-02-12 19:47 ` Namhyung Kim
@ 2025-02-13 9:09 ` Dmitry Vyukov
0 siblings, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-13 9:09 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, linux-perf-users, linux-kernel, Arnaldo Carvalho de Melo
On Wed, 12 Feb 2025 at 20:47, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Feb 11, 2025 at 09:23:30PM +0100, Dmitry Vyukov wrote:
> > On Tue, 11 Feb 2025 at 18:42, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, Feb 11, 2025 at 09:42:16AM +0100, Dmitry Vyukov wrote:
> > > > On Tue, 11 Feb 2025 at 02:02, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > On Fri, Feb 07, 2025 at 08:23:58AM +0100, Dmitry Vyukov wrote:
> > > > > > On Fri, 7 Feb 2025 at 04:44, Namhyung Kim <namhyung@kernel.org> wrote:
> > > [SNIP]
> > > > > > > > @@ -3547,10 +3549,15 @@ static int __hpp_dimension__add_output(struct perf_hpp_list *list,
> > > > > > > > return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > -int hpp_dimension__add_output(unsigned col)
> > > > > > > > +int hpp_dimension__add_output(unsigned col, bool implicit)
> > > > > > > > {
> > > > > > > > + struct hpp_dimension *hd;
> > > > > > > > +
> > > > > > > > BUG_ON(col >= PERF_HPP__MAX_INDEX);
> > > > > > > > - return __hpp_dimension__add_output(&perf_hpp_list, &hpp_sort_dimensions[col]);
> > > > > > > > + hd = &hpp_sort_dimensions[col];
> > > > > > > > + if (implicit && !hd->was_taken)
> > > > > > > > + return 0;
> > > > > > >
> > > > > > > I don't think you need these implicit and was_taken things.
> > > > > > > Just removing from the sort list when it's unregistered seems to work.
> > > > > > >
> > > > > > > ---8<---
> > > > > > > @@ -685,6 +685,7 @@ void perf_hpp_list__prepend_sort_field(struct perf_hpp_list *list,
> > > > > > > static void perf_hpp__column_unregister(struct perf_hpp_fmt *format)
> > > > > > > {
> > > > > > > list_del_init(&format->list);
> > > > > > > + list_del_init(&format->sort_list);
> > > > > > > fmt_free(format);
> > > > > > > }
> > > > > > >
> > > > > > > ---8<---
> > > > > >
> > > > > > It merely suppresses the warning, but does not work the same way. See
> > > > > > this for details:
> > > > > > https://lore.kernel.org/all/CACT4Y+ZREdDL7a+DMKGFGae1ZjX1C8uNRwCGF0c8iUJtTTq0Lw@mail.gmail.com/
> > > > >
> > > > > But I think it's better to pass --latency option rather than adding it
> > > > > to -s option. If you really want to have specific output fields, then
> > > > > please use -F latency,sym instead.
> > > > >
> > > > > Also I've realized that it should add one sort key in setup_overhead()
> > > > > to support hierarchy mode properly. Something like this?
> > > > >
> > > > > Thanks,
> > > > > Namhyung
> > > > >
> > > > >
> > > > > ---8<---
> > > > > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > > > > index 2b6023de7a53ae2e..329c2e9bbc69a725 100644
> > > > > --- a/tools/perf/util/sort.c
> > > > > +++ b/tools/perf/util/sort.c
> > > > > @@ -3817,22 +3817,15 @@ static char *setup_overhead(char *keys)
> > > > > return keys;
> > > > >
> > > > > if (symbol_conf.prefer_latency) {
> > > > > - keys = prefix_if_not_in("overhead", keys);
> > > > > - keys = prefix_if_not_in("latency", keys);
> > > > > - if (symbol_conf.cumulate_callchain) {
> > > > > - keys = prefix_if_not_in("overhead_children", keys);
> > > > > + if (symbol_conf.cumulate_callchain)
> > > > > keys = prefix_if_not_in("latency_children", keys);
> > > > > - }
> > > > > - } else if (!keys || (!strstr(keys, "overhead") &&
> > > > > - !strstr(keys, "latency"))) {
> > > > > - if (symbol_conf.enable_latency)
> > > > > + else
> > > > > keys = prefix_if_not_in("latency", keys);
> > > > > - keys = prefix_if_not_in("overhead", keys);
> > > > > - if (symbol_conf.cumulate_callchain) {
> > > > > - if (symbol_conf.enable_latency)
> > > > > - keys = prefix_if_not_in("latency_children", keys);
> > > > > + } else {
> > > > > + if (symbol_conf.cumulate_callchain)
> > > > > keys = prefix_if_not_in("overhead_children", keys);
> > > > > - }
> > > > > + else
> > > > > + keys = prefix_if_not_in("overhead", keys);
> > > > > }
> > > > >
> > > > > return keys;
> > > >
> > > >
> > > > Have I decoded the patch correctly?
> > > >
> > > > if (symbol_conf.prefer_latency) {
> > > > if (symbol_conf.cumulate_callchain)
> > > > keys = prefix_if_not_in("latency_children", keys);
> > > > else
> > > > keys = prefix_if_not_in("latency", keys);
> > > > } else {
> > > > if (symbol_conf.cumulate_callchain)
> > > > keys = prefix_if_not_in("overhead_children", keys);
> > > > else
> > > > keys = prefix_if_not_in("overhead", keys);
> > > > }
> > > >
> > >
> > > Yep, that's correct.
> > >
> > >
> > > > If I decoded the patch correctly, it's not what we want.
> > > >
> > > > For the default prefer_latency case we also want to add overhead, that
> > > > was intentional for the --latency preset. It does not harm, and allows
> > > > to see/compare differences in latency and overhead.
> > > > Again, if a user wants something custom, there is no way to second
> > > > guess all possible intentions. For non-default cases, we just let the
> > > > user say what exactly they want, and we will follow that.
> > > >
> > > > "latency" should be added even if cumulate_callchain.
> > >
> > > Please note that it just sets the sort key - which column you want to
> > > order the result. The output fields for overhead and children will be
> > > added in perf_hpp__init() if you remove the 'was_taken' logic. So I
> > > think this change will have the same output with that.
> >
> > Yes, but perf_hpp__init() does not have the logic that's currently
> > contained in setup_overhead().
> >
> > If the user specified a "latency" field, and we don't want to add
> > "overhead" in that case, then _both_ setup_overhead() and
> > perf_hpp__init() must not add "overhead".
>
> Ok, I see your point. But I think it'd be much easier if you allow the
> 'overhead' column in that case too.
It may be possible. However, one of my motivations, which I did not
fully explain (and only fully realized myself later), is as follows.

You referred to fields and sort order as different things. However,
for a person not working on perf code internals these are mostly the
same thing.
Fields are merged into the sort order, and sort order elements appear
as fields.
Say, if I do "-F symbol,overhead", output will be sorted by symbols
first, so fields == sort. Similarly, anything specified in --sort
appears as fields in output, so sort == fields.
My current code keeps their behavior consistent. Making it
inconsistent may be confusing for users.
For example, --hierarchy requires --sort (it fails with -F), so I tend
to use it more than -F (commands with --sort are already in my bash
history).
So if I want latency and symbol I tend to do:
perf report --sort latency,symbol
and with my current code this behaves the same way as:
perf report -F latency,symbol
Which matches my mental model of F/sort being the same.
However, if we remove was_taken, then the above commands behave
differently, which for an end user like me is:
¯\_(ツ)_/¯
After reading the code, I think I understand the difference
(hopefully), e.g. it allows me to do subtle things like:
-F symbol,overhead --sort overhead,symbol
(not sure how useful this is compared to always sticking to the field
order for sorting)
> > If we do what you proposed, then perf_hpp__init() will still add
> > "overhead" and we go back to square 0.
>
> Right, but currently both the default perf report and the --latency
> option will show both the overhead and latency columns. That's why I
> thought you wanted to display them together.
>
> Actually I don't want to use -s option to describe output fields (like
> overhead and latency) but I cannot prevent people from doing that. :(
> Maybe you can skip the setup_overhead() if user gives either 'overhead'
> or 'latency' explicitly - oh, you have that in the !prefer_latency case.
>
> >
> > I used was_taken to not duplicate this non-trivial logic in both
> > functions. As I mentioned in the previous replies, I tried that but it
> > was messier/more complex. was_taken is a simple way to not duplicate
> > logic and keep these functions consistent.
>
> Hmm.. ok. Maybe we can update this part later. Can you please add a
> comment in perf_hpp__init() that says the overhead and latency columns
> are added to the sort list in setup_overhead(), so they are added
> implicitly here only if they were already taken?
Added a comment capturing some of this discussion in v7.
> > > > For the !prefer_latency case, we don't want to mess with
> > > > overhead/latency fields if the user specified any of them explicitly.
> > > > Otherwise this convenience part gets in the user's way and does not
> > > > allow them to do what they want. User says "I want X" and perf says
> > > > "screw you, I will give you Y instead, and won't allow you to possibly
> > > > do X".
> > >
> > > That's what -F option does. The -s option used to specify how to group
> > > the histogram entries and it will add 'overhead' (and/or 'latency') if
> > > it's not even requested. So I think it's ok to add more output column
> > > when -s option is used.
> > >
> > > But unfortunately, using -F and -s together is confusing and change the
> > > meaning of -s option - it now says how it sort the result.
> > >
> > > >
> > > > And see above: -F does not work with --hierarchy, so this part is unskippable.
> > >
> > > Yep, but I mean it fixes --hierarchy and --latency. I'm thinking of a
> > > way to support -F and --hierarchy in general.
> >
> > I don't know why it was disabled. There are likely other things to
> > improve, but please let's not tie that to this change.
>
> Right, it's a separate issue. I was afraid of mixing output fields and
> sort keys in an unexpected order. But maybe we can support that if
> that's what the user wants.
>
> Thanks,
> Namhyung
>
* Re: [PATCH v5 0/8] perf report: Add latency and parallelism profiling
2025-02-10 7:17 ` Dmitry Vyukov
2025-02-10 17:11 ` Andi Kleen
@ 2025-02-13 9:09 ` Dmitry Vyukov
1 sibling, 0 replies; 32+ messages in thread
From: Dmitry Vyukov @ 2025-02-13 9:09 UTC (permalink / raw)
To: Andi Kleen
Cc: namhyung, irogers, linux-perf-users, linux-kernel,
Arnaldo Carvalho de Melo
On Mon, 10 Feb 2025 at 08:17, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Fri, 7 Feb 2025 at 19:30, Andi Kleen <ak@linux.intel.com> wrote:
> >
> > > > I assume it just works, but might be worth checking.
> > >
> > > Yes, it seems to just work as one would assume. Things just combine as intended.
> >
> > Great.
> >
> > >
> > > > It was intended to address some of these issues too.
> > >
> > > What issue? Latency profiling? I wonder what approach you had in mind?
> >
> > The problem with gaps in parallelism is usually how things change
> > over time. If you have e.g. idle periods, they tend to look different
> > in the profile. With the full aggregation you don't see it, but with
> > a time series it tends to stand out.
> >
> > But yes, that approach usually only works for large gaps. I guess
> > yours is better for more fine-grained issues.
> >
> > And I guess it might not be the most intuitive for less experienced
> > users.
> >
> > This is BTW actually a case for using a perf data GUI like hotspot or
> > vtune, which can visualize this better and let you zoom arbitrarily.
> > Standard perf only has timechart for this, but it's a bit clunky to use
> > and only shows the reschedules.
> >
> > > Also, (1) the user still needs to understand that the default profile
> > > is wrong, (2) be proficient with perf features, (3) manually aggregate
> > > lots of data (time slicing increases the amount of data in the profile
> > > X times), and (4) deal with inaccuracy caused by edge effects (e.g. the
> > > slice is 1s, but the program phase changed mid-second).
> >
> > If you're lucky and the problem is not in the long tail, you can use a
> > high percentage cutoff (--percent-limit) to eliminate most of the data.
> >
> > Then you just have "top N functions over time", which tends to be quite
> > readable. One drawback of that approach is that it doesn't show
> > the "other", but perhaps we'll fix that one day.
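(For readers following along, a rough sketch of that approach; the flag
values here are illustrative and should be adjusted to taste:

    perf report -F time,overhead,sym --time-quantum=1s --percent-limit 5
)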
> >
> > But yes, the fact that perf has too many options and is not intuitive,
> > and that most people miss most of the features, is an inherent problem.
> > I don't have a good solution for that unfortunately, other than perhaps
> > better documentation.
>
> I don't think this is a solution :(
>
> I provided lots of rationale in this patch series for enabling latency
> profiling by default, for exactly this reason. If we just capture
> context switches, then we can show both overhead and latency; even if
> we sort by overhead by default, people would still see the latency
> column and might start thinking/asking questions.
> But this is not happening, so mostly only people on this thread will know about it :)
>
>
> > > But it does open some interesting capabilities in combination with a
> > > latency profile, e.g. the following shows how parallelism was changing
> > > over time.
> > >
> > > For a profile of building perf with make:
> >
> > Very nice! Looks useful.
> >
> > Perhaps add that variant to tips.txt too.
Now done in v7
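(For reference, the new tip is along these lines -- a sketch, not
necessarily the exact v7 wording, and assuming the record-side --latency
flag from this series; <workload> is a placeholder:

    To see how parallelism changes over time in a latency profile, try:
        perf record --latency -- <workload>
        perf report -F time,latency,parallelism --time-quantum=1s
)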
> That's a good idea.
> I am waiting for other feedback so that I don't have to resend the series just because of this.
>
>
> > > perf report -F time,latency,parallelism --time-quantum=1s
> > >
> > > # Time Latency Parallelism
> > > # ............ ........ ...........
> > > #
> > > 1795957.0000 1.42% 1
> > > 1795957.0000 0.07% 2
> > > 1795957.0000 0.01% 3
> > > 1795957.0000 0.00% 4
> > >
> > > 1795958.0000 4.82% 1
> > > 1795958.0000 0.11% 2
> > > 1795958.0000 0.00% 3
> > > ...
> > > 1795964.0000 1.76% 2
> > > 1795964.0000 0.58% 4
> > > 1795964.0000 0.45% 1
> > > 1795964.0000 0.23% 10
> > > 1795964.0000 0.21% 6
> > >
> > > /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> > >
> > > Here it finally started running on more than 1 CPU.
> >
> >
> > -Andi
Thread overview: 32+ messages
2025-02-05 16:27 [PATCH v5 0/8] perf report: Add latency and parallelism profiling Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 1/8] perf report: Add machine parallelism Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 2/8] perf report: Add parallelism sort key Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 3/8] perf report: Switch filtered from u8 to u16 Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 4/8] perf report: Add parallelism filter Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 5/8] perf report: Add latency output field Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 6/8] perf report: Add --latency flag Dmitry Vyukov
2025-02-07 3:44 ` Namhyung Kim
2025-02-07 7:23 ` Dmitry Vyukov
2025-02-11 1:02 ` Namhyung Kim
2025-02-11 8:30 ` Dmitry Vyukov
2025-02-11 8:42 ` Dmitry Vyukov
2025-02-11 17:42 ` Namhyung Kim
2025-02-11 20:23 ` Dmitry Vyukov
2025-02-12 19:47 ` Namhyung Kim
2025-02-13 9:09 ` Dmitry Vyukov
2025-02-07 3:53 ` Namhyung Kim
2025-02-07 11:42 ` Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 7/8] perf report: Add latency and parallelism profiling documentation Dmitry Vyukov
2025-02-05 16:27 ` [PATCH v5 8/8] perf hist: Shrink struct hist_entry size Dmitry Vyukov
2025-02-06 18:30 ` [PATCH v5 0/8] perf report: Add latency and parallelism profiling Andi Kleen
2025-02-06 18:41 ` Dmitry Vyukov
2025-02-06 18:51 ` Ian Rogers
2025-02-07 3:57 ` Namhyung Kim
2025-02-07 11:44 ` Dmitry Vyukov
2025-02-06 18:57 ` Andi Kleen
2025-02-06 19:07 ` Andi Kleen
2025-02-07 8:16 ` Dmitry Vyukov
2025-02-07 18:30 ` Andi Kleen
2025-02-10 7:17 ` Dmitry Vyukov
2025-02-10 17:11 ` Andi Kleen
2025-02-13 9:09 ` Dmitry Vyukov