linux-perf-users.vger.kernel.org archive mirror
* [RFC/PATCH] perf report: Support latency profiling in system-wide mode
@ 2025-05-03  0:36 Namhyung Kim
  2025-05-04  8:22 ` Ingo Molnar
  2025-05-05  8:08 ` Dmitry Vyukov
  0 siblings, 2 replies; 29+ messages in thread
From: Namhyung Kim @ 2025-05-03  0:36 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang
  Cc: Jiri Olsa, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen, Dmitry Vyukov

When profiling a target process (and its children), it's
straightforward to track parallelism using sched-switch info.  The
parallelism is kept at the machine level in this case.

But when profiling multiple processes, as in system-wide mode, it
might not be clear how to apply the (machine-level) parallelism to
different tasks.  That's why latency profiling was disabled for
system-wide mode.

But it should be possible to track parallelism in each process, and
that would be useful for profiling latency issues in multi-threaded
programs.  So this patch tries to enable it.

However, using sched-switch info can be a problem since it may emit a
lot more data, with more chances of losing data when perf cannot keep
up with it.

Instead, perf report can maintain the current process for each CPU as
it sees samples.  Each sample updates the process's parallelism, which
is then used to calculate the latency.  One more improvement is to
exclude the idle task from the latency calculation.
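
Roughly, the per-sample bookkeeping looks like this (a simplified
sketch of what the patch below does, with refcounting, NULL checks and
the parallelism filter left out):

  /* on each sample that carries a valid sample->cpu */
  prev = machine->current[sample->cpu];
  if (prev)
          thread__dec_parallelism(prev);  /* prev left this CPU */

  curr = machine__findnew_thread(machine, sample->pid, sample->pid);
  machine->current[sample->cpu] = curr;
  thread__inc_parallelism(curr);          /* skipped for idle (pid 0) */

  al->parallelism = thread__parallelism(curr);
  /* scaled by 1000 to stay in integer math */
  al->latency = sample->period * 1000 / al->parallelism;
  if (sample->pid == 0)                   /* don't charge idle */
          al->latency = 0;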

Here's an example:

  # perf record -a -- perf bench sched messaging

This basically forks separate sender and receiver tasks for the run.

  # perf report --latency -s comm --stdio
  ...
  #
  #  Latency  Overhead  Command
  # ........  ........  ...............
  #
      98.14%    95.97%  sched-messaging
       0.78%     0.93%  gnome-shell
       0.36%     0.34%  ptyxis
       0.23%     0.23%  kworker/u112:0-
       0.23%     0.44%  perf
       0.08%     0.10%  KMS thread
       0.05%     0.05%  rcu_preempt
       0.05%     0.05%  kworker/u113:2-
       ...

So the latency profile is similar to the overhead.  But when you run
it with the -t option to make it multi-threaded:

  # perf record -a -- perf bench sched messaging -t

Then the latency profile differs from the overhead.

  # perf report --latency -s comm --stdio
  ...
  #
  #  Latency  Overhead  Command
  # ........  ........  ...............
  #
      58.94%    94.52%  sched-messaging
      19.12%     1.22%  perf
      12.13%     0.77%  ptyxis
       4.07%     0.28%  gnome-shell
       1.31%     0.08%  perf-exec
       0.95%     0.06%  kworker/u113:2-
       0.86%     0.05%  KMS thread
       0.85%     0.05%  kworker/18:1-mm
       ...
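
The gap between the two columns follows directly from the weighting:
each sample adds its full period to Overhead but only
period / parallelism to Latency.  In the first run the benchmark forks
separate sender/receiver processes, so per-process parallelism stays
near 1 and the two columns track each other.  With -t the workers are
threads of a single process, so its samples get scaled down, roughly
(numbers made up for illustration):

  overhead_weight = period;                /* e.g. 1000000             */
  latency_weight  = period / parallelism;  /* 1000000 / 20 = 50000     */
                                           /* with ~20 threads on CPU  */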

A side effect is that it will most likely show the latency output
column for system-wide mode by default.

Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-record.c | 16 ++---------
 tools/perf/builtin-report.c | 56 +++++++++++++++++++++++++++++--------
 tools/perf/util/event.c     | 48 ++++++++++++++++++++++++-------
 tools/perf/util/machine.c   | 21 ++++++++++++++
 tools/perf/util/machine.h   |  4 +++
 tools/perf/util/thread.h    | 23 +++++++++++++++
 6 files changed, 133 insertions(+), 35 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8898357325cf4132..67f746601ac05e4d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -4062,19 +4062,9 @@ int cmd_record(int argc, const char **argv)
 	}
 
 	if (record.latency) {
-		/*
-		 * There is no fundamental reason why latency profiling
-		 * can't work for system-wide mode, but exact semantics
-		 * and details are to be defined.
-		 * See the following thread for details:
-		 * https://lore.kernel.org/all/Z4XDJyvjiie3howF@google.com/
-		 */
-		if (record.opts.target.system_wide) {
-			pr_err("Failed: latency profiling is not supported with system-wide collection.\n");
-			err = -EINVAL;
-			goto out_opts;
-		}
-		record.opts.record_switch_events = true;
+		/* It can use sample->cpu to get process-level parallelism. */
+		if (!target__has_cpu(&record.opts.target))
+			record.opts.record_switch_events = true;
 	}
 
 	if (rec->buildid_mmap) {
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index f0299c7ee0254a37..4294508b9df52d83 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1280,6 +1280,22 @@ static int process_attr(const struct perf_tool *tool __maybe_unused,
 	return 0;
 }
 
+static bool is_latency_requested(void)
+{
+	if (symbol_conf.parallelism_list_str || symbol_conf.prefer_latency)
+		return true;
+
+	if (sort_order && (strstr(sort_order, "latency") ||
+			   strstr(sort_order, "parallelism")))
+		return true;
+
+	if (field_order && (strstr(field_order, "latency") ||
+			    strstr(field_order, "parallelism")))
+		return true;
+
+	return false;
+}
+
 #define CALLCHAIN_BRANCH_SORT_ORDER	\
 	"srcline,symbol,dso,callchain_branch_predicted," \
 	"callchain_branch_abort,callchain_branch_cycles"
@@ -1735,18 +1751,10 @@ int cmd_report(int argc, const char **argv)
 	}
 
 	symbol_conf.enable_latency = true;
-	if (report.disable_order || !perf_session__has_switch_events(session)) {
-		if (symbol_conf.parallelism_list_str ||
-			symbol_conf.prefer_latency ||
-			(sort_order && (strstr(sort_order, "latency") ||
-				strstr(sort_order, "parallelism"))) ||
-			(field_order && (strstr(field_order, "latency") ||
-				strstr(field_order, "parallelism")))) {
-			if (report.disable_order)
-				ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
-			else
-				ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
-			return -1;
+	if (report.disable_order) {
+		if (is_latency_requested()) {
+			ui__error("Use of latency profile or parallelism is incompatible with --disable-order.\n");
+			goto error;
 		}
 		/*
 		 * If user did not ask for anything related to
@@ -1755,6 +1763,30 @@ int cmd_report(int argc, const char **argv)
 		symbol_conf.enable_latency = false;
 	}
 
+	if (!perf_session__has_switch_events(session)) {
+		if (evlist__combined_sample_type(session->evlist) & PERF_SAMPLE_CPU) {
+			struct machine *machine = &session->machines.host;
+
+			/*
+			 * Maintain current process for each CPU to track
+			 * parallelism in process level.
+			 */
+			ret = machine__create_current_table(machine);
+			if (ret < 0)
+				goto error;
+		} else if (is_latency_requested()) {
+			ui__error("Use of latency profile or parallelism requires --latency flag during record.\n");
+			goto error;
+		}
+		else {
+			/*
+			 * If user did not ask for anything related to
+			 * latency/parallelism explicitly, just don't show it.
+			 */
+			symbol_conf.enable_latency = false;
+		}
+	}
+
 	if (last_key != K_SWITCH_INPUT_DATA) {
 		if (sort_order && strstr(sort_order, "ipc")) {
 			parse_options_usage(report_usage, options, "s", 1);
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index c23b77f8f854ad21..6c2c9cd68e3afb5e 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -767,16 +767,44 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
 			al->socket = env->cpu[al->cpu].socket_id;
 	}
 
-	/* Account for possible out-of-order switch events. */
-	al->parallelism = max(1, min(machine->parallelism, machine__nr_cpus_avail(machine)));
-	if (test_bit(al->parallelism, symbol_conf.parallelism_filter))
-		al->filtered |= (1 << HIST_FILTER__PARALLELISM);
-	/*
-	 * Multiply it by some const to avoid precision loss or dealing
-	 * with floats. The multiplier does not matter otherwise since
-	 * we only print it as percents.
-	 */
-	al->latency = sample->period * 1000 / al->parallelism;
+	if (symbol_conf.enable_latency) {
+		if (machine->current && al->cpu >= 0) {
+			struct thread *curr = machine->current[al->cpu];
+
+			if (curr) {
+				thread__dec_parallelism(curr);
+				thread__put(curr);
+			}
+
+			curr = machine__findnew_thread(machine, sample->pid,
+						       sample->pid);
+			if (curr) {
+				machine->current[al->cpu] = curr;
+
+				thread__inc_parallelism(curr);
+				thread__get(curr);
+			}
+
+			al->parallelism = curr ? thread__parallelism(curr) : 1;
+		} else {
+			/* Account for possible out-of-order switch events. */
+			al->parallelism = max(1, min(machine->parallelism,
+						     machine__nr_cpus_avail(machine)));
+		}
+
+		if (test_bit(al->parallelism, symbol_conf.parallelism_filter))
+			al->filtered |= (1 << HIST_FILTER__PARALLELISM);
+		/*
+		 * Multiply it by some const to avoid precision loss or dealing
+		 * with floats. The multiplier does not matter otherwise since
+		 * we only print it as percents.
+		 */
+		al->latency = sample->period * 1000 / al->parallelism;
+
+		/* It won't be exciting to see idle threads in latency profile. */
+		if (sample->pid == 0)
+			al->latency = 0;
+	}
 
 	if (al->map) {
 		if (symbol_conf.dso_list &&
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2531b373f2cf7ca0..8c0e7d418a88f528 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -117,6 +117,7 @@ int machine__init(struct machine *machine, const char *root_dir, pid_t pid)
 	}
 
 	machine->current_tid = NULL;
+	machine->current = NULL;
 	err = 0;
 
 out:
@@ -183,6 +184,12 @@ void machine__exit(struct machine *machine)
 	zfree(&machine->current_tid);
 	zfree(&machine->kallsyms_filename);
 
+	if (machine->current) {
+		for (int i = 0; i < machine__nr_cpus_avail(machine); i++)
+			thread__put(machine->current[i]);
+		zfree(&machine->current);
+	}
+
 	threads__exit(&machine->threads);
 }
 
@@ -3271,3 +3278,17 @@ int machine__hit_all_dsos(struct machine *machine)
 {
 	return dsos__hit_all(&machine->dsos);
 }
+
+int machine__create_current_table(struct machine *machine)
+{
+	int nr = machine__nr_cpus_avail(machine);
+
+	if (nr == 0)
+		return -EINVAL;
+
+	machine->current = calloc(nr, sizeof(*machine->current));
+	if (machine->current == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index b56abec84fed1e3f..13d23b14b25e51fc 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -64,6 +64,8 @@ struct machine {
 	};
 	struct machines   *machines;
 	bool		  trampolines_mapped;
+	/* per-CPU current process for parallelism */
+	struct thread	  **current;
 };
 
 /*
@@ -332,4 +334,6 @@ int machine__resolve(struct machine *machine, struct addr_location *al,
 
 int machine__hit_all_dsos(struct machine *machine);
 
+int machine__create_current_table(struct machine *machine);
+
 #endif /* __PERF_MACHINE_H */
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index cd574a896418ac94..5233436404408d10 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -39,6 +39,7 @@ DECLARE_RC_STRUCT(thread) {
 	pid_t			ppid;
 	int			cpu;
 	int			guest_cpu; /* For QEMU thread */
+	int			parallelism; /* Valid for group leaders */
 	refcount_t		refcnt;
 	/**
 	 * @exited: Has the thread had an exit event. Such threads are usually
@@ -359,4 +360,26 @@ void thread__free_stitch_list(struct thread *thread);
 void thread__resolve(struct thread *thread, struct addr_location *al,
 		     struct perf_sample *sample);
 
+static inline int thread__parallelism(const struct thread *thread)
+{
+	return RC_CHK_ACCESS(thread)->parallelism ? : 1;
+}
+
+static inline void thread__set_parallelism(struct thread *thread, int parallelism)
+{
+	RC_CHK_ACCESS(thread)->parallelism = parallelism;
+}
+
+static inline void thread__inc_parallelism(struct thread *thread)
+{
+	if (thread__pid(thread))
+		RC_CHK_ACCESS(thread)->parallelism++;
+}
+
+static inline void thread__dec_parallelism(struct thread *thread)
+{
+	if (thread__pid(thread) && RC_CHK_ACCESS(thread)->parallelism > 0)
+		RC_CHK_ACCESS(thread)->parallelism--;
+}
+
 #endif	/* __PERF_THREAD_H */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-03  0:36 [RFC/PATCH] perf report: Support latency profiling in system-wide mode Namhyung Kim
@ 2025-05-04  8:22 ` Ingo Molnar
  2025-05-04 19:52   ` Namhyung Kim
  2025-05-05  8:08 ` Dmitry Vyukov
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2025-05-04  8:22 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen,
	Dmitry Vyukov


* Namhyung Kim <namhyung@kernel.org> wrote:

> When it profile a target process (and its children), it's
> straight-forward to track parallelism using sched-switch info.  The
> parallelism is kept in machine-level in this case.
> 
> But when it profile multiple processes like in the system-wide mode,
> it might not be clear how to apply the (machine-level) parallelism to
> different tasks.  That's why it disabled the latency profiling for
> system-wide mode.
> 
> But it should be able to track parallelism in each process and it'd
> useful to profile latency issues in multi-threaded programs.  So this
> patch tries to enable it.
> 
> However using sched-switch info can be a problem since it may emit a lot
> more data and more chances for losing data when perf cannot keep up with
> it.
> 
> Instead, it can maintain the current process for each CPU when it sees
> samples.  And the process updates parallelism so that it can calculate
> the latency based on the value.  One more point to improve is to remove
> the idle task from latency calculation.
> 
> Here's an example:
> 
>   # perf record -a -- perf bench sched messaging
> 
> This basically forks each sender and receiver tasks for the run.
> 
>   # perf report --latency -s comm --stdio
>   ...
>   #
>   #  Latency  Overhead  Command
>   # ........  ........  ...............
>   #
>       98.14%    95.97%  sched-messaging
>        0.78%     0.93%  gnome-shell
>        0.36%     0.34%  ptyxis
>        0.23%     0.23%  kworker/u112:0-
>        0.23%     0.44%  perf
>        0.08%     0.10%  KMS thread
>        0.05%     0.05%  rcu_preempt
>        0.05%     0.05%  kworker/u113:2-
>        ...

Just a generic user-interface comment: I had to look up what 'latency' 
means in this context, and went about 3 hops deep into various pieces 
of description until I found Documentation/cpu-and-latency-overheads.txt,
where after a bit of head-scratching I realized that 'latency' is a
weird alias for 'wall-clock time'...

This is *highly* confusing terminology IMHO.

'Latency' is a highly overloaded concept that almost never corresponds 
to 'wall clock time'. It usually means a relative delay value, which is 
why I initially thought this somehow means instruction-latency or 
memory-latency profiling ...

Ie. 'latency' in its naive meaning, is on the exact opposite side of 
the terminology spectrum of where it should be: it suggests relative 
time, while in reality it's connected to wall-clock/absolute time ...

*Please* use something else. Wall-clock is fine, as 
cpu-and-latency-overheads.txt uses initially, but so would be other 
combinations:

   #1: 'CPU time' vs. 'real time'

        This is short, although a disadvantage is the possible 
        'real-time kernel' source of confusion here.

   #2: 'CPU time' vs. 'wall-clock time'

        This is longer but OK and unambiguous.

   #3: 'relative time' vs. 'absolute time'

        This is short and straightforward, and might be my favorite 
        personally, because relative/absolute is such an unambiguous 
        and well-known terminology and often paired in a similar 
        fashion.

   #4: 'CPU time' vs. 'absolute time'

        This is a combination of #1 and #3 that keeps the 'CPU time' 
        terminology for relative time. The CPU/absolute pairing is not 
        that intuitive though.

   #5: 'CPU time' vs. 'latency'

        This is really, really bad and unintuitive. Sorry to be so 
        harsh and negative about this choice, but this is such a nice 
        feature, which suffers from confusing naming. :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-04  8:22 ` Ingo Molnar
@ 2025-05-04 19:52   ` Namhyung Kim
  2025-05-05  8:03     ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-04 19:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen,
	Dmitry Vyukov

Hi Ingo,

Thanks for sharing your opinion.

On Sun, May 04, 2025 at 10:22:26AM +0200, Ingo Molnar wrote:
> 
> * Namhyung Kim <namhyung@kernel.org> wrote:
> 
> > When it profile a target process (and its children), it's
> > straight-forward to track parallelism using sched-switch info.  The
> > parallelism is kept in machine-level in this case.
> > 
> > But when it profile multiple processes like in the system-wide mode,
> > it might not be clear how to apply the (machine-level) parallelism to
> > different tasks.  That's why it disabled the latency profiling for
> > system-wide mode.
> > 
> > But it should be able to track parallelism in each process and it'd
> > useful to profile latency issues in multi-threaded programs.  So this
> > patch tries to enable it.
> > 
> > However using sched-switch info can be a problem since it may emit a lot
> > more data and more chances for losing data when perf cannot keep up with
> > it.
> > 
> > Instead, it can maintain the current process for each CPU when it sees
> > samples.  And the process updates parallelism so that it can calculate
> > the latency based on the value.  One more point to improve is to remove
> > the idle task from latency calculation.
> > 
> > Here's an example:
> > 
> >   # perf record -a -- perf bench sched messaging
> > 
> > This basically forks each sender and receiver tasks for the run.
> > 
> >   # perf report --latency -s comm --stdio
> >   ...
> >   #
> >   #  Latency  Overhead  Command
> >   # ........  ........  ...............
> >   #
> >       98.14%    95.97%  sched-messaging
> >        0.78%     0.93%  gnome-shell
> >        0.36%     0.34%  ptyxis
> >        0.23%     0.23%  kworker/u112:0-
> >        0.23%     0.44%  perf
> >        0.08%     0.10%  KMS thread
> >        0.05%     0.05%  rcu_preempt
> >        0.05%     0.05%  kworker/u113:2-
> >        ...
> 
> Just a generic user-interface comment: I had to look up what 'latency' 
> means in this context, and went about 3 hops deep into various pieces 
> of description until I found Documentation/cpu-and-latency-overheads.txt,
> where after a bit of head-scratching I realized that 'latency' is a
> weird alias for 'wall-clock time'...
> 
> This is *highly* confusing terminology IMHO.

Sorry for the confusion.  I know I'm terrible at naming things. :)

Actually Dmitry used the term 'wall-clock' profiling at first when he
implemented this feature, but I thought it was not clear what it meant
for non-cycle events.  As 'overhead' is also a generic term, we ended
up with 'latency'.

> 
> 'Latency' is a highly overloaded concept that almost never corresponds 
> to 'wall clock time'. It usually means a relative delay value, which is 
> why I initially thought this somehow means instruction-latency or 
> memory-latency profiling ...
> 
> Ie. 'latency' in its naive meaning, is on the exact opposite side of 
> the terminology spectrum of where it should be: it suggests relative 
> time, while in reality it's connected to wall-clock/absolute time ...
> 
> *Please* use something else. Wall-clock is fine, as 
> cpu-and-latency-overheads.txt uses initially, but so would be other 
> combinations:
> 
>    #1: 'CPU time' vs. 'real time'
> 
>         This is short, although a disadvantage is the possible 
>         'real-time kernel' source of confusion here.
> 
>    #2: 'CPU time' vs. 'wall-clock time'
> 
>         This is longer but OK and unambiguous.
> 
>    #3: 'relative time' vs. 'absolute time'
> 
>         This is short and straightforward, and might be my favorite 
>         personally, because relative/absolute is such an unambiguous 
>         and well-known terminology and often paired in a similar 
>         fashion.
> 
>    #4: 'CPU time' vs. 'absolute time'
> 
>         This is a combination of #1 and #3 that keeps the 'CPU time' 
>         terminology for relative time. The CPU/absolute pairing is not 
>         that intuitive though.
> 
>    #5: 'CPU time' vs. 'latency'
> 
>         This is really, really bad and unintuitive. Sorry to be so 
>         harsh and negative about this choice, but this is such a nice 
>         feature, which suffers from confusing naming. :-)

Thanks for your suggestions.  My main concern is that it's not just
about cpu-time and wallclock-time.  perf tools can measure all kinds of
events with different meanings.  So I think we need generic terms to
cover them.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-04 19:52   ` Namhyung Kim
@ 2025-05-05  8:03     ` Dmitry Vyukov
  2025-05-06 14:57       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-05  8:03 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang,
	Jiri Olsa, Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users,
	Andi Kleen

On Sun, 4 May 2025 at 21:52, Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hi Ingo,
>
> Thanks for sharing your opinion.
>
> On Sun, May 04, 2025 at 10:22:26AM +0200, Ingo Molnar wrote:
> >
> > * Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > > When it profile a target process (and its children), it's
> > > straight-forward to track parallelism using sched-switch info.  The
> > > parallelism is kept in machine-level in this case.
> > >
> > > But when it profile multiple processes like in the system-wide mode,
> > > it might not be clear how to apply the (machine-level) parallelism to
> > > different tasks.  That's why it disabled the latency profiling for
> > > system-wide mode.
> > >
> > > But it should be able to track parallelism in each process and it'd
> > > useful to profile latency issues in multi-threaded programs.  So this
> > > patch tries to enable it.
> > >
> > > However using sched-switch info can be a problem since it may emit a lot
> > > more data and more chances for losing data when perf cannot keep up with
> > > it.
> > >
> > > Instead, it can maintain the current process for each CPU when it sees
> > > samples.  And the process updates parallelism so that it can calculate
> > > the latency based on the value.  One more point to improve is to remove
> > > the idle task from latency calculation.
> > >
> > > Here's an example:
> > >
> > >   # perf record -a -- perf bench sched messaging
> > >
> > > This basically forks each sender and receiver tasks for the run.
> > >
> > >   # perf report --latency -s comm --stdio
> > >   ...
> > >   #
> > >   #  Latency  Overhead  Command
> > >   # ........  ........  ...............
> > >   #
> > >       98.14%    95.97%  sched-messaging
> > >        0.78%     0.93%  gnome-shell
> > >        0.36%     0.34%  ptyxis
> > >        0.23%     0.23%  kworker/u112:0-
> > >        0.23%     0.44%  perf
> > >        0.08%     0.10%  KMS thread
> > >        0.05%     0.05%  rcu_preempt
> > >        0.05%     0.05%  kworker/u113:2-
> > >        ...
> >
> > Just a generic user-interface comment: I had to look up what 'latency'
> > means in this context, and went about 3 hops deep into various pieces
> > of description until I found Documentation/cpu-and-latency-overheads.txt,
> > where after a bit of head-scratching I realized that 'latency' is a
> > weird alias for 'wall-clock time'...
> >
> > This is *highly* confusing terminology IMHO.
>
> Sorry for the confusion.  I know I'm terrible at naming things. :)
>
> Actually Dmitry used the term 'wall-clock' profiling at first when he
> implemented this feature but I thought it was not clear how it meant
> for non-cycle events.  As 'overhead' is also a generic term, we ended
> up with 'latency'.

Exactly :)

I've also talked with a bunch of people about this, and everybody
proposes their own term and is confused by all other proposals.

The problem is that we did not just lack this fundamental profiling
capability in all profilers out there, but we, as a community, also
still don't know how to even talk about these things...

There is no terminology that would be clear for everybody. E.g. when
some people hear wall-clock, they assume it samples every thread
(runnable and non-runnable) every time unit (but that's a vastly
different profile from this one).


> > 'Latency' is a highly overloaded concept that almost never corresponds
> > to 'wall clock time'. It usually means a relative delay value, which is
> > why I initially thought this somehow means instruction-latency or
> > memory-latency profiling ...
> >
> > Ie. 'latency' in its naive meaning, is on the exact opposite side of
> > the terminology spectrum of where it should be: it suggests relative
> > time, while in reality it's connected to wall-clock/absolute time ...
> >
> > *Please* use something else. Wall-clock is fine, as
> > cpu-and-latency-overheads.txt uses initially, but so would be other
> > combinations:
> >
> >    #1: 'CPU time' vs. 'real time'
> >
> >         This is short, although a disadvantage is the possible
> >         'real-time kernel' source of confusion here.
> >
> >    #2: 'CPU time' vs. 'wall-clock time'
> >
> >         This is longer but OK and unambiguous.
> >
> >    #3: 'relative time' vs. 'absolute time'
> >
> >         This is short and straightforward, and might be my favorite
> >         personally, because relative/absolute is such an unambiguous
> >         and well-known terminology and often paired in a similar
> >         fashion.
> >
> >    #4: 'CPU time' vs. 'absolute time'
> >
> >         This is a combination of #1 and #3 that keeps the 'CPU time'
> >         terminology for relative time. The CPU/absolute pairing is not
> >         that intuitive though.
> >
> >    #5: 'CPU time' vs. 'latency'
> >
> >         This is really, really bad and unintuitive. Sorry to be so
> >         harsh and negative about this choice, but this is such a nice
> >         feature, which suffers from confusing naming. :-)
>
> Thanks for your seggestions.  My main concern is that it's not just
> about cpu-time and wallclock-time.  perf tools can measure any events
> that have different meanings.  So I think we need generic terms to cover
> them.

I did not see a conflict there. It's possible to sample, say, cache
misses per unit of CPU time or per unit of wall-clock time.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-03  0:36 [RFC/PATCH] perf report: Support latency profiling in system-wide mode Namhyung Kim
  2025-05-04  8:22 ` Ingo Molnar
@ 2025-05-05  8:08 ` Dmitry Vyukov
  2025-05-06  5:30   ` Namhyung Kim
  1 sibling, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-05  8:08 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Sat, 3 May 2025 at 02:36, Namhyung Kim <namhyung@kernel.org> wrote:
>
> When it profile a target process (and its children), it's
> straight-forward to track parallelism using sched-switch info.  The
> parallelism is kept in machine-level in this case.
>
> But when it profile multiple processes like in the system-wide mode,
> it might not be clear how to apply the (machine-level) parallelism to
> different tasks.  That's why it disabled the latency profiling for
> system-wide mode.
>
> But it should be able to track parallelism in each process and it'd
> useful to profile latency issues in multi-threaded programs.  So this
> patch tries to enable it.
>
> However using sched-switch info can be a problem since it may emit a lot
> more data and more chances for losing data when perf cannot keep up with
> it.
>
> Instead, it can maintain the current process for each CPU when it sees
> samples.

Interesting.

Few questions:
1. Do we always see a CPU sample when a CPU becomes idle? Otherwise we
will think that the last thread runs on that CPU for arbitrarily long,
when it's actually not.
2. If yes, can we also lose that "terminating" event when a CPU becomes
idle? If yes, then it looks equivalent to missing a context switch
event.
3. Does this mode kick in even for non system-wide profiles (collected
w/o context switch events)? If yes, do we properly understand when a
thread stops running for such profiles? How do we do that? There won't
be samples for idle/other tasks.



> ...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-05  8:08 ` Dmitry Vyukov
@ 2025-05-06  5:30   ` Namhyung Kim
  2025-05-06  5:55     ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-06  5:30 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

Hello,

On Mon, May 05, 2025 at 10:08:17AM +0200, Dmitry Vyukov wrote:
> On Sat, 3 May 2025 at 02:36, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > When it profile a target process (and its children), it's
> > straight-forward to track parallelism using sched-switch info.  The
> > parallelism is kept in machine-level in this case.
> >
> > But when it profile multiple processes like in the system-wide mode,
> > it might not be clear how to apply the (machine-level) parallelism to
> > different tasks.  That's why it disabled the latency profiling for
> > system-wide mode.
> >
> > But it should be able to track parallelism in each process and it'd
> > useful to profile latency issues in multi-threaded programs.  So this
> > patch tries to enable it.
> >
> > However using sched-switch info can be a problem since it may emit a lot
> > more data and more chances for losing data when perf cannot keep up with
> > it.
> >
> > Instead, it can maintain the current process for each CPU when it sees
> > samples.
> 
> Interesting.
> 
> Few questions:
> 1. Do we always see a CPU sample when a CPU becomes idle? Otherwise we
> will think that the last thread runs on that CPU for arbitrary long,
> when it's actually not.

No, it's not guaranteed to have a sample for idle tasks.  So right, it
can mis-calculate the parallelism for the last task.  If we can emit
sched-switches only when it goes to the idle task, it'd be accurate.
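
To make the inaccuracy concrete, a hypothetical timeline (CPU and pid
numbers are made up):

  /*
   * t0: sample on CPU 5 for pid 42 -> current[5] = 42, parallelism(42)++
   * t1: pid 42 blocks and CPU 5 goes idle, but no sample is emitted
   * t2: CPU 5 stays idle; pid 42 still counts as running there, so
   *     samples of its other threads keep being divided by the stale,
   *     too-high parallelism until the next sample lands on CPU 5
   */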


> 2. If yes, can we also lose that "terminating" even when a CPU becomes
> idle? If yes, then it looks equivalent to missing a context switch
> event.

I'm not sure what you are asking.  When it loses some records because
the buffer is full, it'll still see the task of the last sample on each
CPU.
Maybe we want to reset the current task after PERF_RECORD_LOST.
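
A rough sketch of such a reset, reusing the helpers from this patch
(the function name and the exact hook, e.g. on PERF_RECORD_LOST
processing, are only illustrative):

  /* sketch: forget the cached task of a CPU once events were lost */
  static void machine__reset_current(struct machine *machine, int cpu)
  {
          struct thread *curr;

          if (machine->current == NULL || cpu < 0)
                  return;

          curr = machine->current[cpu];
          if (curr) {
                  thread__dec_parallelism(curr);
                  thread__put(curr);
                  machine->current[cpu] = NULL;
          }
  }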


> 3. Does this mode kick in even for non system-wide profiles (collected
> w/o context switch events)? If yes, do we properly understand when a
> thread stops running for such profiles? How do we do that? There won't
> be samples for idle/other tasks.

For non system-wide profiles, the problem is that it cannot know when
the current task is scheduled out so that it can decrease the count of
parallelism.  So this approach cannot work and sched-switch info is
required.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06  5:30   ` Namhyung Kim
@ 2025-05-06  5:55     ` Dmitry Vyukov
  2025-05-06  6:43       ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-06  5:55 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, 6 May 2025 at 07:30, Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Mon, May 05, 2025 at 10:08:17AM +0200, Dmitry Vyukov wrote:
> > On Sat, 3 May 2025 at 02:36, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > When it profile a target process (and its children), it's
> > > straight-forward to track parallelism using sched-switch info.  The
> > > parallelism is kept in machine-level in this case.
> > >
> > > But when it profile multiple processes like in the system-wide mode,
> > > it might not be clear how to apply the (machine-level) parallelism to
> > > different tasks.  That's why it disabled the latency profiling for
> > > system-wide mode.
> > >
> > > But it should be able to track parallelism in each process and it'd
> > > useful to profile latency issues in multi-threaded programs.  So this
> > > patch tries to enable it.
> > >
> > > However using sched-switch info can be a problem since it may emit a lot
> > > more data and more chances for losing data when perf cannot keep up with
> > > it.
> > >
> > > Instead, it can maintain the current process for each CPU when it sees
> > > samples.
> >
> > Interesting.
> >
> > Few questions:
> > 1. Do we always see a CPU sample when a CPU becomes idle? Otherwise we
> > will think that the last thread runs on that CPU for arbitrary long,
> > when it's actually not.
>
> No, it's not guaranteed to have a sample for idle tasks.  So right, it
> can mis-calculate the parallelism for the last task.  If we can emit
> sched-switches only when it goes to the idle task, it'd be accurate.

Then I think the profile can be significantly off if the system wasn't
~100% loaded, right?

> > 2. If yes, can we also lose that "terminating" even when a CPU becomes
> > idle? If yes, then it looks equivalent to missing a context switch
> > event.
>
> I'm not sure what you are asking.  When it lose some records because the
> buffer is full, it'll see the task of the last sample on each CPU.
> Maybe we want to reset the current task after PERF_RECORD_LOST.

This probably does not matter much if the answer to question 1 is No.

But what I was asking is the following:

let's say we have samples:
Sample 1 for Pid 42 on Cpu 10
Sample 2 for idle task on Cpu 10
... no samples for some time on Cpu 10 ...

When we process sample 2, we decrement the counter for running tasks
for Pid 42, right.
Now if sample 2 is lost, then we don't do the decrement and the accounting
becomes off.
In a sense this is equivalent to the problem of losing context switch event.


> > 3. Does this mode kick in even for non system-wide profiles (collected
> > w/o context switch events)? If yes, do we properly understand when a
> > thread stops running for such profiles? How do we do that? There won't
> > be samples for idle/other tasks.
>
> For non system-wide profiles, the problem is that it cannot know when
> the current task is scheduled out so that it can decrease the count of
> parallelism.  So this approach cannot work and sched-switch info is
> required.

Where does the patch check that this mode is used only for system-wide profiles?
Is it that PERF_SAMPLE_CPU is present only for system-wide profiles?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06  5:55     ` Dmitry Vyukov
@ 2025-05-06  6:43       ` Namhyung Kim
  2025-05-06  6:46         ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-06  6:43 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, May 06, 2025 at 07:55:25AM +0200, Dmitry Vyukov wrote:
> On Tue, 6 May 2025 at 07:30, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > On Mon, May 05, 2025 at 10:08:17AM +0200, Dmitry Vyukov wrote:
> > > On Sat, 3 May 2025 at 02:36, Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > When it profile a target process (and its children), it's
> > > > straight-forward to track parallelism using sched-switch info.  The
> > > > parallelism is kept in machine-level in this case.
> > > >
> > > > But when it profile multiple processes like in the system-wide mode,
> > > > it might not be clear how to apply the (machine-level) parallelism to
> > > > different tasks.  That's why it disabled the latency profiling for
> > > > system-wide mode.
> > > >
> > > > But it should be able to track parallelism in each process and it'd
> > > > useful to profile latency issues in multi-threaded programs.  So this
> > > > patch tries to enable it.
> > > >
> > > > However using sched-switch info can be a problem since it may emit a lot
> > > > more data and more chances for losing data when perf cannot keep up with
> > > > it.
> > > >
> > > > Instead, it can maintain the current process for each CPU when it sees
> > > > samples.
> > >
> > > Interesting.
> > >
> > > Few questions:
> > > 1. Do we always see a CPU sample when a CPU becomes idle? Otherwise we
> > > will think that the last thread runs on that CPU for arbitrary long,
> > > when it's actually not.
> >
> > No, it's not guaranteed to have a sample for idle tasks.  So right, it
> > can mis-calculate the parallelism for the last task.  If we can emit
> > sched-switches only when it goes to the idle task, it'd be accurate.
> 
> Then I think the profile can be significantly off if the system wasn't
> ~100% loaded, right?

Yep, it can be.

> 
> > > 2. If yes, can we also lose that "terminating" even when a CPU becomes
> > > idle? If yes, then it looks equivalent to missing a context switch
> > > event.
> >
> > I'm not sure what you are asking.  When it lose some records because the
> > buffer is full, it'll see the task of the last sample on each CPU.
> > Maybe we want to reset the current task after PERF_RECORD_LOST.
> 
> This probably does not matter much if the answer to question 1 is No.
> 
> But what I was is the following:
> 
> let's say we have samples:
> Sample 1 for Pid 42 on Cpu 10
> Sample 2 for idle task on Cpu 10
> ... no samples for some time on Cpu 10 ...
> 
> When we process sample 2, we decrement the counter for running tasks
> for Pid 42, right.
> Now if sample 2 is lost, then we don't do decrement and the accounting
> becomes off.
> In a sense this is equivalent to the problem of losing context switch event.

Right.  But I think it's hard to be correct once it loses something.

> 
> 
> > > 3. Does this mode kick in even for non system-wide profiles (collected
> > > w/o context switch events)? If yes, do we properly understand when a
> > > thread stops running for such profiles? How do we do that? There won't
> > > be samples for idle/other tasks.
> >
> > For non system-wide profiles, the problem is that it cannot know when
> > the current task is scheduled out so that it can decrease the count of
> > parallelism.  So this approach cannot work and sched-switch info is
> > required.
> 
> Where does the patch check that this mode is used only for system-wide profiles?
> Is it that PERF_SAMPLE_CPU present only for system-wide profiles?

Basically yes, but you can use --sample-cpu to add it.

In util/evsel.c::evsel__config():

	if (target__has_cpu(&opts->target) || opts->sample_cpu)
		evsel__set_sample_bit(evsel, CPU);

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06  6:43       ` Namhyung Kim
@ 2025-05-06  6:46         ` Dmitry Vyukov
  2025-05-06  7:09           ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-06  6:46 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, 6 May 2025 at 08:43, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, May 06, 2025 at 07:55:25AM +0200, Dmitry Vyukov wrote:
> > On Tue, 6 May 2025 at 07:30, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Mon, May 05, 2025 at 10:08:17AM +0200, Dmitry Vyukov wrote:
> > > > On Sat, 3 May 2025 at 02:36, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > When it profile a target process (and its children), it's
> > > > > straight-forward to track parallelism using sched-switch info.  The
> > > > > parallelism is kept in machine-level in this case.
> > > > >
> > > > > But when it profile multiple processes like in the system-wide mode,
> > > > > it might not be clear how to apply the (machine-level) parallelism to
> > > > > different tasks.  That's why it disabled the latency profiling for
> > > > > system-wide mode.
> > > > >
> > > > > But it should be able to track parallelism in each process and it'd
> > > > > useful to profile latency issues in multi-threaded programs.  So this
> > > > > patch tries to enable it.
> > > > >
> > > > > However using sched-switch info can be a problem since it may emit a lot
> > > > > more data and more chances for losing data when perf cannot keep up with
> > > > > it.
> > > > >
> > > > > Instead, it can maintain the current process for each CPU when it sees
> > > > > samples.
> > > >
> > > > Interesting.
> > > >
> > > > Few questions:
> > > > 1. Do we always see a CPU sample when a CPU becomes idle? Otherwise we
> > > > will think that the last thread runs on that CPU for arbitrary long,
> > > > when it's actually not.
> > >
> > > No, it's not guaranteed to have a sample for idle tasks.  So right, it
> > > can mis-calculate the parallelism for the last task.  If we can emit
> > > sched-switches only when it goes to the idle task, it'd be accurate.
> >
> > Then I think the profile can be significantly off if the system wasn't
> > ~100% loaded, right?
>
> Yep, it can be.
>
> >
> > > > 2. If yes, can we also lose that "terminating" even when a CPU becomes
> > > > idle? If yes, then it looks equivalent to missing a context switch
> > > > event.
> > >
> > > I'm not sure what you are asking.  When it lose some records because the
> > > buffer is full, it'll see the task of the last sample on each CPU.
> > > Maybe we want to reset the current task after PERF_RECORD_LOST.
> >
> > This probably does not matter much if the answer to question 1 is No.
> >
> > But what I was is the following:
> >
> > let's say we have samples:
> > Sample 1 for Pid 42 on Cpu 10
> > Sample 2 for idle task on Cpu 10
> > ... no samples for some time on Cpu 10 ...
> >
> > When we process sample 2, we decrement the counter for running tasks
> > for Pid 42, right.
> > Now if sample 2 is lost, then we don't do decrement and the accounting
> > becomes off.
> > In a sense this is equivalent to the problem of losing context switch event.
>
> Right.  But I think it's hard to be correct once it loses something.
>
> >
> >
> > > > 3. Does this mode kick in even for non system-wide profiles (collected
> > > > w/o context switch events)? If yes, do we properly understand when a
> > > > thread stops running for such profiles? How do we do that? There won't
> > > > be samples for idle/other tasks.
> > >
> > > For non system-wide profiles, the problem is that it cannot know when
> > > the current task is scheduled out so that it can decrease the count of
> > > parallelism.  So this approach cannot work and sched-switch info is
> > > required.
> >
> > Where does the patch check that this mode is used only for system-wide profiles?
> > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
>
> Basically yes, but you can use --sample-cpu to add it.

Are you sure? --sample-cpu seems to work for non-system-wide profiles too.

> In util/evsel.c::evsel__config():
>
>         if (target__has_cpu(&opts->target) || opts->sample_cpu)
>                 evsel__set_sample_bit(evsel, CPU);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06  6:46         ` Dmitry Vyukov
@ 2025-05-06  7:09           ` Namhyung Kim
  2025-05-06  7:40             ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-06  7:09 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, May 06, 2025 at 08:46:20AM +0200, Dmitry Vyukov wrote:
> On Tue, 6 May 2025 at 08:43, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, May 06, 2025 at 07:55:25AM +0200, Dmitry Vyukov wrote:
> > > Where does the patch check that this mode is used only for system-wide profiles?
> > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> >
> > Basically yes, but you can use --sample-cpu to add it.
> 
> Are you sure? --sample-cpu seems to work for non-system-wide profiles too.

Yep, that's why I said "Basically".  So it's not 100% guarantee.

We may disable latency column by default in this case and show warning
if it's requested.  Or we may add a new attribute to emit sched-switch
records only for idle tasks and enable the latency report only if the
data has sched-switch records.

What do you think?

Thanks,
Namhyung

> 
> > In util/evsel.c::evsel__config():
> >
> >         if (target__has_cpu(&opts->target) || opts->sample_cpu)
> >                 evsel__set_sample_bit(evsel, CPU);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06  7:09           ` Namhyung Kim
@ 2025-05-06  7:40             ` Dmitry Vyukov
  2025-05-07 23:43               ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-06  7:40 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > >
> > > Basically yes, but you can use --sample-cpu to add it.
> >
> > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
>
> Yep, that's why I said "Basically".  So it's not 100% guarantee.
>
> We may disable latency column by default in this case and show warning
> if it's requested.  Or we may add a new attribute to emit sched-switch
> records only for idle tasks and enable the latency report only if the
> data has sched-switch records.
>
> What do you think?

Depends on what problem we are trying to solve:

1. Enabling latency profiling for system-wide mode.

2. Switch events bloating trace too much.

3. Lost switch events lead to imprecise accounting.

The patch mentions all 3 :)
But I think 2 and 3 are not really specific to system-wide mode.
An active single process profile can emit more samples than a
system-wide profile on a lightly loaded system.
Similarly, if we rely on switch events for system-wide mode, then it's
equally subject to the lost events problem.

For problem 1: we can just permit --latency for system wide mode and
fully rely on switch events.
It's not any worse than we do now (wrt both profile size and lost events).

For problem 2: yes, we could emit only switches to idle tasks. Or
maybe just a fake CPU sample for an idle task? That's effectively what
we want, then your current accounting code will work w/o any changes.
This should help wrt trace size only for system-wide mode (provided
that user already enables CPU accounting for other reasons, otherwise
it's unclear what's better -- attaching CPU to each sample, or writing
switch events).

For problem 3: switches to idle task won't really help. There can be
lots of them, and missing any will lead to wrong accounting.
A principled approach would be to attach a per-thread scheduler
quantum sequence number to each CPU sample. The sequence number would
be incremented on every context switch. Then any subset of CPU should
be enough to understand when a task was scheduled in and out
(scheduled in on the first CPU sample with sequence number N, and
switched out on the last sample with sequence number N).
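
For example, with a made-up sample stream for one CPU (the timestamps
and SEQ values are hypothetical, just to illustrate the rule):

  t=100 TID=42 SEQ=7   <- first SEQ=7 sample: thread 42 scheduled in at t=100
  t=105 TID=42 SEQ=7
  t=230 TID=42 SEQ=8   <- SEQ changed: pretend a sched-out at t=105 (the
                          last SEQ=7 sample) and a sched-in again at t=230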

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-05  8:03     ` Dmitry Vyukov
@ 2025-05-06 14:57       ` Arnaldo Carvalho de Melo
  2025-05-06 16:58         ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-05-06 14:57 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, Ingo Molnar, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen

On Mon, May 05, 2025 at 10:03:05AM +0200, Dmitry Vyukov wrote:
> On Sun, 4 May 2025 at 21:52, Namhyung Kim <namhyung@kernel.org> wrote:
> > On Sun, May 04, 2025 at 10:22:26AM +0200, Ingo Molnar wrote:
> > > > Here's an example:

> > > >   # perf record -a -- perf bench sched messaging

> > > > This basically forks each sender and receiver tasks for the run.

> > > >   # perf report --latency -s comm --stdio
> > > >   ...
> > > >   #
> > > >   #  Latency  Overhead  Command
> > > >   # ........  ........  ...............
> > > >   #
> > > >       98.14%    95.97%  sched-messaging
> > > >        0.78%     0.93%  gnome-shell
> > > >        0.36%     0.34%  ptyxis
> > > >        0.23%     0.23%  kworker/u112:0-
> > > >        0.23%     0.44%  perf
> > > >        0.08%     0.10%  KMS thread
> > > >        0.05%     0.05%  rcu_preempt
> > > >        0.05%     0.05%  kworker/u113:2-
> > > >        ...

> > > Just a generic user-interface comment: I had to look up what 'latency'
> > > means in this context, and went about 3 hops deep into various pieces
> > > of description until I found Documentation/cpu-and-latency-overheads.txt,
> > > where after a bit of head-scratching I realized that 'latency' is a
> > > weird alias for 'wall-clock time'...

> > > This is *highly* confusing terminology IMHO.

> > Sorry for the confusion.  I know I'm terrible at naming things. :)

> > Actually Dmitry used the term 'wall-clock' profiling at first when he
> > implemented this feature but I thought it was not clear what it meant
> > for non-cycle events.  As 'overhead' is also a generic term, we ended
> > up with 'latency'.
 
> Exactly :)

So, the 'cycles' event is the most commonly used; it's the default if one
does 'perf record' and doesn't ask for a specific event.

When we notice that 'cycles' and 'instructions' are counted, we provide
an IPC, as that is a well known metric for that combo:

  root@number:~# perf stat -e cycles,instructions sleep 1
  
   Performance counter stats for 'sleep 1':
  
           1,149,493      cycles                                                                
           1,197,279      instructions                     #    1.04  insn per cycle            
  
         1.000645408 seconds time elapsed
  
         0.000599000 seconds user
         0.000000000 seconds sys
  
  
  root@number:~#

So maybe when we notice that cycles was used 'perf report/top' should
use the term 'wall-clock' for the column name?

So instead of having a --latency we could have a hint that would tell
the user that if this knob was set:

  $ perf config report.wall-clock=true

Then if 'cycles' is present we would have:

  # perf report -s comm --stdio
  ...
  #
  # Wall-clock  Overhead  Command
  # ..........  ........  ...............
  #
        98.14%    95.97%  sched-messaging
         0.78%     0.93%  gnome-shell
         0.36%     0.34%  ptyxis
         0.23%     0.23%  kworker/u112:0-
         0.23%     0.44%  perf
         0.08%     0.10%  KMS thread
         0.05%     0.05%  rcu_preempt
         0.05%     0.05%  kworker/u113:2-
         ...

And that would become the "default", i.e. show the "Wall-clock" column
when appropriate, which is something I think Dmitry partially tried
when submitting the first version of his patchset (he used it for all
events, IIRC; this would be just if 'cycles' is present)?

What term to use with other events remains an exercise, but to a lot
of people wanting just the 'cycles' event, as was used in the above
patch log message:

  # perf record -a -- perf bench sched messaging

That would be enough, no?

- Arnaldo
 
> I've also talked with a bunch of people about this, and everybody
> proposes their own term and is confused by all other proposals.
 
> The problem is that we did not just lack this fundamental profiling
> capability in all profilers out there, but we, as a community, also
> still don't know how to even talk about these things...
 
> There is no terminology that would be clear for everybody. E.g. when
> some people hear wall-clock, they assume that it samples every thread
> (runnable and non-runnable) every time unit (but that's a vastly
> different profile from this one).
 
> > > 'Latency' is a highly overloaded concept that almost never corresponds
> > > to 'wall clock time'. It usually means a relative delay value, which is
> > > why I initially thought this somehow means instruction-latency or
> > > memory-latency profiling ...

> > > Ie. 'latency' in its naive meaning, is on the exact opposite side of
> > > the terminology spectrum of where it should be: it suggests relative
> > > time, while in reality it's connected to wall-clock/absolute time ...

> > > *Please* use something else. Wall-clock is fine, as
> > > cpu-and-latency-overheads.txt uses initially, but so would be other
> > > combinations:

> > >    #1: 'CPU time' vs. 'real time'

> > >         This is short, although a disadvantage is the possible
> > >         'real-time kernel' source of confusion here.

> > >    #2: 'CPU time' vs. 'wall-clock time'

> > >         This is longer but OK and unambiguous.

> > >    #3: 'relative time' vs. 'absolute time'

> > >         This is short and straightforward, and might be my favorite
> > >         personally, because relative/absolute is such an unambiguous
> > >         and well-known terminology and often paired in a similar
> > >         fashion.

> > >    #4: 'CPU time' vs. 'absolute time'

> > >         This is a combination of #1 and #3 that keeps the 'CPU time'
> > >         terminology for relative time. The CPU/absolute pairing is not
> > >         that intuitive though.

> > >    #5: 'CPU time' vs. 'latency'

> > >         This is really, really bad and unintuitive. Sorry to be so
> > >         harsh and negative about this choice, but this is such a nice
> > >         feature, which suffers from confusing naming. :-)

> > Thanks for your suggestions.  My main concern is that it's not just
> > about cpu-time and wallclock-time.  perf tools can measure any events
> > that have different meanings.  So I think we need generic terms to cover
> > them.
 
> I did not see a conflict there. It's possible to sample, say, cache
> misses per unit of CPU time or per unit of wall-clock time.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06 14:57       ` Arnaldo Carvalho de Melo
@ 2025-05-06 16:58         ` Arnaldo Carvalho de Melo
  2025-05-07  9:58           ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-05-06 16:58 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, Ingo Molnar, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen

On Tue, May 06, 2025 at 11:57:19AM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, May 05, 2025 at 10:03:05AM +0200, Dmitry Vyukov wrote:
> > On Sun, 4 May 2025 at 21:52, Namhyung Kim <namhyung@kernel.org> wrote:
> > > On Sun, May 04, 2025 at 10:22:26AM +0200, Ingo Molnar wrote:
> > > > > Here's an example:
> 
> > > > >   # perf record -a -- perf bench sched messaging
> 
> > > > > This basically forks each sender and receiver tasks for the run.
> 
> > > > >   # perf report --latency -s comm --stdio
> > > > >   ...
> > > > >   #
> > > > >   #  Latency  Overhead  Command
> > > > >   # ........  ........  ...............
> > > > >   #
> > > > >       98.14%    95.97%  sched-messaging
> > > > >        0.78%     0.93%  gnome-shell
> > > > >        0.36%     0.34%  ptyxis
> > > > >        0.23%     0.23%  kworker/u112:0-
> > > > >        0.23%     0.44%  perf
> > > > >        0.08%     0.10%  KMS thread
> > > > >        0.05%     0.05%  rcu_preempt
> > > > >        0.05%     0.05%  kworker/u113:2-
> > > > >        ...
> 
> > > > Just a generic user-interface comment: I had to look up what 'latency'
> > > > means in this context, and went about 3 hops deep into various pieces
> > > > of description until I found Documentation/cpu-and-latency-overheads.txt,
> > > > where after a bit of head-scratching I realized that 'latency' is a
> > > > weird alias for 'wall-clock time'...
> 
> > > > This is *highly* confusing terminology IMHO.
> 
> > > Sorry for the confusion.  I know I'm terrible at naming things. :)
> 
> > > Actually Dmitry used the term 'wall-clock' profiling at first when he
> > > implemented this feature but I thought it was not clear what it meant
> > > for non-cycle events.  As 'overhead' is also a generic term, we ended
> > > up with 'latency'.
>  
> > Exactly :)
> 
> So, the 'cycles' event is the most commonly used; it's the default if one
> does 'perf record' and doesn't ask for a specific event.
> 
> When we notice that 'cycles' and 'instructions' are counted, we provide
> an IPC, as that is a well known metric for that combo:
> 
>   root@number:~# perf stat -e cycles,instructions sleep 1
>   
>    Performance counter stats for 'sleep 1':
>   
>            1,149,493      cycles                                                                
>            1,197,279      instructions                     #    1.04  insn per cycle            
>   
>          1.000645408 seconds time elapsed
>   
>          0.000599000 seconds user
>          0.000000000 seconds sys
>   
>   
>   root@number:~#
> 
> So maybe when we notice that cycles was used 'perf report/top' should
> use the term 'wall-clock' for the column name?
> 
> So instead of having a --latency we could have a hint that would tell
> the user that if this knob was set:
> 
>   $ perf config report.wall-clock=true
> 
> Then if 'cycles' is present we would have:

Something along the lines of the patch below, but there are several
details to take into account with what is in the current codebase: the
knob would only be used, or suggested to the user, if what is needed
for the latency/wall-clock time is present in the perf.data file being
used, so some refactorings are needed.  I'll try to follow up on it.

But just for reference:

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index f0299c7ee0254a37..20874800d967ffb5 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -51,6 +51,7 @@
 #include "util/util.h" // perf_tip()
 #include "ui/ui.h"
 #include "ui/progress.h"
+#include "ui/util.h"
 #include "util/block-info.h"
 
 #include <dlfcn.h>
@@ -127,6 +128,11 @@ static int report__config(const char *var, const char *value, void *cb)
 {
 	struct report *rep = cb;
 
+	if (!strcmp(var, "report.prefer_latency")) {
+		symbol_conf.prefer_latency = perf_config_bool(var, value);
+		symbol_conf.prefer_latency_set = true;
+		return 0;
+	}
 	if (!strcmp(var, "report.group")) {
 		symbol_conf.event_group = perf_config_bool(var, value);
 		return 0;
@@ -1734,6 +1740,15 @@ int cmd_report(int argc, const char **argv)
 		symbol_conf.annotate_data_sample = true;
 	}
 
+	if (!symbol_conf.prefer_latency) {
+		if (ui__dialog_yesno("Do you want to use latency mode?\n"
+				     "That will add a 'Latency' column that would mean\n"
+				     "'wall-clock' time when used with cycles, for instance\n")) {
+                       symbol_conf.prefer_latency = symbol_conf.prefer_latency_set = true;
+                       perf_config__set_variable("report.prefer_latency", "yes");
+               }
+       }
+
 	symbol_conf.enable_latency = true;
 	if (report.disable_order || !perf_session__has_switch_events(session)) {
 		if (symbol_conf.parallelism_list_str ||
diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
index cd9aa82c7d5ad941..f87071f5dedd94ca 100644
--- a/tools/perf/util/symbol_conf.h
+++ b/tools/perf/util/symbol_conf.h
@@ -51,6 +51,7 @@ struct symbol_conf {
 			annotate_data_sample,
 			skip_empty,
 			enable_latency,
+			prefer_latency_set,
 			prefer_latency;
 	const char	*vmlinux_name,
 			*kallsyms_name,

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06 16:58         ` Arnaldo Carvalho de Melo
@ 2025-05-07  9:58           ` Dmitry Vyukov
  2025-05-07 15:47             ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-07  9:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Namhyung Kim, Ingo Molnar, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen

On Tue, 6 May 2025 at 18:58, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Tue, May 06, 2025 at 11:57:19AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Mon, May 05, 2025 at 10:03:05AM +0200, Dmitry Vyukov wrote:
> > > On Sun, 4 May 2025 at 21:52, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > On Sun, May 04, 2025 at 10:22:26AM +0200, Ingo Molnar wrote:
> > > > > > Here's an example:
> >
> > > > > >   # perf record -a -- perf bench sched messaging
> >
> > > > > > This basically forks each sender and receiver tasks for the run.
> >
> > > > > >   # perf report --latency -s comm --stdio
> > > > > >   ...
> > > > > >   #
> > > > > >   #  Latency  Overhead  Command
> > > > > >   # ........  ........  ...............
> > > > > >   #
> > > > > >       98.14%    95.97%  sched-messaging
> > > > > >        0.78%     0.93%  gnome-shell
> > > > > >        0.36%     0.34%  ptyxis
> > > > > >        0.23%     0.23%  kworker/u112:0-
> > > > > >        0.23%     0.44%  perf
> > > > > >        0.08%     0.10%  KMS thread
> > > > > >        0.05%     0.05%  rcu_preempt
> > > > > >        0.05%     0.05%  kworker/u113:2-
> > > > > >        ...
> >
> > > > > Just a generic user-interface comment: I had to look up what 'latency'
> > > > > means in this context, and went about 3 hops deep into various pieces
> > > > > of description until I found Documentation/cpu-and-latency-overheads.txt,
> > > > > where after a bit of head-scratching I realized that 'latency' is a
> > > > > weird alias for 'wall-clock time'...
> >
> > > > > This is *highly* confusing terminology IMHO.
> >
> > > > Sorry for the confusion.  I know I'm terrible at naming things. :)
> >
> > > > Actually Dmitry used the term 'wall-clock' profiling at first when he
> > > > implemented this feature but I thought it was not clear what it meant
> > > > for non-cycle events.  As 'overhead' is also a generic term, we ended
> > > > up with 'latency'.
> >
> > > Exactly :)
> >
> > So, the 'cycles' event is the most commonly used; it's the default if one
> > does 'perf record' and doesn't ask for a specific event.
> >
> > When we notice that 'cycles' and 'instructions' are counted, we provide
> > an IPC, as that is a well known metric for that combo:
> >
> >   root@number:~# perf stat -e cycles,instructions sleep 1
> >
> >    Performance counter stats for 'sleep 1':
> >
> >            1,149,493      cycles
> >            1,197,279      instructions                     #    1.04  insn per cycle
> >
> >          1.000645408 seconds time elapsed
> >
> >          0.000599000 seconds user
> >          0.000000000 seconds sys
> >
> >
> >   root@number:~#
> >
> > So maybe when we notice that cycles was used 'perf report/top' should
> > use the term 'wall-clock' for the column name?
> >
> > So instead of having a --latency we could have a hint that would tell
> > the user that if this knob was set:
> >
> >   $ perf config report.wall-clock=true


I am not sure it can be as simple as a single global knob.

First, record needs to collect additional info and that may be
somewhat expensive.

Second, report now has several modes:
 - it can show latency, but order by the current overhead
 - it can also show latency, and order by latency
A user wants one or another depending on what they are optimizing
(possibly looking at both).

I also feel that global config switches are even less discoverable by
median users (read -- ~nobody will know about that). If these things
are normal flags with a help description, then some users may
eventually discover them.



> > Then if 'cycles' is present we would have:
>
> Something along the lines of the patch below, but there are several
> details to take into account with what is in the current codebase: the
> knob would only be used, or suggested to the user, if what is needed
> for the latency/wall-clock time is present in the perf.data file being
> used, so some refactorings are needed.  I'll try to follow up on it.
>
> But just for reference:
>
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index f0299c7ee0254a37..20874800d967ffb5 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -51,6 +51,7 @@
>  #include "util/util.h" // perf_tip()
>  #include "ui/ui.h"
>  #include "ui/progress.h"
> +#include "ui/util.h"
>  #include "util/block-info.h"
>
>  #include <dlfcn.h>
> @@ -127,6 +128,11 @@ static int report__config(const char *var, const char *value, void *cb)
>  {
>         struct report *rep = cb;
>
> +       if (!strcmp(var, "report.prefer_latency")) {
> +               symbol_conf.prefer_latency = perf_config_bool(var, value);
> +               symbol_conf.prefer_latency_set = true;
> +               return 0;
> +       }
>         if (!strcmp(var, "report.group")) {
>                 symbol_conf.event_group = perf_config_bool(var, value);
>                 return 0;
> @@ -1734,6 +1740,15 @@ int cmd_report(int argc, const char **argv)
>                 symbol_conf.annotate_data_sample = true;
>         }
>
> +       if (!symbol_conf.prefer_latency) {
> +               if (ui__dialog_yesno("Do you want to use latency mode?\n"
> +                                    "That will add a 'Latency' column that would mean\n"
> +                                    "'wall-clock' time when used with cycles, for instance\n")) {
> +                       symbol_conf.prefer_latency = symbol_conf.prefer_latency_set = true;
> +                       perf_config__set_variable("report.prefer_latency", "yes");
> +               }
> +       }
> +
>         symbol_conf.enable_latency = true;
>         if (report.disable_order || !perf_session__has_switch_events(session)) {
>                 if (symbol_conf.parallelism_list_str ||
> diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
> index cd9aa82c7d5ad941..f87071f5dedd94ca 100644
> --- a/tools/perf/util/symbol_conf.h
> +++ b/tools/perf/util/symbol_conf.h
> @@ -51,6 +51,7 @@ struct symbol_conf {
>                         annotate_data_sample,
>                         skip_empty,
>                         enable_latency,
> +                       prefer_latency_set,
>                         prefer_latency;
>         const char      *vmlinux_name,
>                         *kallsyms_name,

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-07  9:58           ` Dmitry Vyukov
@ 2025-05-07 15:47             ` Arnaldo Carvalho de Melo
  2025-05-07 23:51               ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-05-07 15:47 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Namhyung Kim, Ingo Molnar, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen

On Wed, May 07, 2025 at 11:58:10AM +0200, Dmitry Vyukov wrote:
> On Tue, 6 May 2025 at 18:58, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > So maybe when we notice that cycles was used 'perf report/top' should
> > > use the term 'wall-clock' for the column name?

> > > So instead of having a --latency we could have a hint that would tell
> > > the user that if this knob was set:

> > >   $ perf config report.wall-clock=true
 
> I am not sure it can be as simple as a single global knob.

> First, record needs to collect additional info and that may be
> somewhat expensive.
 
> Second, report now has several modes:
>  - it can show latency, but order by the current overhead
>  - it can also show latency, and order by latency
> A user wants one or another depending on what they are optimizing
> (possibly looking at both).
 
> I also feel that global config switches are even less discoverable by
> median users (read -- ~nobody will know about that). If these things
> are normal flags with a help description, then some users may
> eventually discover them.

So, the addition of:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9cbc854d8b148e3491291fb615e94261970fb54

  perf config: Add a function to set one variable in .perfconfig

  To allow for setting a variable from some other tool, like with the
  "wallclock" patchset needs to allow the user to opt-in to having
  that key in the sort order for 'perf report'.

  Cc: Dmitriy Vyukov <dvyukov@google.com>


Was to tell the user about the new possibilities of profiling, once,
with something like:

root@number:~# perf report
             ┌─────────────────────────────────────────────────────┐
             │Do you want to use latency mode?                     │
             │That will add a 'Latency' column that would mean     │
             │'wall-clock' time when used with cycles, for instance│
             │                                                     │
             │                                                     │
             │Enter: Yes, ESC: No                                  │
             └─────────────────────────────────────────────────────┘
 
The wording should inform whether the current perf.data has what is
needed, and thus whether this would take immediate effect; if not, it
should say that as well, and what is needed to be able to use it,
things like you described above:

> First, record needs to collect additional info and that may be
> somewhat expensive.

If the user says that the feature is desired by pressing Enter, follow
up questions can be asked as well, like you described:

> Second, report now has several modes:

>  - it can show latency, but order by the current overhead
>  - it can also show latency, and order by latency

> A user wants one or another depending on what they are optimizing
> (possibly looking at both).

Once these are answered, the questions go away, as they are now
recorded in the user's ~/.perfconfig file and will be used when what is
needed is present in the perf.data file.

The user should also be informed that to change that knob, 'perf config'
is your friend.

Additionally some hotkey could be made available to change that
behaviour, on the fly, with the option of making the new mode the new
default, again writing to ~/.perfconfig.

Sometimes this "on the fly" needs reprocessing of the whole perf.data
file, sometimes it may be a matter of just showing extra available
fields, like the 'V' key now, that goes on bumping the verbosity in the
'perf report' and 'perf top' browsers (they share the hists browser), or
'j' that toggles showing the lines from jump sources to jump targets in
the annotate browser, 'J' to show how many jumps target a particular
jump target, also in the annotate browser, etc.

This could help experimenting with the various modes of doing profiling,
interactively, to see the problem from different angles, without having
to restart the whole process by exiting the 'perf report' to add new
command line options, then restart, etc, reusing computation sometimes
when switching views, etc.

To get the default behaviour it's just a matter of renaming
~/.perfconfig to ~/.perfconfig.some_suitable_name_for_later_use.

This way interesting new features like this don't get buried behind
either command line options or man pages, giving the user at least an
opportunity to be informed about it.

That is what the patch below barely starts sketching, not doing all I
just described.

- Arnaldo

> > > Then if 'cycles' is present we would have:

> > Something along the lines of the patch below, but there are several
> > details to take into account with what is in the current codebase: the
> > knob would only be used, or suggested to the user, if what is needed
> > for the latency/wall-clock time is present in the perf.data file being
> > used, so some refactorings are needed.  I'll try to follow up on it.

> > But just for reference:

> > diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> > index f0299c7ee0254a37..20874800d967ffb5 100644
> > --- a/tools/perf/builtin-report.c
> > +++ b/tools/perf/builtin-report.c
> > @@ -51,6 +51,7 @@
> >  #include "util/util.h" // perf_tip()
> >  #include "ui/ui.h"
> >  #include "ui/progress.h"
> > +#include "ui/util.h"
> >  #include "util/block-info.h"
> >
> >  #include <dlfcn.h>
> > @@ -127,6 +128,11 @@ static int report__config(const char *var, const char *value, void *cb)
> >  {
> >         struct report *rep = cb;
> >
> > +       if (!strcmp(var, "report.prefer_latency")) {
> > +               symbol_conf.prefer_latency = perf_config_bool(var, value);
> > +               symbol_conf.prefer_latency_set = true;
> > +               return 0;
> > +       }
> >         if (!strcmp(var, "report.group")) {
> >                 symbol_conf.event_group = perf_config_bool(var, value);
> >                 return 0;
> > @@ -1734,6 +1740,15 @@ int cmd_report(int argc, const char **argv)
> >                 symbol_conf.annotate_data_sample = true;
> >         }
> >
> > +       if (!symbol_conf.prefer_latency) {
> > +               if (ui__dialog_yesno("Do you want to use latency mode?\n"
> > +                                    "That will add a 'Latency' column that would mean\n"
> > +                                    "'wall-clock' time when used with cycles, for instance\n")) {
> > +                       symbol_conf.prefer_latency = symbol_conf.prefer_latency_set = true;
> > +                       perf_config__set_variable("report.prefer_latency", "yes");
> > +               }
> > +       }
> > +
> >         symbol_conf.enable_latency = true;
> >         if (report.disable_order || !perf_session__has_switch_events(session)) {
> >                 if (symbol_conf.parallelism_list_str ||
> > diff --git a/tools/perf/util/symbol_conf.h b/tools/perf/util/symbol_conf.h
> > index cd9aa82c7d5ad941..f87071f5dedd94ca 100644
> > --- a/tools/perf/util/symbol_conf.h
> > +++ b/tools/perf/util/symbol_conf.h
> > @@ -51,6 +51,7 @@ struct symbol_conf {
> >                         annotate_data_sample,
> >                         skip_empty,
> >                         enable_latency,
> > +                       prefer_latency_set,
> >                         prefer_latency;
> >         const char      *vmlinux_name,
> >                         *kallsyms_name,

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-06  7:40             ` Dmitry Vyukov
@ 2025-05-07 23:43               ` Namhyung Kim
  2025-05-08 12:24                 ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-07 23:43 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > >
> > > > Basically yes, but you can use --sample-cpu to add it.
> > >
> > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> >
> > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> >
> > We may disable latency column by default in this case and show warning
> > if it's requested.  Or we may add a new attribute to emit sched-switch
> > records only for idle tasks and enable the latency report only if the
> > data has sched-switch records.
> >
> > What do you think?
> 
> Depends on what problem we are trying to solve:
> 
> 1. Enabling latency profiling for system-wide mode.
> 
> 2. Switch events bloating trace too much.
> 
> 3. Lost switch events lead to imprecise accounting.
> 
> The patch mentions all 3 :)
> But I think 2 and 3 are not really specific to system-wide mode.
> An active single process profile can emit more samples than a
> system-wide profile on a lightly loaded system.

True.  But we don't need to care about lightly loaded systems as they
won't cause problems.


> Similarly, if we rely on switch events for system-wide mode, then it's
> equally subject to the lost events problem.

Right, but I'm afraid that in practice it'll increase the chance of
lost records in system-wide mode.  The default size of a sample in
system-wide mode is 56 bytes and the size of a switch record is 48
bytes.  And the default sample frequency is 4000 Hz, but we cannot
control the rate of context switches.  I saw around 10000 Hz of
switches per CPU on my work env.
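Back-of-the-envelope with those numbers, that's roughly 56 B * 4000 Hz
~ 224 KB/s per CPU for samples, while 48 B * 10000 Hz ~ 480 KB/s per
CPU just for switch records, so the switches alone would roughly triple
the per-CPU data rate.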

> 
> For problem 1: we can just permit --latency for system wide mode and
> fully rely on switch events.
> It's not any worse than we do now (wrt both profile size and lost events).

This can be an option and it'd work well on lightly loaded systems.
Maybe we can just try it first.  But I think it's better to have an
option to make it work on heavily loaded systems.

> 
> For problem 2: yes, we could emit only switches to idle tasks. Or
> maybe just a fake CPU sample for an idle task? That's effectively what
> we want, then your current accounting code will work w/o any changes.
> This should help wrt trace size only for system-wide mode (provided
> that user already enables CPU accounting for other reasons, otherwise
> it's unclear what's better -- attaching CPU to each sample, or writing
> switch events).

I'm not sure how we can add the fake samples.  The switch events will be
from the kernel and we may add the condition in the attribute.

And PERF_SAMPLE_CPU is on by default in system-wide mode.

> 
> For problem 3: switches to idle task won't really help. There can be
> lots of them, and missing any will lead to wrong accounting.

I don't know how severe the situation will be.  On heavily loaded
systems, the idle task won't run much and data size won't increase.
On lightly loaded systems, increased data will likely be handled well.


> A principled approach would be to attach a per-thread scheduler
> quantum sequence number to each CPU sample. The sequence number would
> be incremented on every context switch. Then any subset of CPU should
> be enough to understand when a task was scheduled in and out
> (scheduled in on the first CPU sample with sequence number N, and
> switched out on the last sample with sequence number N).

I'm not sure how it can help.  We don't need the switch info itself.
What's needed is when the CPU was idle, right?

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-07 15:47             ` Arnaldo Carvalho de Melo
@ 2025-05-07 23:51               ` Namhyung Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Namhyung Kim @ 2025-05-07 23:51 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Dmitry Vyukov, Ingo Molnar, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, LKML, linux-perf-users, Andi Kleen

On Wed, May 07, 2025 at 12:47:26PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, May 07, 2025 at 11:58:10AM +0200, Dmitry Vyukov wrote:
> > On Tue, 6 May 2025 at 18:58, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > > So maybe when we notice that cycles was used 'perf report/top' should
> > > > use the term 'wall-clock' for the column name?
> 
> > > > So instead of having a --latency we could have a hint that would tell
> > > > the user that if this knob was set:
> 
> > > >   $ perf config report.wall-clock=true
>  
> > I am not sure it can be as simple as a single global knob.
> 
> > First, record needs to collect additional info and that may be
> > somewhat expensive.
>  
> > Second, report now has several modes:
> >  - it can show latency, but order by the current overhead
> >  - it can also show latency, and order by latency
> > A user wants one or another depending on what they are optimizing
> > (possibly looking at both).
>  
> > I also feel that global config switches are even less discoverable by
> > median users (read -- ~nobody will know about that). If these things
> > are normal flags with a help description, then some users may
> > eventually discover them.
> 
> So, the addition of:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9cbc854d8b148e3491291fb615e94261970fb54
> 
>   perf config: Add a function to set one variable in .perfconfig
> 
>   To allow for setting a variable from some other tool, like with the
>   "wallclock" patchset needs to allow the user to opt-in to having
>   that key in the sort order for 'perf report'.
> 
>   Cc: Dmitriy Vyukov <dvyukov@google.com>
> 
> 
> Was to tell the user about the new possibilities of profiling, once,
> with something like:
> 
> root@number:~# perf report
>              ┌─────────────────────────────────────────────────────┐
>              │Do you want to use latency mode?                     │
>              │That will add a 'Latency' column that would mean     │
>              │'wall-clock' time when used with cycles, for instance│
>              │                                                     │
>              │                                                     │
>              │Enter: Yes, ESC: No                                  │
>              └─────────────────────────────────────────────────────┘
>  
> The wording should inform whether the current perf.data has what is
> needed, and thus whether this would take immediate effect; if not, it
> should say that as well, and what is needed to be able to use it,
> things like you described above:

Looks useful!  Maybe we want to add it to the interactive TUI only..

> 
> > First, record needs to collect additional info and that may be
> > somewhat expensive.
> 
> If the user says that the feature is desired by pressing Enter, follow
> up questions can be asked as well, like you described:

I'm afraid it can have a long chain of questions when a new user starts
to use the perf tool with any config.  Maybe we need to limit it somehow.

> 
> > Second, report now has several modes:
> 
> >  - it can show latency, but order by the current overhead
> >  - it can also show latency, and order by latency
> 
> > A user wants one or another depending on what they are optimizing
> > (possibly looking at both).
> 
> Once these are answered, the questions go away, as they are now
> recorded in the user's ~/.perfconfig file and will be used when what is
> needed is present in the perf.data file.

Sounds good.

> 
> The user should also be informed that to change that knob, 'perf config'
> is your friend.
> 
> Additionally some hotkey could be made available to change that
> behaviour, on the fly, with the option of making the new mode the new
> default, again writing to ~/.perfconfig.
> 
> Sometimes this "on the fly" needs reprocessing of the whole perf.data
> file, sometimes it may be a matter of just showing extra available
> fields, like the 'V' key now, that goes on bumping the verbosity in the
> 'perf report' and 'perf top' browsers (they share the hists browser), or
> 'j' that toggles showing the lines from jump sources to jump targets in
> the annotate browser, 'J' to show how many jumps target a particular
> jump target, also in the annotate browser, etc.
> 
> This could help experimenting with the various modes of doing profiling,
> interactively, to see the problem from different angles, without having
> to restart the whole process by exiting the 'perf report' to add new
> command line options, then restart, etc, reusing computation sometimes
> when switching views, etc.
> 
> To get the default behaviour it's just a matter of renaming
> ~/.perfconfig to ~/.perfconfig.some_suitable_name_for_later_use.
> 
> This way interesting new features like this don't get buried behind
> either command line options or man pages, giving the user at least an
> opportunity to be informed about it.
> 
> That is what the patch below barely starts sketching, not doing all I
> just described.

Sounds like a good idea.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-07 23:43               ` Namhyung Kim
@ 2025-05-08 12:24                 ` Dmitry Vyukov
  2025-05-16 16:33                   ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-08 12:24 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > >
> > > > > Basically yes, but you can use --sample-cpu to add it.
> > > >
> > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > >
> > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > >
> > > We may disable latency column by default in this case and show warning
> > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > records only for idle tasks and enable the latency report only if the
> > > data has sched-switch records.
> > >
> > > What do you think?
> >
> > Depends on what problem we are trying to solve:
> >
> > 1. Enabling latency profiling for system-wide mode.
> >
> > 2. Switch events bloating trace too much.
> >
> > 3. Lost switch events lead to imprecise accounting.
> >
> > The patch mentions all 3 :)
> > But I think 2 and 3 are not really specific to system-wide mode.
> > An active single process profile can emit more samples than a
> > system-wide profile on a lightly loaded system.
>
> True.  But we don't need to care about lightly loaded systems as they
> won't cause problems.
>
>
> > Similarly, if we rely on switch events for system-wide mode, then it's
> > equally subject to the lost events problem.
>
> Right, but I'm afraid that in practice it'll increase the chance of
> lost records in system-wide mode.  The default size of a sample in
> system-wide mode is 56 bytes and the size of a switch record is 48
> bytes.  And the default sample frequency is 4000 Hz, but we cannot
> control the rate of context switches.  I saw around 10000 Hz of
> switches per CPU on my work env.
>
> >
> > For problem 1: we can just permit --latency for system wide mode and
> > fully rely on switch events.
> > It's not any worse than we do now (wrt both profile size and lost events).
>
> This can be an option and it'd work well on lightly loaded systems.
> Maybe we can just try it first.  But I think it's better to have an
> option to make it work on heavily loaded systems.
>
> >
> > For problem 2: yes, we could emit only switches to idle tasks. Or
> > maybe just a fake CPU sample for an idle task? That's effectively what
> > we want, then your current accounting code will work w/o any changes.
> > This should help wrt trace size only for system-wide mode (provided
> > that user already enables CPU accounting for other reasons, otherwise
> > it's unclear what's better -- attaching CPU to each sample, or writing
> > switch events).
>
> I'm not sure how we can add the fake samples.  The switch events will be
> from the kernel and we may add the condition in the attribute.
>
> And PERF_SAMPLE_CPU is on by default in system-wide mode.
>
> >
> > For problem 3: switches to idle task won't really help. There can be
> > lots of them, and missing any will lead to wrong accounting.
>
> I don't know how severe the situation will be.  On heavily loaded
> systems, the idle task won't run much and data size won't increase.
> On lightly loaded systems, increased data will likely be handled well.
>
>
> > A principled approach would be to attach a per-thread scheduler
> > quantum sequence number to each CPU sample. The sequence number would
> > be incremented on every context switch. Then any subset of CPU should
> > be enough to understand when a task was scheduled in and out
> > (scheduled in on the first CPU sample with sequence number N, and
> > switched out on the last sample with sequence number N).
>
> I'm not sure how it can help.  We don't need the switch info itself.
> What's needed is when the CPU was idle, right?

I mean the following.
Each sample has a TID.
We add a SEQ field, which is per-thread and is incremented after every
rescheduling of the thread.

When we see the last sample for (TID,SEQ), we pretend there is SCHED
OUT event for this thread at this timestamp. When we see the first
sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
thread at this timestamp.

These SCHED IN/OUT events are not injected by the kernel. We just
pretend they happen for accounting purposes. We may actually
materialize them in the perf tool, or me may just update parallelism
as if they happen.

With this scheme we can lose absolutely any subset of samples, and
still get very precise accounting. When we lose samples, the profile
of course becomes a bit less precise, but the effect is local and
recoverable.

If we lose the last/first event for (TID,SEQ), then we slightly
shorten/postpone the thread accounting in the process parallelism
level. If we lose a middle (TID,SEQ), then parallelism is not
affected.

A switch from a thread to itself is not a problem. We will just
inject a SCHED OUT followed by SCHED IN. But exactly the same happens
now when the kernel injects these events.

But if we switch to the idle task and get no samples for some period of
time on the CPU, then we properly inject SCHED OUT/IN that will
account for the thread not actually running.
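
To make this a bit more concrete, here is a minimal sketch of the
report-side accounting I have in mind; all struct/field names below are
made up, and the actual latency weighting is left out:

#include <stdbool.h>
#include <stdint.h>

struct process_acct {
	int parallelism;	/* threads of this process currently accounted as running */
};

struct thread_acct {
	struct process_acct *proc;
	uint64_t seq;		/* scheduler quantum of the last sample seen */
	uint64_t last_time;	/* timestamp of the last sample in that quantum */
	bool	 running;	/* currently accounted as scheduled in */
};

/* Called for every CPU sample of this thread, in timestamp order. */
static void account_cpu_sample(struct thread_acct *t, uint64_t seq, uint64_t time)
{
	if (t->running && seq != t->seq) {
		/*
		 * The previous sample was the last one of the old quantum:
		 * pretend a SCHED OUT happened at t->last_time.
		 */
		t->proc->parallelism--;
		t->running = false;
	}
	if (!t->running) {
		/* First sample of quantum 'seq': pretend a SCHED IN here. */
		t->proc->parallelism++;
		t->running = true;
		t->seq = seq;
	}
	t->last_time = time;
}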

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-08 12:24                 ` Dmitry Vyukov
@ 2025-05-16 16:33                   ` Namhyung Kim
  2025-05-19  6:00                     ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-16 16:33 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

Hello,

Sorry for the delay.

On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > >
> > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > >
> > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > >
> > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > >
> > > > We may disable latency column by default in this case and show warning
> > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > records only for idle tasks and enable the latency report only if the
> > > > data has sched-switch records.
> > > >
> > > > What do you think?
> > >
> > > Depends on what problem we are trying to solve:
> > >
> > > 1. Enabling latency profiling for system-wide mode.
> > >
> > > 2. Switch events bloating trace too much.
> > >
> > > 3. Lost switch events lead to imprecise accounting.
> > >
> > > The patch mentions all 3 :)
> > > But I think 2 and 3 are not really specific to system-wide mode.
> > > An active single process profile can emit more samples than a
> > > system-wide profile on a lightly loaded system.
> >
> > True.  But we don't need to care about lightly loaded systems as they
> > won't cause problems.
> >
> >
> > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > equally subject to the lost events problem.
> >
> > Right, but I'm afraid that in practice it'll increase the chance of
> > lost records in system-wide mode.  The default size of a sample in
> > system-wide mode is 56 bytes and the size of a switch record is 48
> > bytes.  And the default sample frequency is 4000 Hz, but we cannot
> > control the rate of context switches.  I saw around 10000 Hz of
> > switches per CPU on my work env.
> >
> > >
> > > For problem 1: we can just permit --latency for system wide mode and
> > > fully rely on switch events.
> > > It's not any worse than we do now (wrt both profile size and lost events).
> >
> > This can be an option and it'd work well on lightly loaded systems.
> > Maybe we can just try it first.  But I think it's better to have an
> > option to make it work on heavily loaded systems.
> >
> > >
> > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > we want, then your current accounting code will work w/o any changes.
> > > This should help wrt trace size only for system-wide mode (provided
> > > that user already enables CPU accounting for other reasons, otherwise
> > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > switch events).
> >
> > I'm not sure how we can add the fake samples.  The switch events will be
> > from the kernel and we may add the condition in the attribute.
> >
> > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> >
> > >
> > > For problem 3: switches to idle task won't really help. There can be
> > > lots of them, and missing any will lead to wrong accounting.
> >
> > I don't know how severe the situation will be.  On heavily loaded
> > systems, the idle task won't run much and data size won't increase.
> > On lightly loaded systems, increased data will likely be handled well.
> >
> >
> > > A principled approach would be to attach a per-thread scheduler
> > > quantum sequence number to each CPU sample. The sequence number would
> > > be incremented on every context switch. Then any subset of CPU should
> > > be enough to understand when a task was scheduled in and out
> > > (scheduled in on the first CPU sample with sequence number N, and
> > > switched out on the last sample with sequence number N).
> >
> > I'm not sure how it can help.  We don't need the switch info itself.
> > What's needed is when the CPU was idle, right?
> 
> I mean the following.
> Each sample has a TID.
> We add a SEQ field, which is per-thread and is incremented after every
> rescheduling of the thread.
> 
> When we see the last sample for (TID,SEQ), we pretend there is SCHED
> OUT event for this thread at this timestamp. When we see the first
> sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> thread at this timestamp.
> 
> These SCHED IN/OUT events are not injected by the kernel. We just
> pretend they happen for accounting purposes. We may actually
> materialize them in the perf tool, or me may just update parallelism
> as if they happen.

Thanks for the explanation.  But I don't think it needs the SEQ and
SCHED IN/OUT generated from it to track lost records.  Please see below.

> 
> With this scheme we can lose absolutely any subset of samples, and
> still get very precise accounting. When we lose samples, the profile
> of course becomes a bit less precise, but the effect is local and
> recoverable.
> 
> If we lose the last/first event for (TID,SEQ), then we slightly
> shorten/postpone the thread accounting in the process parallelism
> level. If we lose a middle (TID,SEQ), then parallelism is not
> affected.

I'm afraid it cannot check parallelism by just seeing the current thread.
I guess it would need information from other threads even if it has the
same SEQ.

Also, postponing thread accounting can be complex.  I think it should
wait for all other threads to get a sample.  Maybe some threads exited
and their samples were lost too.

In my approach, I can clear the current thread from all CPUs when it
sees a LOST record and restart calculation of parallelism.

> 
> A switch from a thread to itself is not a problem. We will just
> inject a SCHED OUT followed by SCHED IN. But exactly the same happens
> now when the kernel injects these events.
> 
> But if we switch to the idle task and get no samples for some period of
> time on the CPU, then we properly inject SCHED OUT/IN that will
> account for the thread not actually running.

Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
CPU and clear the current thread after some period (2x the sampling
period for the given freq?).  But it may slow down perf report as it'd
check all CPUs for each sample.
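
Just to sketch what I mean (all names below are made up, this is not
against the actual code):

#include <stddef.h>
#include <stdint.h>

struct thread_acct;

/* hypothetical helper: account a sched-out for this thread at 'when' */
void thread_acct__sched_out(struct thread_acct *t, uint64_t when);

struct cpu_acct {
	uint64_t last_time;		/* timestamp of the last sample on this CPU */
	struct thread_acct *curr;	/* thread currently accounted as running here */
};

/* Expire CPUs that have been quiet for ~2 sampling periods. */
static void expire_idle_cpus(struct cpu_acct *cpus, int nr_cpus,
			     uint64_t now, uint64_t sample_period_ns)
{
	uint64_t timeout = 2 * sample_period_ns;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		struct cpu_acct *c = &cpus[cpu];

		if (c->curr && now - c->last_time > timeout) {
			/* pretend it was scheduled out at its last sample */
			thread_acct__sched_out(c->curr, c->last_time);
			c->curr = NULL;
		}
	}
}

The per-sample scan over all CPUs is the part I worry about above; it
could probably be amortized by only scanning every N samples or so.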

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-16 16:33                   ` Namhyung Kim
@ 2025-05-19  6:00                     ` Dmitry Vyukov
  2025-05-20  1:43                       ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-19  6:00 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Fri, 16 May 2025 at 18:33, Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> Sorry for the delay.
>
> On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > >
> > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > >
> > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > >
> > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > >
> > > > > We may disable latency column by default in this case and show warning
> > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > records only for idle tasks and enable the latency report only if the
> > > > > data has sched-switch records.
> > > > >
> > > > > What do you think?
> > > >
> > > > Depends on what problem we are trying to solve:
> > > >
> > > > 1. Enabling latency profiling for system-wide mode.
> > > >
> > > > 2. Switch events bloating trace too much.
> > > >
> > > > 3. Lost switch events lead to imprecise accounting.
> > > >
> > > > The patch mentions all 3 :)
> > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > An active single process profile can emit more samples than a
> > > > system-wide profile on a lightly loaded system.
> > >
> > > True.  But we don't need to care about lightly loaded systems as they
> > > won't cause problems.
> > >
> > >
> > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > equally subject to the lost events problem.
> > >
> > > Right, but I'm afraid practically it'll increase the chance of lost
> > > in system-wide mode.  The default size of the sample for system-wide
> > > is 56 byte and the size of the switch is 48 byte.  And the default
> > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > >
> > > >
> > > > For problem 1: we can just permit --latency for system wide mode and
> > > > fully rely on switch events.
> > > > It's not any worse than we do now (wrt both profile size and lost events).
> > >
> > > This can be an option and it'd work well on lightly loaded systems.
> > > Maybe we can just try it first.  But I think it's better to have an
> > > option to make it work on heavily loaded systems.
> > >
> > > >
> > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > we want, then your current accounting code will work w/o any changes.
> > > > This should help wrt trace size only for system-wide mode (provided
> > > > that user already enables CPU accounting for other reasons, otherwise
> > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > switch events).
> > >
> > > I'm not sure how we can add the fake samples.  The switch events will be
> > > from the kernel and we may add the condition in the attribute.
> > >
> > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > >
> > > >
> > > > For problem 3: switches to idle task won't really help. There can be
> > > > lots of them, and missing any will lead to wrong accounting.
> > >
> > > I don't know how severe the situation will be.  On heavily loaded
> > > systems, the idle task won't run much and data size won't increase.
> > > On lightly loaded systems, increased data will likely be handled well.
> > >
> > >
> > > > A principled approach would be to attach a per-thread scheduler
> > > > quantum sequence number to each CPU sample. The sequence number would
> > > > be incremented on every context switch. Then any subset of CPU should
> > > > be enough to understand when a task was scheduled in and out
> > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > switched out on the last sample with sequence number N).
> > >
> > > I'm not sure how it can help.  We don't need the switch info itself.
> > > What's needed is when the CPU was idle, right?
> >
> > I mean the following.
> > Each sample has a TID.
> > We add a SEQ field, which is per-thread and is incremented after every
> > rescheduling of the thread.
> >
> > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > OUT event for this thread at this timestamp. When we see the first
> > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > thread at this timestamp.
> >
> > These SCHED IN/OUT events are not injected by the kernel. We just
> > pretend they happen for accounting purposes. We may actually
> > materialize them in the perf tool, or me may just update parallelism
> > as if they happen.
>
> Thanks for the explanation.  But I don't think it needs the SEQ and
> SCHED IN/OUT generated from it to track lost records.  Please see below.
>
> >
> > With this scheme we can lose absolutely any subset of samples, and
> > still get very precise accounting. When we lose samples, the profile
> > of course becomes a bit less precise, but the effect is local and
> > recoverable.
> >
> > If we lose the last/first event for (TID,SEQ), then we slightly
> > shorten/postpone the thread accounting in the process parallelism
> > level. If we lose a middle (TID,SEQ), then parallelism is not
> > affected.
>
> I'm afraid it cannot check parallelism by just seeing the current thread.
> I guess it would need information from other threads even if it has same
> SEQ.

Yes, we still count parallelism like you do in this patch, we just use
the SEQ info instead of CPU numbers and explicit switch events.

> Also postpone thread accounting can be complex.  I think it should wait
> for all other threads to get a sample.  Maybe some threads exited and
> lost too.

Yes, in order to understand what's the last event for (TID,SEQ) we
need to look ahead and find the event (TID,SEQ+1). The easiest way to
do it would be to do 2 passes over the trace. That's the cost of
saving trace space + being resilient to lost events.

Do you see any other issues with this scheme besides requiring 2 passes?
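
For concreteness, a standalone sketch of how such 2-pass (TID,SEQ)
accounting could look.  This is only a model with made-up names (struct
sample, struct quantum, pass1/pass2), not perf code, and it tracks a single
process for brevity:

  #include <stdint.h>
  #include <stdio.h>

  struct sample { uint64_t time; int tid; int seq; };
  struct quantum { int tid; int seq; uint64_t first, last; };

  #define MAX_QUANTA 1024
  static struct quantum quanta[MAX_QUANTA];
  static int nr_quanta;

  static struct quantum *find_quantum(int tid, int seq)
  {
          for (int i = 0; i < nr_quanta; i++)
                  if (quanta[i].tid == tid && quanta[i].seq == seq)
                          return &quanta[i];
          /* first time we see this (tid, seq); bounds check omitted */
          quanta[nr_quanta] = (struct quantum){ .tid = tid, .seq = seq };
          return &quanta[nr_quanta++];
  }

  /* Pass 1: remember when each (tid, seq) quantum starts and ends. */
  static void pass1(const struct sample *s, int n)
  {
          for (int i = 0; i < n; i++) {
                  struct quantum *q = find_quantum(s[i].tid, s[i].seq);

                  if (!q->first)
                          q->first = s[i].time;
                  q->last = s[i].time;
          }
  }

  /*
   * Pass 2: replay the samples in time order; a thread contributes to the
   * process parallelism from the first to the last sample of its quantum,
   * i.e. at the implied SCHED IN/OUT points.
   */
  static int parallelism;   /* single process assumed for brevity */

  static void pass2(const struct sample *s, int n)
  {
          for (int i = 0; i < n; i++) {
                  struct quantum *q = find_quantum(s[i].tid, s[i].seq);

                  if (s[i].time == q->first)      /* implied SCHED IN */
                          parallelism++;
                  printf("t=%llu tid=%d parallelism=%d\n",
                         (unsigned long long)s[i].time, s[i].tid, parallelism);
                  if (s[i].time == q->last)       /* implied SCHED OUT */
                          parallelism--;
          }
  }

  int main(void)
  {
          struct sample trace[] = {
                  { 100, 1, 0 }, { 110, 2, 0 }, { 120, 1, 0 },
                  { 200, 1, 1 }, { 210, 2, 0 },
          };
          int n = sizeof(trace) / sizeof(trace[0]);

          pass1(trace, n);
          pass2(trace, n);
          return 0;
  }

Losing a middle sample of a quantum leaves first/last untouched, so
parallelism is unaffected; losing the first/last sample only shortens or
postpones the interval, as described above.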


> In my approach, I can clear the current thread from all CPUs when it
> sees a LOST record and restart calculation of parallelism.
>
> >
> > The switches from a thread to itself is not a problem. We will just
> > inject a SCHED OUT followed by SCHED IN. But exactly the same happens
> > now when the kernel injects these events.
> >
> > But if we switch to idle task and got no samples for some period of
> > time on the CPU, then we properly inject SCHED OUT/IN that will
> > account for the thread not actually running.
>
> Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> CPU and clear the current thread after some period (2x of given freq?).
> But it may slow down perf report as it'd check all CPUs for each sample.
>
> Thanks,
> Namhyung
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-19  6:00                     ` Dmitry Vyukov
@ 2025-05-20  1:43                       ` Namhyung Kim
  2025-05-20  6:45                         ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-20  1:43 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Mon, May 19, 2025 at 08:00:49AM +0200, Dmitry Vyukov wrote:
> On Fri, 16 May 2025 at 18:33, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > Sorry for the delay.
> >
> > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > >
> > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > >
> > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > >
> > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > >
> > > > > > We may disable latency column by default in this case and show warning
> > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > data has sched-switch records.
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > Depends on what problem we are trying to solve:
> > > > >
> > > > > 1. Enabling latency profiling for system-wide mode.
> > > > >
> > > > > 2. Switch events bloating trace too much.
> > > > >
> > > > > 3. Lost switch events lead to imprecise accounting.
> > > > >
> > > > > The patch mentions all 3 :)
> > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > An active single process profile can emit more samples than a
> > > > > system-wide profile on a lightly loaded system.
> > > >
> > > > True.  But we don't need to care about lightly loaded systems as they
> > > > won't cause problems.
> > > >
> > > >
> > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > equally subject to the lost events problem.
> > > >
> > > > Right, but I'm afraid practically it'll increase the chance of lost
> > > > in system-wide mode.  The default size of the sample for system-wide
> > > > is 56 byte and the size of the switch is 48 byte.  And the default
> > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > >
> > > > >
> > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > fully rely on switch events.
> > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > >
> > > > This can be an option and it'd work well on lightly loaded systems.
> > > > Maybe we can just try it first.  But I think it's better to have an
> > > > option to make it work on heavily loaded systems.
> > > >
> > > > >
> > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > we want, then your current accounting code will work w/o any changes.
> > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > switch events).
> > > >
> > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > from the kernel and we may add the condition in the attribute.
> > > >
> > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > >
> > > > >
> > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > lots of them, and missing any will lead to wrong accounting.
> > > >
> > > > I don't know how severe the situation will be.  On heavily loaded
> > > > systems, the idle task won't run much and data size won't increase.
> > > > On lightly loaded systems, increased data will likely be handled well.
> > > >
> > > >
> > > > > A principled approach would be to attach a per-thread scheduler
> > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > be enough to understand when a task was scheduled in and out
> > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > switched out on the last sample with sequence number N).
> > > >
> > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > What's needed is when the CPU was idle, right?
> > >
> > > I mean the following.
> > > Each sample has a TID.
> > > We add a SEQ field, which is per-thread and is incremented after every
> > > rescheduling of the thread.
> > >
> > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > OUT event for this thread at this timestamp. When we see the first
> > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > thread at this timestamp.
> > >
> > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > pretend they happen for accounting purposes. We may actually
> > > materialize them in the perf tool, or me may just update parallelism
> > > as if they happen.
> >
> > Thanks for the explanation.  But I don't think it needs the SEQ and
> > SCHED IN/OUT generated from it to track lost records.  Please see below.
> >
> > >
> > > With this scheme we can lose absolutely any subset of samples, and
> > > still get very precise accounting. When we lose samples, the profile
> > > of course becomes a bit less precise, but the effect is local and
> > > recoverable.
> > >
> > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > shorten/postpone the thread accounting in the process parallelism
> > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > affected.
> >
> > I'm afraid it cannot check parallelism by just seeing the current thread.
> > I guess it would need information from other threads even if it has same
> > SEQ.
> 
> Yes, we still count parallelism like you do in this patch, we just use
> the SEQ info instead of CPU numbers and explicit switch events.

I mean, after a record is lost, let's say:

  t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
  t2: LOST
  t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)

I don't think we can continue to use a parallelism of 4 after the LOST
record, even if the sample has the same seq, because we cannot know whether
other threads were switched on other CPUs.  Then do we really need the seq?

> 
> > Also postpone thread accounting can be complex.  I think it should wait
> > for all other threads to get a sample.  Maybe some threads exited and
> > lost too.
> 
> Yes, in order to understand what's the last event for (TID,SEQ) we
> need to look ahead and find the event (TID,SEQ+1). The easiest way to
> do it would be to do 2 passes over the trace. That's the cost of
> saving trace space + being resilient to lost events.
> 
> Do you see any other issues with this scheme besides requiring 2 passes?

Well.. a 2-pass approach itself can be a problem due to the slowness it'd
bring.  Some people already complain about the speed of perf report.

I think we can simply reset the parallelism in all processes after a LOST
record and set the current thread on each CPU to the idle task.  It'll catch
up as soon as it sees samples from all CPUs again.
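
A rough sketch of what that reset could look like, as a standalone model
(the struct and function names below are made up, not the patch's code):

  struct thread_state { int parallelism; };

  struct per_cpu { struct thread_state *thread; };

  struct machine_state {
          int nr_cpus;
          struct per_cpu *cpus;
          struct thread_state *idle;
  };

  /* On PERF_RECORD_LOST: forget which thread each CPU was running. */
  static void reset_on_lost(struct machine_state *ms)
  {
          for (int cpu = 0; cpu < ms->nr_cpus; cpu++) {
                  struct per_cpu *c = &ms->cpus[cpu];

                  if (c->thread && c->thread != ms->idle)
                          c->thread->parallelism--;
                  c->thread = ms->idle;
          }
  }

Parallelism then rebuilds itself as the next sample on each CPU re-installs
its current process.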

> 
> 
> > In my approach, I can clear the current thread from all CPUs when it
> > sees a LOST record and restart calculation of parallelism.
> >
> > >
> > > The switches from a thread to itself is not a problem. We will just
> > > inject a SCHED OUT followed by SCHED IN. But exactly the same happens
> > > now when the kernel injects these events.
> > >
> > > But if we switch to idle task and got no samples for some period of
> > > time on the CPU, then we properly inject SCHED OUT/IN that will
> > > account for the thread not actually running.
> >
> > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > CPU and clear the current thread after some period (2x of given freq?).
> > But it may slow down perf report as it'd check all CPUs for each sample.
> >
> > Thanks,
> > Namhyung
> >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-20  1:43                       ` Namhyung Kim
@ 2025-05-20  6:45                         ` Dmitry Vyukov
  2025-05-20 22:50                           ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-20  6:45 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, 20 May 2025 at 03:43, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Mon, May 19, 2025 at 08:00:49AM +0200, Dmitry Vyukov wrote:
> > On Fri, 16 May 2025 at 18:33, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > Sorry for the delay.
> > >
> > > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > > >
> > > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > > >
> > > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > > >
> > > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > > >
> > > > > > > We may disable latency column by default in this case and show warning
> > > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > > data has sched-switch records.
> > > > > > >
> > > > > > > What do you think?
> > > > > >
> > > > > > Depends on what problem we are trying to solve:
> > > > > >
> > > > > > 1. Enabling latency profiling for system-wide mode.
> > > > > >
> > > > > > 2. Switch events bloating trace too much.
> > > > > >
> > > > > > 3. Lost switch events lead to imprecise accounting.
> > > > > >
> > > > > > The patch mentions all 3 :)
> > > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > > An active single process profile can emit more samples than a
> > > > > > system-wide profile on a lightly loaded system.
> > > > >
> > > > > True.  But we don't need to care about lightly loaded systems as they
> > > > > won't cause problems.
> > > > >
> > > > >
> > > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > > equally subject to the lost events problem.
> > > > >
> > > > > Right, but I'm afraid practically it'll increase the chance of lost
> > > > > in system-wide mode.  The default size of the sample for system-wide
> > > > > is 56 byte and the size of the switch is 48 byte.  And the default
> > > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > > >
> > > > > >
> > > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > > fully rely on switch events.
> > > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > > >
> > > > > This can be an option and it'd work well on lightly loaded systems.
> > > > > Maybe we can just try it first.  But I think it's better to have an
> > > > > option to make it work on heavily loaded systems.
> > > > >
> > > > > >
> > > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > > we want, then your current accounting code will work w/o any changes.
> > > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > > switch events).
> > > > >
> > > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > > from the kernel and we may add the condition in the attribute.
> > > > >
> > > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > > >
> > > > > >
> > > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > > lots of them, and missing any will lead to wrong accounting.
> > > > >
> > > > > I don't know how severe the situation will be.  On heavily loaded
> > > > > systems, the idle task won't run much and data size won't increase.
> > > > > On lightly loaded systems, increased data will likely be handled well.
> > > > >
> > > > >
> > > > > > A principled approach would be to attach a per-thread scheduler
> > > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > > be enough to understand when a task was scheduled in and out
> > > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > > switched out on the last sample with sequence number N).
> > > > >
> > > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > > What's needed is when the CPU was idle, right?
> > > >
> > > > I mean the following.
> > > > Each sample has a TID.
> > > > We add a SEQ field, which is per-thread and is incremented after every
> > > > rescheduling of the thread.
> > > >
> > > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > > OUT event for this thread at this timestamp. When we see the first
> > > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > > thread at this timestamp.
> > > >
> > > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > > pretend they happen for accounting purposes. We may actually
> > > > materialize them in the perf tool, or me may just update parallelism
> > > > as if they happen.
> > >
> > > Thanks for the explanation.  But I don't think it needs the SEQ and
> > > SCHED IN/OUT generated from it to track lost records.  Please see below.
> > >
> > > >
> > > > With this scheme we can lose absolutely any subset of samples, and
> > > > still get very precise accounting. When we lose samples, the profile
> > > > of course becomes a bit less precise, but the effect is local and
> > > > recoverable.
> > > >
> > > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > > shorten/postpone the thread accounting in the process parallelism
> > > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > > affected.
> > >
> > > I'm afraid it cannot check parallelism by just seeing the current thread.
> > > I guess it would need information from other threads even if it has same
> > > SEQ.
> >
> > Yes, we still count parallelism like you do in this patch, we just use
> > the SEQ info instead of CPU numbers and explicit switch events.
>
> I mean after record lost, let's say
>
>   t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
>   t2: LOST
>   t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)
>
> I don't think we can continue to use parallelism of 4 after LOST even if
> it has the same seq because it cannot know if other threads switched on
> other CPUs.  Then do we need really the seq?

I do not understand the problem you describe.
We just keep updating parallelism according to the algorithm I
described. It works fine in the presence of lost events.


> > > Also postpone thread accounting can be complex.  I think it should wait
> > > for all other threads to get a sample.  Maybe some threads exited and
> > > lost too.
> >
> > Yes, in order to understand what's the last event for (TID,SEQ) we
> > need to look ahead and find the event (TID,SEQ+1). The easiest way to
> > do it would be to do 2 passes over the trace. That's the cost of
> > saving trace space + being resilient to lost events.
> >
> > Do you see any other issues with this scheme besides requiring 2 passes?
>
> Well.. 2 pass itself can be a problem due to slowness it'd bring.  Some
> people complain about the speed of perf report as of now.

Is trace processing CPU-bound or memory-bound? If it's CPU-bound, then
the second pass may be OK-ish, since we will need minimal CPU
processing during the first pass.


> I think we can simply reset the parallelism in all processes after LOST
> and set current process to the idle task.  It'll catch up as soon as it
> sees samples from all CPUs.

I guess we can approximate parallelism as you described here:

> Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> CPU and clear the current thread after some period (2x of given freq?).

We probably don't need to do anything special for lost events in this
scheme at all. If the gap caused by lost events is tiny, then we
consider nothing happened. If the gap is large enough, then we
consider the CPU as idle for the duration of the gap. Either way it
will be handled on common grounds.

But tuning of these heuristics + testing and verification may be a bit
of a problem. I would hate to end up with a tool which I won't trust.

Here:
"after some period (2x of given freq?)"
do you mean 2x average/median period, or 1/2 average/median period?
(2x freq is 1/2 period)

Ideally, we would consider a CPU idle 1/2 period after it switched to the
idle task and we stopped receiving samples.
But on the other hand, we don't want to consider it as constantly becoming
idle when it's just doing normal sampling with the normal period...

So ideally the algorithm should be something like:
let's say average/median sampling period is P
we got last sample for CPU X at time T
if by time T+2P we have not seen any other sample on CPU X, then
consider CPU X idle since T+0.5P

But this would also require either 2 passes over the data, or some
kind of look ahead similar to the algo I proposed...
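
To make the shape of that heuristic concrete, here is a standalone sketch
(names made up) that assumes the sample timestamps of one CPU are already
available in order, i.e. after a first pass or with some look-ahead
buffering:

  #include <stdint.h>
  #include <stdio.h>

  static void emit_idle_intervals(const uint64_t *t, int n, uint64_t period)
  {
          for (int i = 0; i + 1 < n; i++) {
                  /* No sample for more than 2 periods: consider the CPU
                   * idle from T+0.5P until the next sample arrives. */
                  if (t[i + 1] - t[i] > 2 * period)
                          printf("idle: [%llu, %llu)\n",
                                 (unsigned long long)(t[i] + period / 2),
                                 (unsigned long long)t[i + 1]);
          }
  }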

Also, do we take the median period? or average? do we update it over
time (say, if CPU freq changes)? do we count it globally, or per CPU
(in case CPUs run at different freqs)?

> > > In my approach, I can clear the current thread from all CPUs when it
> > > sees a LOST record and restart calculation of parallelism.
> > >
> > > >
> > > > The switches from a thread to itself is not a problem. We will just
> > > > inject a SCHED OUT followed by SCHED IN. But exactly the same happens
> > > > now when the kernel injects these events.
> > > >
> > > > But if we switch to idle task and got no samples for some period of
> > > > time on the CPU, then we properly inject SCHED OUT/IN that will
> > > > account for the thread not actually running.
> > >
> > > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > > CPU and clear the current thread after some period (2x of given freq?).
> > > But it may slow down perf report as it'd check all CPUs for each sample.
> > >
> > > Thanks,
> > > Namhyung
> > >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-20  6:45                         ` Dmitry Vyukov
@ 2025-05-20 22:50                           ` Namhyung Kim
  2025-05-21  7:30                             ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-20 22:50 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Tue, May 20, 2025 at 08:45:51AM +0200, Dmitry Vyukov wrote:
> On Tue, 20 May 2025 at 03:43, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Mon, May 19, 2025 at 08:00:49AM +0200, Dmitry Vyukov wrote:
> > > On Fri, 16 May 2025 at 18:33, Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Sorry for the delay.
> > > >
> > > > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > > > >
> > > > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > > > >
> > > > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > > > >
> > > > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > > > >
> > > > > > > > We may disable latency column by default in this case and show warning
> > > > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > > > data has sched-switch records.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > >
> > > > > > > Depends on what problem we are trying to solve:
> > > > > > >
> > > > > > > 1. Enabling latency profiling for system-wide mode.
> > > > > > >
> > > > > > > 2. Switch events bloating trace too much.
> > > > > > >
> > > > > > > 3. Lost switch events lead to imprecise accounting.
> > > > > > >
> > > > > > > The patch mentions all 3 :)
> > > > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > > > An active single process profile can emit more samples than a
> > > > > > > system-wide profile on a lightly loaded system.
> > > > > >
> > > > > > True.  But we don't need to care about lightly loaded systems as they
> > > > > > won't cause problems.
> > > > > >
> > > > > >
> > > > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > > > equally subject to the lost events problem.
> > > > > >
> > > > > > Right, but I'm afraid practically it'll increase the chance of lost
> > > > > > in system-wide mode.  The default size of the sample for system-wide
> > > > > > is 56 byte and the size of the switch is 48 byte.  And the default
> > > > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > > > >
> > > > > > >
> > > > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > > > fully rely on switch events.
> > > > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > > > >
> > > > > > This can be an option and it'd work well on lightly loaded systems.
> > > > > > Maybe we can just try it first.  But I think it's better to have an
> > > > > > option to make it work on heavily loaded systems.
> > > > > >
> > > > > > >
> > > > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > > > we want, then your current accounting code will work w/o any changes.
> > > > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > > > switch events).
> > > > > >
> > > > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > > > from the kernel and we may add the condition in the attribute.
> > > > > >
> > > > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > > > >
> > > > > > >
> > > > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > > > lots of them, and missing any will lead to wrong accounting.
> > > > > >
> > > > > > I don't know how severe the situation will be.  On heavily loaded
> > > > > > systems, the idle task won't run much and data size won't increase.
> > > > > > On lightly loaded systems, increased data will likely be handled well.
> > > > > >
> > > > > >
> > > > > > > A principled approach would be to attach a per-thread scheduler
> > > > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > > > be enough to understand when a task was scheduled in and out
> > > > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > > > switched out on the last sample with sequence number N).
> > > > > >
> > > > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > > > What's needed is when the CPU was idle, right?
> > > > >
> > > > > I mean the following.
> > > > > Each sample has a TID.
> > > > > We add a SEQ field, which is per-thread and is incremented after every
> > > > > rescheduling of the thread.
> > > > >
> > > > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > > > OUT event for this thread at this timestamp. When we see the first
> > > > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > > > thread at this timestamp.
> > > > >
> > > > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > > > pretend they happen for accounting purposes. We may actually
> > > > > materialize them in the perf tool, or me may just update parallelism
> > > > > as if they happen.
> > > >
> > > > Thanks for the explanation.  But I don't think it needs the SEQ and
> > > > SCHED IN/OUT generated from it to track lost records.  Please see below.
> > > >
> > > > >
> > > > > With this scheme we can lose absolutely any subset of samples, and
> > > > > still get very precise accounting. When we lose samples, the profile
> > > > > of course becomes a bit less precise, but the effect is local and
> > > > > recoverable.
> > > > >
> > > > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > > > shorten/postpone the thread accounting in the process parallelism
> > > > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > > > affected.
> > > >
> > > > I'm afraid it cannot check parallelism by just seeing the current thread.
> > > > I guess it would need information from other threads even if it has same
> > > > SEQ.
> > >
> > > Yes, we still count parallelism like you do in this patch, we just use
> > > the SEQ info instead of CPU numbers and explicit switch events.
> >
> > I mean after record lost, let's say
> >
> >   t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
> >   t2: LOST
> >   t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)
> >
> > I don't think we can continue to use parallelism of 4 after LOST even if
> > it has the same seq because it cannot know if other threads switched on
> > other CPUs.  Then do we need really the seq?
> 
> I do not understand the problem you describe.
> We just keep updating parallelism according to the algorithm I
> described. It works fine in the presence of lost events.

Do you think it's ok to use 4 if seq is the same?  I'm afraid it'd be
inaccurate.

> 
> 
> > > > Also postpone thread accounting can be complex.  I think it should wait
> > > > for all other threads to get a sample.  Maybe some threads exited and
> > > > lost too.
> > >
> > > Yes, in order to understand what's the last event for (TID,SEQ) we
> > > need to look ahead and find the event (TID,SEQ+1). The easiest way to
> > > do it would be to do 2 passes over the trace. That's the cost of
> > > saving trace space + being resilient to lost events.
> > >
> > > Do you see any other issues with this scheme besides requiring 2 passes?
> >
> > Well.. 2 pass itself can be a problem due to slowness it'd bring.  Some
> > people complain about the speed of perf report as of now.
> 
> Is trace processing CPU-bound or memory-bound? If it's CPU-bound, then
> the second pass may be OK-ish, since we will need minimal CPU
> processing during the first pass.

It depends on the size of data, but I guess it's CPU-bound in most cases.

> 
> 
> > I think we can simply reset the parallelism in all processes after LOST
> > and set current process to the idle task.  It'll catch up as soon as it
> > sees samples from all CPUs.
> 
> I guess we can approximate parallelism as you described here:
> 
> > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > CPU and clear the current thread after some period (2x of given freq?).
> 
> We probably don't need to do anything special for lost events in this
> scheme at all. If the gap caused by lost events is tiny, then we
> consider nothing happened. If the gap is large enough, then we
> consider the CPU as idle for the duration of the gap. Either way it
> will be handled on common grounds.

How do you know if it's tiny?  Do you mean the seq remains the same after
the lost records?

> 
> But tuning of these heuristics + testing and verification may be a bit
> of a problem. I would hate to end up with a tool which I won't trust.
> 
> Here:
> "after some period (2x of given freq?)"
> do you mean 2x average/median period, or 1/2 average/median period?
> (2x freq is 1/2 period)

Oh, sorry.  It's 2x period.

> 
> Ideally, we consider a CPU idle after 1/2 period after it switched to
> the idle task and we stop receiving samples.
> But on the other hand, we don't want to consider it constantly
> becoming idle, when it's just doing normal sampling with the normal
> period...
> 
> So ideally the algorithm should be something like:
> let's say average/median sampling period is P
> we got last sample for CPU X at time T
> if by time T+2P we have not seen any other sample on CPU X, then
> consider CPU X idle since T+0.5P
> 
> But this would also require either 2 passes over the data, or some
> kind of look ahead similar to the algo I proposed...

I think we can do it in 1 pass.  For each sample,

  /* Expire CPUs that have seen no sample for 2 sampling periods:
   * treat them as if they switched to the idle task. */
  for_each_cpu(cpu) {
      if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
          if (current[cpu]->thread != idle) {
              current[cpu]->thread->parallelism--;
              current[cpu]->thread = idle;
          }
      }
  }

  /* Parallelism is tracked on the process (thread group) leader. */
  leader = machine__findnew_thread(machine, sample->pid);
  current[sample->cpu]->last_timestamp = sample->timestamp;

  /* Implied context switch on this CPU: the sampled process changed. */
  if (current[sample->cpu]->thread != leader) {
      if (current[sample->cpu]->thread != idle)
          current[sample->cpu]->thread->parallelism--;

      current[sample->cpu]->thread = leader;
      leader->parallelism++;
  }

  sample->parallelism = leader->parallelism;

> 
> Also, do we take the median period? or average? do we update it over
> time (say, if CPU freq changes)? do we count it globally, or per CPU
> (in case CPUs run at different freqs)?

Oh, perf tools use a default frequency of 4000 Hz.  Maybe we can use this
only for the frequency mode, which means the user didn't use the -c option
or similar in the event description.
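
For the frequency-mode check, a minimal sketch based on the UAPI
perf_event_attr (attr.freq is set when a frequency like -F 4000 is used
instead of a fixed -c period); how this would be wired into perf report is
left out:

  #include <linux/perf_event.h>
  #include <stdbool.h>

  /* Only apply the sampling-frequency heuristic when the event was opened
   * in frequency mode (-F / default), not with a fixed period (-c). */
  static bool attr_uses_freq_mode(const struct perf_event_attr *attr)
  {
          return attr->freq != 0;
  }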

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-20 22:50                           ` Namhyung Kim
@ 2025-05-21  7:30                             ` Dmitry Vyukov
  2025-05-27  7:14                               ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-21  7:30 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Wed, 21 May 2025 at 00:50, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > Hello,
> > > > >
> > > > > Sorry for the delay.
> > > > >
> > > > > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > > > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > >
> > > > > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > > > > >
> > > > > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > > > > >
> > > > > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > > > > >
> > > > > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > > > > >
> > > > > > > > > We may disable latency column by default in this case and show warning
> > > > > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > > > > data has sched-switch records.
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Depends on what problem we are trying to solve:
> > > > > > > >
> > > > > > > > 1. Enabling latency profiling for system-wide mode.
> > > > > > > >
> > > > > > > > 2. Switch events bloating trace too much.
> > > > > > > >
> > > > > > > > 3. Lost switch events lead to imprecise accounting.
> > > > > > > >
> > > > > > > > The patch mentions all 3 :)
> > > > > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > > > > An active single process profile can emit more samples than a
> > > > > > > > system-wide profile on a lightly loaded system.
> > > > > > >
> > > > > > > True.  But we don't need to care about lightly loaded systems as they
> > > > > > > won't cause problems.
> > > > > > >
> > > > > > >
> > > > > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > > > > equally subject to the lost events problem.
> > > > > > >
> > > > > > > Right, but I'm afraid practically it'll increase the chance of lost
> > > > > > > in system-wide mode.  The default size of the sample for system-wide
> > > > > > > is 56 byte and the size of the switch is 48 byte.  And the default
> > > > > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > > > > >
> > > > > > > >
> > > > > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > > > > fully rely on switch events.
> > > > > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > > > > >
> > > > > > > This can be an option and it'd work well on lightly loaded systems.
> > > > > > > Maybe we can just try it first.  But I think it's better to have an
> > > > > > > option to make it work on heavily loaded systems.
> > > > > > >
> > > > > > > >
> > > > > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > > > > we want, then your current accounting code will work w/o any changes.
> > > > > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > > > > switch events).
> > > > > > >
> > > > > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > > > > from the kernel and we may add the condition in the attribute.
> > > > > > >
> > > > > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > > > > >
> > > > > > > >
> > > > > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > > > > lots of them, and missing any will lead to wrong accounting.
> > > > > > >
> > > > > > > I don't know how severe the situation will be.  On heavily loaded
> > > > > > > systems, the idle task won't run much and data size won't increase.
> > > > > > > On lightly loaded systems, increased data will likely be handled well.
> > > > > > >
> > > > > > >
> > > > > > > > A principled approach would be to attach a per-thread scheduler
> > > > > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > > > > be enough to understand when a task was scheduled in and out
> > > > > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > > > > switched out on the last sample with sequence number N).
> > > > > > >
> > > > > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > > > > What's needed is when the CPU was idle, right?
> > > > > >
> > > > > > I mean the following.
> > > > > > Each sample has a TID.
> > > > > > We add a SEQ field, which is per-thread and is incremented after every
> > > > > > rescheduling of the thread.
> > > > > >
> > > > > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > > > > OUT event for this thread at this timestamp. When we see the first
> > > > > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > > > > thread at this timestamp.
> > > > > >
> > > > > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > > > > pretend they happen for accounting purposes. We may actually
> > > > > > materialize them in the perf tool, or me may just update parallelism
> > > > > > as if they happen.
> > > > >
> > > > > Thanks for the explanation.  But I don't think it needs the SEQ and
> > > > > SCHED IN/OUT generated from it to track lost records.  Please see below.
> > > > >
> > > > > >
> > > > > > With this scheme we can lose absolutely any subset of samples, and
> > > > > > still get very precise accounting. When we lose samples, the profile
> > > > > > of course becomes a bit less precise, but the effect is local and
> > > > > > recoverable.
> > > > > >
> > > > > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > > > > shorten/postpone the thread accounting in the process parallelism
> > > > > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > > > > affected.
> > > > >
> > > > > I'm afraid it cannot check parallelism by just seeing the current thread.
> > > > > I guess it would need information from other threads even if it has same
> > > > > SEQ.
> > > >
> > > > Yes, we still count parallelism like you do in this patch, we just use
> > > > the SEQ info instead of CPU numbers and explicit switch events.
> > >
> > > I mean after record lost, let's say
> > >
> > >   t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
> > >   t2: LOST
> > >   t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)
> > >
> > > I don't think we can continue to use parallelism of 4 after LOST even if
> > > it has the same seq because it cannot know if other threads switched on
> > > other CPUs.  Then do we need really the seq?
> >
> > I do not understand the problem you describe.
> > We just keep updating parallelism according to the algorithm I
> > described. It works fine in the presence of lost events.
>
> Do you think it's ok to use 4 if seq is the same?  I'm afraid it'd be
> inaccurate.

It will be inaccurate briefly for the period of 1 sample if we lost
specifically the last/first sample of a thread scheduling quantum. And
then it will recover and will be precise.

But it's exactly the same in your scheme, right?  If we stop seeing events
on a CPU due to lost events, we will assume it's not running.

And generally lost events will always lead to imprecision. That's
unavoidable. It's only important if the imprecision is limited and
proportional to the number of lost events. And this is the case for
the SEQ scheme.


> > > > > Also postpone thread accounting can be complex.  I think it should wait
> > > > > for all other threads to get a sample.  Maybe some threads exited and
> > > > > lost too.
> > > >
> > > > Yes, in order to understand what's the last event for (TID,SEQ) we
> > > > need to look ahead and find the event (TID,SEQ+1). The easiest way to
> > > > do it would be to do 2 passes over the trace. That's the cost of
> > > > saving trace space + being resilient to lost events.
> > > >
> > > > Do you see any other issues with this scheme besides requiring 2 passes?
> > >
> > > Well.. 2 pass itself can be a problem due to slowness it'd bring.  Some
> > > people complain about the speed of perf report as of now.
> >
> > Is trace processing CPU-bound or memory-bound? If it's CPU-bound, then
> > the second pass may be OK-ish, since we will need minimal CPU
> > processing during the first pass.
>
> It depends on the size of data, but I guess it's CPU-bound in most cases.
>
> >
> >
> > > I think we can simply reset the parallelism in all processes after LOST
> > > and set current process to the idle task.  It'll catch up as soon as it
> > > sees samples from all CPUs.
> >
> > I guess we can approximate parallelism as you described here:
> >
> > > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > > CPU and clear the current thread after some period (2x of given freq?).
> >
> > We probably don't need to do anything special for lost events in this
> > scheme at all. If the gap caused by lost events is tiny, then we
> > consider nothing happened. If the gap is large enough, then we
> > consider the CPU as idle for the duration of the gap. Either way it
> > will be handled on common grounds.
>
> How do you know if it's tiny?  Do you mean the seq remains after lost?

I was talking about your scheme based on CPU numbers.


> > But tuning of these heuristics + testing and verification may be a bit
> > of a problem. I would hate to end up with a tool which I won't trust.
> >
> > Here:
> > "after some period (2x of given freq?)"
> > do you mean 2x average/median period, or 1/2 average/median period?
> > (2x freq is 1/2 period)
>
> Oh, sorry.  It's 2x period.
>
> >
> > Ideally, we consider a CPU idle after 1/2 period after it switched to
> > the idle task and we stop receiving samples.
> > But on the other hand, we don't want to consider it constantly
> > becoming idle, when it's just doing normal sampling with the normal
> > period...
> >
> > So ideally the algorithm should be something like:
> > let's say average/median sampling period is P
> > we got last sample for CPU X at time T
> > if by time T+2P we have not seen any other sample on CPU X, then
> > consider CPU X idle since T+0.5P
> >
> > But this would also require either 2 passes over the data, or some
> > kind of look ahead similar to the algo I proposed...
>
> I think we can do it in 1 pass.  For each sample,
>
>   for_each_cpu(cpu) {
>       if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
>           if (current[cpu]->thread != idle) {
>               current[cpu]->thread->parallelism--;
>               current[cpu]->thread = idle;
>           }
>       }
>   }
>
>   leader = machine__findnew_thread(machine, sample->pid);
>   current[sample->cpu]->last_timestamp = sample->timestamp;
>
>   if (current[sample->cpu]->thread != leader) {
>       if (current[sample->cpu]->thread != idle)
>           current[sample->cpu]->thread->parallelism--;
>
>       current[sample->cpu]->thread = leader;
>       leader->parallelism++;
>   }
>
>   sample->parallelism = leader->parallelism;

As I described, for this simple 1-pass algorithm I am afraid of imprecision:
the thread has not continued to run for 2 periods, it ran for 0-1 period.



> > Also, do we take the median period? or average? do we update it over
> > time (say, if CPU freq changes)? do we count it globally, or per CPU
> > (in case CPUs run at different freqs)?
>
> Oh, perf tools use default frequency of 4000 Hz.

Is the actual sample rate reasonably precise across time/CPUs?

> Maybe we can use this
> only for the frequency mode which means user didn't use -c option or
> similar in the event description.
>
> Thanks,
> Namhyung
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-21  7:30                             ` Dmitry Vyukov
@ 2025-05-27  7:14                               ` Dmitry Vyukov
  2025-05-28 18:38                                 ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-27  7:14 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Wed, 21 May 2025 at 09:30, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Wed, 21 May 2025 at 00:50, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > Sorry for the delay.
> > > > > >
> > > > > > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > > > > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > > > > > >
> > > > > > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > > > > > >
> > > > > > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > > > > > >
> > > > > > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > > > > > >
> > > > > > > > > > We may disable latency column by default in this case and show warning
> > > > > > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > > > > > data has sched-switch records.
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Depends on what problem we are trying to solve:
> > > > > > > > >
> > > > > > > > > 1. Enabling latency profiling for system-wide mode.
> > > > > > > > >
> > > > > > > > > 2. Switch events bloating trace too much.
> > > > > > > > >
> > > > > > > > > 3. Lost switch events lead to imprecise accounting.
> > > > > > > > >
> > > > > > > > > The patch mentions all 3 :)
> > > > > > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > > > > > An active single process profile can emit more samples than a
> > > > > > > > > system-wide profile on a lightly loaded system.
> > > > > > > >
> > > > > > > > True.  But we don't need to care about lightly loaded systems as they
> > > > > > > > won't cause problems.
> > > > > > > >
> > > > > > > >
> > > > > > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > > > > > equally subject to the lost events problem.
> > > > > > > >
> > > > > > > > Right, but I'm afraid practically it'll increase the chance of lost
> > > > > > > > in system-wide mode.  The default size of the sample for system-wide
> > > > > > > > is 56 byte and the size of the switch is 48 byte.  And the default
> > > > > > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > > > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > > > > > fully rely on switch events.
> > > > > > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > > > > > >
> > > > > > > > This can be an option and it'd work well on lightly loaded systems.
> > > > > > > > Maybe we can just try it first.  But I think it's better to have an
> > > > > > > > option to make it work on heavily loaded systems.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > > > > > we want, then your current accounting code will work w/o any changes.
> > > > > > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > > > > > switch events).
> > > > > > > >
> > > > > > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > > > > > from the kernel and we may add the condition in the attribute.
> > > > > > > >
> > > > > > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > > > > > lots of them, and missing any will lead to wrong accounting.
> > > > > > > >
> > > > > > > > I don't know how severe the situation will be.  On heavily loaded
> > > > > > > > systems, the idle task won't run much and data size won't increase.
> > > > > > > > On lightly loaded systems, increased data will likely be handled well.
> > > > > > > >
> > > > > > > >
> > > > > > > > > A principled approach would be to attach a per-thread scheduler
> > > > > > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > > > > > be enough to understand when a task was scheduled in and out
> > > > > > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > > > > > switched out on the last sample with sequence number N).
> > > > > > > >
> > > > > > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > > > > > What's needed is when the CPU was idle, right?
> > > > > > >
> > > > > > > I mean the following.
> > > > > > > Each sample has a TID.
> > > > > > > We add a SEQ field, which is per-thread and is incremented after every
> > > > > > > rescheduling of the thread.
> > > > > > >
> > > > > > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > > > > > OUT event for this thread at this timestamp. When we see the first
> > > > > > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > > > > > thread at this timestamp.
> > > > > > >
> > > > > > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > > > > > pretend they happen for accounting purposes. We may actually
> > > > > > > materialize them in the perf tool, or we may just update parallelism
> > > > > > > as if they happen.
> > > > > >
> > > > > > Thanks for the explanation.  But I don't think it needs the SEQ and
> > > > > > SCHED IN/OUT generated from it to track lost records.  Please see below.
> > > > > >
> > > > > > >
> > > > > > > With this scheme we can lose absolutely any subset of samples, and
> > > > > > > still get very precise accounting. When we lose samples, the profile
> > > > > > > of course becomes a bit less precise, but the effect is local and
> > > > > > > recoverable.
> > > > > > >
> > > > > > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > > > > > shorten/postpone the thread accounting in the process parallelism
> > > > > > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > > > > > affected.
> > > > > >
> > > > > > I'm afraid it cannot check parallelism by just seeing the current thread.
> > > > > > I guess it would need information from other threads even if it has same
> > > > > > SEQ.
> > > > >
> > > > > Yes, we still count parallelism like you do in this patch, we just use
> > > > > the SEQ info instead of CPU numbers and explicit switch events.
> > > >
> > > > I mean after records are lost, let's say
> > > >
> > > >   t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
> > > >   t2: LOST
> > > >   t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)
> > > >
> > > > I don't think we can continue to use parallelism of 4 after LOST even if
> > > > it has the same seq because it cannot know if other threads switched on
> > > > other CPUs.  Then do we really need the seq?
> > >
> > > I do not understand the problem you describe.
> > > We just keep updating parallelism according to the algorithm I
> > > described. It works fine in the presence of lost events.
> >
> > Do you think it's ok to use 4 if seq is the same?  I'm afraid it'd be
> > inaccurate.
>
> It will be inaccurate briefly for the period of 1 sample if we lost
> specifically the last/first sample of a thread scheduling quantum. And
> then it will recover and will be precise.
>
> But it's exactly the same in your scheme, right? If we stopped seeing
> events on a CPU due to lost events, we will assume it's not running.
>
> And generally lost events will always lead to imprecision. That's
> unavoidable. It's only important if the imprecision is limited and
> proportional to the number of lost events. And this is the case for
> the SEQ scheme.
>
>
> > > > > > Also, postponing thread accounting can be complex.  I think it should wait
> > > > > > for all other threads to get a sample.  Maybe some threads exited and
> > > > > > their records were lost too.
> > > > >
> > > > > Yes, in order to understand what's the last event for (TID,SEQ) we
> > > > > need to look ahead and find the event (TID,SEQ+1). The easiest way to
> > > > > do it would be to do 2 passes over the trace. That's the cost of
> > > > > saving trace space + being resilient to lost events.
> > > > >
> > > > > Do you see any other issues with this scheme besides requiring 2 passes?
> > > >
> > > > Well.. a 2-pass approach itself can be a problem due to the slowness it'd
> > > > bring.  Some people already complain about the speed of perf report.
> > >
> > > Is trace processing CPU-bound or memory-bound? If it's CPU-bound, then
> > > the second pass may be OK-ish, since we will need minimal CPU
> > > processing during the first pass.
> >
> > It depends on the size of data, but I guess it's CPU-bound in most cases.
> >
> > >
> > >
> > > > I think we can simply reset the parallelism in all processes after LOST
> > > > and set current process to the idle task.  It'll catch up as soon as it
> > > > sees samples from all CPUs.
> > >
> > > I guess we can approximate parallelism as you described here:
> > >
> > > > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > > > CPU and clear the current thread after some period (2x of given freq?).
> > >
> > > We probably don't need to do anything special for lost events in this
> > > scheme at all. If the gap caused by lost events is tiny, then we
> > > consider nothing happened. If the gap is large enough, then we
> > > consider the CPU as idle for the duration of the gap. Either way it
> > > will be handled on common grounds.
> >
> > How do you know if it's tiny?  Do you mean the seq remains after lost?
>
> I was talking about your scheme based on CPU numbers.
>
>
> > > But tuning of these heuristics + testing and verification may be a bit
> > > of a problem. I would hate to end up with a tool which I won't trust.
> > >
> > > Here:
> > > "after some period (2x of given freq?)"
> > > do you mean 2x average/median period, or 1/2 average/median period?
> > > (2x freq is 1/2 period)
> >
> > Oh, sorry.  It's 2x period.
> >
> > >
> > > Ideally, we consider a CPU idle after 1/2 period after it switched to
> > > the idle task and we stop receiving samples.
> > > But on the other hand, we don't want to consider it constantly
> > > becoming idle, when it's just doing normal sampling with the normal
> > > period...
> > >
> > > So ideally the algorithm should be something like:
> > > let's say average/median sampling period is P
> > > we got last sample for CPU X at time T
> > > if by time T+2P we have not seen any other sample on CPU X, then
> > > consider CPU X idle since T+0.5P
> > >
> > > But this would also require either 2 passes over the data, or some
> > > kind of look ahead similar to the algo I proposed...
> >
> > I think we can do it in 1 pass.  For each sample,
> >
> >   for_each_cpu(cpu) {
> >       if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
> >           if (current[cpu]->thread != idle) {
> >               current[cpu]->thread->parallelism--;
> >               current[cpu]->thread = idle;
> >           }
> >       }
> >   }
> >
> >   leader = machine__findnew_thread(machine, sample->pid);
> >   current[sample->cpu]->last_timestamp = sample->timestamp;
> >
> >   if (current[sample->cpu]->thread != leader) {
> >       if (current[sample->cpu]->thread != idle)
> >           current[sample->cpu]->thread->parallelism--;
> >
> >       current[sample->cpu]->thread = leader;
> >       leader->parallelism++;
> >   }
> >
> >   sample->parallelism = leader->parallelism;
>
> As I described, for this simple 1 pass algorithm, I am afraid of imprecision.
> The thread has not continued to run for 2 periods; it ran for 0-1 periods.
>
>
>
> > > Also, do we take the median period? or average? do we update it over
> > > time (say, if CPU freq changes)? do we count it globally, or per CPU
> > > (in case CPUs run at different freqs)?
> >
> > Oh, perf tools use default frequency of 4000 Hz.
>
> Is the actual sample rate reasonably precise across time/CPUs?
>
> > Maybe we can use this
> > only for the frequency mode which means user didn't use -c option or
> > similar in the event description.


All-in-all I think the best option for now is using CPU IDs to track
parallelism as you suggested, but be more precise with idle detection.
2 passes over the trace may be fine to detect idle points. I see the
most time now spent in hist_entry__cmp, which accesses other entries
and is effectively part of the O(N*logN) processing, so a simple O(N) pass
shouldn't slow it down much.
That's what I would try. But I would also try to assess the precision
of this approach by comparing with results of using explicit switch
events.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-27  7:14                               ` Dmitry Vyukov
@ 2025-05-28 18:38                                 ` Namhyung Kim
  2025-05-30  5:50                                   ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-28 18:38 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

Hello,

On Tue, May 27, 2025 at 09:14:34AM +0200, Dmitry Vyukov wrote:
> On Wed, 21 May 2025 at 09:30, Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > On Wed, 21 May 2025 at 00:50, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > Sorry for the delay.
> > > > > > >
> > > > > > > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > > > > > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > > > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > > > > > > >
> > > > > > > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > > > > > > >
> > > > > > > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > > > > > > >
> > > > > > > > > > > We may disable latency column by default in this case and show warning
> > > > > > > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > > > > > > data has sched-switch records.
> > > > > > > > > > >
> > > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Depends on what problem we are trying to solve:
> > > > > > > > > >
> > > > > > > > > > 1. Enabling latency profiling for system-wide mode.
> > > > > > > > > >
> > > > > > > > > > 2. Switch events bloating trace too much.
> > > > > > > > > >
> > > > > > > > > > 3. Lost switch events lead to imprecise accounting.
> > > > > > > > > >
> > > > > > > > > > The patch mentions all 3 :)
> > > > > > > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > > > > > > An active single process profile can emit more samples than a
> > > > > > > > > > system-wide profile on a lightly loaded system.
> > > > > > > > >
> > > > > > > > > True.  But we don't need to care about lightly loaded systems as they
> > > > > > > > > won't cause problems.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > > > > > > equally subject to the lost events problem.
> > > > > > > > >
> > > > > > > > > Right, but I'm afraid that in practice it'll increase the chance of losing
> > > > > > > > > records in system-wide mode.  The default size of a sample in system-wide
> > > > > > > > > mode is 56 bytes and the size of a switch record is 48 bytes.  And the default
> > > > > > > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > > > > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > > > > > > fully rely on switch events.
> > > > > > > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > > > > > > >
> > > > > > > > > This can be an option and it'd work well on lightly loaded systems.
> > > > > > > > > Maybe we can just try it first.  But I think it's better to have an
> > > > > > > > > option to make it work on heavily loaded systems.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > > > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > > > > > > we want, then your current accounting code will work w/o any changes.
> > > > > > > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > > > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > > > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > > > > > > switch events).
> > > > > > > > >
> > > > > > > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > > > > > > from the kernel and we may add the condition in the attribute.
> > > > > > > > >
> > > > > > > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > > > > > > lots of them, and missing any will lead to wrong accounting.
> > > > > > > > >
> > > > > > > > > I don't know how severe the situation will be.  On heavily loaded
> > > > > > > > > systems, the idle task won't run much and data size won't increase.
> > > > > > > > > On lightly loaded systems, increased data will likely be handled well.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > A principled approach would be to attach a per-thread scheduler
> > > > > > > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > > > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > > > > > > be enough to understand when a task was scheduled in and out
> > > > > > > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > > > > > > switched out on the last sample with sequence number N).
> > > > > > > > >
> > > > > > > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > > > > > > What's needed is when the CPU was idle, right?
> > > > > > > >
> > > > > > > > I mean the following.
> > > > > > > > Each sample has a TID.
> > > > > > > > We add a SEQ field, which is per-thread and is incremented after every
> > > > > > > > rescheduling of the thread.
> > > > > > > >
> > > > > > > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > > > > > > OUT event for this thread at this timestamp. When we see the first
> > > > > > > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > > > > > > thread at this timestamp.
> > > > > > > >
> > > > > > > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > > > > > > pretend they happen for accounting purposes. We may actually
> > > > > > > > materialize them in the perf tool, or we may just update parallelism
> > > > > > > > as if they happen.
> > > > > > >
> > > > > > > Thanks for the explanation.  But I don't think it needs the SEQ and
> > > > > > > SCHED IN/OUT generated from it to track lost records.  Please see below.
> > > > > > >
> > > > > > > >
> > > > > > > > With this scheme we can lose absolutely any subset of samples, and
> > > > > > > > still get very precise accounting. When we lose samples, the profile
> > > > > > > > of course becomes a bit less precise, but the effect is local and
> > > > > > > > recoverable.
> > > > > > > >
> > > > > > > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > > > > > > shorten/postpone the thread accounting in the process parallelism
> > > > > > > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > > > > > > affected.
> > > > > > >
> > > > > > > I'm afraid it cannot check parallelism by just seeing the current thread.
> > > > > > > I guess it would need information from other threads even if it has same
> > > > > > > SEQ.
> > > > > >
> > > > > > Yes, we still count parallelism like you do in this patch, we just use
> > > > > > the SEQ info instead of CPU numbers and explicit switch events.
> > > > >
> > > > > I mean after records are lost, let's say
> > > > >
> > > > >   t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
> > > > >   t2: LOST
> > > > >   t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)
> > > > >
> > > > > I don't think we can continue to use parallelism of 4 after LOST even if
> > > > > it has the same seq because it cannot know if other threads switched on
> > > > > other CPUs.  Then do we really need the seq?
> > > >
> > > > I do not understand the problem you describe.
> > > > We just keep updating parallelism according to the algorithm I
> > > > described. It works fine in the presence of lost events.
> > >
> > > Do you think it's ok to use 4 if seq is the same?  I'm afraid it'd be
> > > inaccurate.
> >
> > It will be inaccurate briefly for the period of 1 sample if we lost
> > specifically the last/first sample of a thread scheduling quantum. And
> > then it will recover and will be precise.
> >
> > But it's exactly the same in your scheme, right? If we stopped seeing
> > events on a CPU due to lost events, we will assume it's not running.
> >
> > And generally lost events will always lead to imprecision. That's
> > unavoidable. It's only important if the imprecision is limited and
> > proportional to the number of lost events. And this is the case for
> > the SEQ scheme.
> >
> >
> > > > > > > Also, postponing thread accounting can be complex.  I think it should wait
> > > > > > > for all other threads to get a sample.  Maybe some threads exited and
> > > > > > > their records were lost too.
> > > > > >
> > > > > > Yes, in order to understand what's the last event for (TID,SEQ) we
> > > > > > need to look ahead and find the event (TID,SEQ+1). The easiest way to
> > > > > > do it would be to do 2 passes over the trace. That's the cost of
> > > > > > saving trace space + being resilient to lost events.
> > > > > >
> > > > > > Do you see any other issues with this scheme besides requiring 2 passes?
> > > > >
> > > > > Well.. a 2-pass approach itself can be a problem due to the slowness it'd
> > > > > bring.  Some people already complain about the speed of perf report.
> > > >
> > > > Is trace processing CPU-bound or memory-bound? If it's CPU-bound, then
> > > > the second pass may be OK-ish, since we will need minimal CPU
> > > > processing during the first pass.
> > >
> > > It depends on the size of data, but I guess it's CPU-bound in most cases.
> > >
> > > >
> > > >
> > > > > I think we can simply reset the parallelism in all processes after LOST
> > > > > and set current process to the idle task.  It'll catch up as soon as it
> > > > > sees samples from all CPUs.
> > > >
> > > > I guess we can approximate parallelism as you described here:
> > > >
> > > > > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > > > > CPU and clear the current thread after some period (2x of given freq?).
> > > >
> > > > We probably don't need to do anything special for lost events in this
> > > > scheme at all. If the gap caused by lost events is tiny, then we
> > > > consider nothing happened. If the gap is large enough, then we
> > > > consider the CPU as idle for the duration of the gap. Either way it
> > > > will be handled on common grounds.
> > >
> > > How do you know if it's tiny?  Do you mean the seq remains after lost?
> >
> > I was talking about your scheme based on CPU numbers.
> >
> >
> > > > But tuning of these heuristics + testing and verification may be a bit
> > > > of a problem. I would hate to end up with a tool which I won't trust.
> > > >
> > > > Here:
> > > > "after some period (2x of given freq?)"
> > > > do you mean 2x average/median period, or 1/2 average/median period?
> > > > (2x freq is 1/2 period)
> > >
> > > Oh, sorry.  It's 2x period.
> > >
> > > >
> > > > Ideally, we consider a CPU idle after 1/2 period after it switched to
> > > > the idle task and we stop receiving samples.
> > > > But on the other hand, we don't want to consider it constantly
> > > > becoming idle, when it's just doing normal sampling with the normal
> > > > period...
> > > >
> > > > So ideally the algorithm should be something like:
> > > > let's say average/median sampling period is P
> > > > we got last sample for CPU X at time T
> > > > if by time T+2P we have not seen any other sample on CPU X, then
> > > > consider CPU X idle since T+0.5P
> > > >
> > > > But this would also require either 2 passes over the data, or some
> > > > kind of look ahead similar to the algo I proposed...
> > >
> > > I think we can do it in 1 pass.  For each sample,
> > >
> > >   for_each_cpu(cpu) {
> > >       if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
> > >           if (current[cpu]->thread != idle) {
> > >               current[cpu]->thread->parallelism--;
> > >               current[cpu]->thread = idle;
> > >           }
> > >       }
> > >   }
> > >
> > >   leader = machine__findnew_thread(machine, sample->pid);
> > >   current[sample->cpu]->last_timestamp = sample->timestamp;
> > >
> > >   if (current[sample->cpu]->thread != leader) {
> > >       if (current[sample->cpu]->thread != idle)
> > >           current[sample->cpu]->thread->parallelism--;
> > >
> > >       current[sample->cpu]->thread = leader;
> > >       leader->parallelism++;
> > >   }
> > >
> > >   sample->parallelism = leader->parallelism;
> >
> > As I described, for this simple 1 pass algorithm, I am afraid of imprecision.
> > The thread has not continued to run for 2 periods; it ran for 0-1 periods.
> >
> >
> >
> > > > Also, do we take the median period? or average? do we update it over
> > > > time (say, if CPU freq changes)? do we count it globally, or per CPU
> > > > (in case CPUs run at different freqs)?
> > >
> > > Oh, perf tools use default frequency of 4000 Hz.
> >
> > Is the actual sample rate reasonably precise across time/CPUs?
> >
> > > Maybe we can use this
> > > only for the frequency mode which means user didn't use -c option or
> > > similar in the event description.
> 
> 
> All-in-all I think the best option for now is using CPU IDs to track
> parallelism as you suggested, but be more precise with idle detection.
> 2 passes over the trace may be fine to detect idle points. I see the
> most time now spent in hist_entry__cmp, which accesses other entries
> and is effectively part of the O(N*logN) processing, so a simple O(N) pass
> shouldn't slow it down much.
> That's what I would try. But I would also try to assess the precision
> of this approach by comparing with results of using explicit switch
> events.

It's not clear to me how you want to maintain the idle info in the 2
pass approach.  Please feel free to propose something based on this
work.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-28 18:38                                 ` Namhyung Kim
@ 2025-05-30  5:50                                   ` Dmitry Vyukov
  2025-05-30 22:05                                     ` Namhyung Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-30  5:50 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Wed, 28 May 2025 at 20:38, Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 27, 2025 at 09:14:34AM +0200, Dmitry Vyukov wrote:
> > On Wed, 21 May 2025 at 09:30, Dmitry Vyukov <dvyukov@google.com> wrote:
> > >
> > > On Wed, 21 May 2025 at 00:50, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > Sorry for the delay.
> > > > > > > >
> > > > > > > > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > > > > > > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > > > > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > > > > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > > > > > > > Is it that PERF_SAMPLE_CPU present only for system-wide profiles?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > > > > > > > >
> > > > > > > > > > > > Yep, that's why I said "Basically".  So it's not 100% guarantee.
> > > > > > > > > > > >
> > > > > > > > > > > > We may disable latency column by default in this case and show warning
> > > > > > > > > > > > if it's requested.  Or we may add a new attribute to emit sched-switch
> > > > > > > > > > > > records only for idle tasks and enable the latency report only if the
> > > > > > > > > > > > data has sched-switch records.
> > > > > > > > > > > >
> > > > > > > > > > > > What do you think?
> > > > > > > > > > >
> > > > > > > > > > > Depends on what problem we are trying to solve:
> > > > > > > > > > >
> > > > > > > > > > > 1. Enabling latency profiling for system-wide mode.
> > > > > > > > > > >
> > > > > > > > > > > 2. Switch events bloating trace too much.
> > > > > > > > > > >
> > > > > > > > > > > 3. Lost switch events lead to imprecise accounting.
> > > > > > > > > > >
> > > > > > > > > > > The patch mentions all 3 :)
> > > > > > > > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > > > > > > > An active single process profile can emit more samples than a
> > > > > > > > > > > system-wide profile on a lightly loaded system.
> > > > > > > > > >
> > > > > > > > > > True.  But we don't need to care about lightly loaded systems as they
> > > > > > > > > > won't cause problems.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > > > > > > > equally subject to the lost events problem.
> > > > > > > > > >
> > > > > > > > > > Right, but I'm afraid that in practice it'll increase the chance of losing
> > > > > > > > > > records in system-wide mode.  The default size of a sample in system-wide
> > > > > > > > > > mode is 56 bytes and the size of a switch record is 48 bytes.  And the default
> > > > > > > > > > sample frequency is 4000 Hz but it cannot control the rate of the
> > > > > > > > > > switch.  I saw around 10000 Hz of switches per CPU on my work env.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > For problem 1: we can just permit --latency for system wide mode and
> > > > > > > > > > > fully rely on switch events.
> > > > > > > > > > > It's not any worse than we do now (wrt both profile size and lost events).
> > > > > > > > > >
> > > > > > > > > > This can be an option and it'd work well on lightly loaded systems.
> > > > > > > > > > Maybe we can just try it first.  But I think it's better to have an
> > > > > > > > > > option to make it work on heavily loaded systems.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > > > > > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > > > > > > > > we want, then your current accounting code will work w/o any changes.
> > > > > > > > > > > This should help wrt trace size only for system-wide mode (provided
> > > > > > > > > > > that user already enables CPU accounting for other reasons, otherwise
> > > > > > > > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > > > > > > > switch events).
> > > > > > > > > >
> > > > > > > > > > I'm not sure how we can add the fake samples.  The switch events will be
> > > > > > > > > > from the kernel and we may add the condition in the attribute.
> > > > > > > > > >
> > > > > > > > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > > > > > > > lots of them, and missing any will lead to wrong accounting.
> > > > > > > > > >
> > > > > > > > > > I don't know how severe the situation will be.  On heavily loaded
> > > > > > > > > > systems, the idle task won't run much and data size won't increase.
> > > > > > > > > > On lightly loaded systems, increased data will likely be handled well.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > A principled approach would be to attach a per-thread scheduler
> > > > > > > > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > > > > > > > be incremented on every context switch. Then any subset of CPU should
> > > > > > > > > > > be enough to understand when a task was scheduled in and out
> > > > > > > > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > > > > > > > switched out on the last sample with sequence number N).
> > > > > > > > > >
> > > > > > > > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > > > > > > > What's needed is when the CPU was idle, right?
> > > > > > > > >
> > > > > > > > > I mean the following.
> > > > > > > > > Each sample has a TID.
> > > > > > > > > We add a SEQ field, which is per-thread and is incremented after every
> > > > > > > > > rescheduling of the thread.
> > > > > > > > >
> > > > > > > > > When we see the last sample for (TID,SEQ), we pretend there is SCHED
> > > > > > > > > OUT event for this thread at this timestamp. When we see the first
> > > > > > > > > sample for (TID,SEQ+1), we pretend there is SCHED IN event for this
> > > > > > > > > thread at this timestamp.
> > > > > > > > >
> > > > > > > > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > > > > > > > pretend they happen for accounting purposes. We may actually
> > > > > > > > > materialize them in the perf tool, or we may just update parallelism
> > > > > > > > > as if they happen.
> > > > > > > >
> > > > > > > > Thanks for the explanation.  But I don't think it needs the SEQ and
> > > > > > > > SCHED IN/OUT generated from it to track lost records.  Please see below.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > With this scheme we can lose absolutely any subset of samples, and
> > > > > > > > > still get very precise accounting. When we lose samples, the profile
> > > > > > > > > of course becomes a bit less precise, but the effect is local and
> > > > > > > > > recoverable.
> > > > > > > > >
> > > > > > > > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > > > > > > > shorten/postpone the thread accounting in the process parallelism
> > > > > > > > > level. If we lose a middle (TID,SEQ), then parallelism is not
> > > > > > > > > affected.
> > > > > > > >
> > > > > > > > I'm afraid it cannot check parallelism by just seeing the current thread.
> > > > > > > > I guess it would need information from other threads even if it has same
> > > > > > > > SEQ.
> > > > > > >
> > > > > > > Yes, we still count parallelism like you do in this patch, we just use
> > > > > > > the SEQ info instead of CPU numbers and explicit switch events.
> > > > > >
> > > > > > I mean after records are lost, let's say
> > > > > >
> > > > > >   t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
> > > > > >   t2: LOST
> > > > > >   t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)
> > > > > >
> > > > > > I don't think we can continue to use parallelism of 4 after LOST even if
> > > > > > it has the same seq because it cannot know if other threads switched on
> > > > > > other CPUs.  Then do we really need the seq?
> > > > >
> > > > > I do not understand the problem you describe.
> > > > > We just keep updating parallelism according to the algorithm I
> > > > > described. It works fine in the presence of lost events.
> > > >
> > > > Do you think it's ok to use 4 if seq is the same?  I'm afraid it'd be
> > > > inaccurate.
> > >
> > > It will be inaccurate briefly for the period of 1 sample if we lost
> > > specifically the last/first sample of a thread scheduling quantum. And
> > > then it will recover and will be precise.
> > >
> > > But it's exactly the same in your scheme, right? If we stopped seeing
> > > events on a CPU due to lost events, we will assume it's not running.
> > >
> > > And generally lost events will always lead to imprecision. That's
> > > unavoidable. It's only important if the imprecision is limited and
> > > proportional to the number of lost events. And this is the case for
> > > the SEQ scheme.
> > >
> > >
> > > > > > > > Also, postponing thread accounting can be complex.  I think it should wait
> > > > > > > > for all other threads to get a sample.  Maybe some threads exited and
> > > > > > > > their records were lost too.
> > > > > > >
> > > > > > > Yes, in order to understand what's the last event for (TID,SEQ) we
> > > > > > > need to look ahead and find the event (TID,SEQ+1). The easiest way to
> > > > > > > do it would be to do 2 passes over the trace. That's the cost of
> > > > > > > saving trace space + being resilient to lost events.
> > > > > > >
> > > > > > > Do you see any other issues with this scheme besides requiring 2 passes?
> > > > > >
> > > > > > Well.. a 2-pass approach itself can be a problem due to the slowness it'd
> > > > > > bring.  Some people already complain about the speed of perf report.
> > > > >
> > > > > Is trace processing CPU-bound or memory-bound? If it's CPU-bound, then
> > > > > the second pass may be OK-ish, since we will need minimal CPU
> > > > > processing during the first pass.
> > > >
> > > > It depends on the size of data, but I guess it's CPU-bound in most cases.
> > > >
> > > > >
> > > > >
> > > > > > I think we can simply reset the parallelism in all processes after LOST
> > > > > > and set current process to the idle task.  It'll catch up as soon as it
> > > > > > sees samples from all CPUs.
> > > > >
> > > > > I guess we can approximate parallelism as you described here:
> > > > >
> > > > > > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > > > > > CPU and clear the current thread after some period (2x of given freq?).
> > > > >
> > > > > We probably don't need to do anything special for lost events in this
> > > > > scheme at all. If the gap caused by lost events is tiny, then we
> > > > > consider nothing happened. If the gap is large enough, then we
> > > > > consider the CPU as idle for the duration of the gap. Either way it
> > > > > will be handled on common grounds.
> > > >
> > > > How do you know if it's tiny?  Do you mean the seq remains after lost?
> > >
> > > I was talking about your scheme based on CPU numbers.
> > >
> > >
> > > > > But tuning of these heuristics + testing and verification may be a bit
> > > > > of a problem. I would hate to end up with a tool which I won't trust.
> > > > >
> > > > > Here:
> > > > > "after some period (2x of given freq?)"
> > > > > do you mean 2x average/median period, or 1/2 average/median period?
> > > > > (2x freq is 1/2 period)
> > > >
> > > > Oh, sorry.  It's 2x period.
> > > >
> > > > >
> > > > > Ideally, we consider a CPU idle after 1/2 period after it switched to
> > > > > the idle task and we stop receiving samples.
> > > > > But on the other hand, we don't want to consider it constantly
> > > > > becoming idle, when it's just doing normal sampling with the normal
> > > > > period...
> > > > >
> > > > > So ideally the algorithm should be something like:
> > > > > let's say average/median sampling period is P
> > > > > we got last sample for CPU X at time T
> > > > > if by time T+2P we have not seen any other sample on CPU X, then
> > > > > consider CPU X idle since T+0.5P
> > > > >
> > > > > But this would also require either 2 passes over the data, or some
> > > > > kind of look ahead similar to the algo I proposed...
> > > >
> > > > I think we can do it in 1 pass.  For each sample,
> > > >
> > > >   for_each_cpu(cpu) {
> > > >       if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
> > > >           if (current[cpu]->thread != idle) {
> > > >               current[cpu]->thread->parallelism--;
> > > >               current[cpu]->thread = idle;
> > > >           }
> > > >       }
> > > >   }
> > > >
> > > >   leader = machine__findnew_thread(machine, sample->pid);
> > > >   current[sample->cpu]->last_timestamp = sample->timestamp;
> > > >
> > > >   if (current[sample->cpu]->thread != leader) {
> > > >       if (current[sample->cpu]->thread != idle)
> > > >           current[sample->cpu]->thread->parallelism--;
> > > >
> > > >       current[sample->cpu]->thread = leader;
> > > >       leader->parallelism++;
> > > >   }
> > > >
> > > >   sample->parallelism = leader->parallelism;
> > >
> > > As I described, for this simple 1 pass algorithm, I am afraid of imprecision.
> > > The thread has not continued to run for 2 periods; it ran for 0-1 periods.
> > >
> > >
> > >
> > > > > Also, do we take the median period? or average? do we update it over
> > > > > time (say, if CPU freq changes)? do we count it globally, or per CPU
> > > > > (in case CPUs run at different freqs)?
> > > >
> > > > Oh, perf tools use default frequency of 4000 Hz.
> > >
> > > Is the actual sample rate reasonably precise across time/CPUs?
> > >
> > > > Maybe we can use this
> > > > only for the frequency mode which means user didn't use -c option or
> > > > similar in the event description.
> >
> >
> > All-in-all I think the best option for now is using CPU IDs to track
> > parallelism as you suggested, but be more precise with idle detection.
> > 2 passes over the trace may be fine to detect idle points. I see the
> > most time now spent in hist_entry__cmp, which accesses other entries
> > and is effectively part of the O(N*logN) processing, so a simple O(N) pass
> > shouldn't slow it down much.
> > That's what I would try. But I would also try to assess the precision
> > of this approach by comparing with results of using explicit switch
> > events.
>
> It's not clear to me how you want to maintain the idle info in the 2
> pass approach.  Please feel free to propose something based on this
> work.


What part of it is unclear?

Basically, in the first pass we only mark events as sched_out/in.
When we don't see samples on a CPU for 2*period, we mark the previous
sample on the CPU as sched_out:

  // Assuming the period is stable across time and CPUs.
  for_each_cpu(cpu) {
      if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
          if (current[cpu]->thread != idle)
              current[cpu]->last_sample->sched_out = true;
      }
  }

  leader = machine__findnew_thread(machine, sample->pid);
  if (current[sample->cpu]->thread != leader) {
    current[sample->cpu]->last_sample->sched_out = true;
    sample->sched_in = true;
  }
  current[sample->cpu]->thread = leader;
  current[sample->cpu]->last_sample = sample;
  current[sample->cpu]->last_timestamp = sample->timestamp;


On the second pass we use the precomputed sched_in/out to calculate parallelism:

  leader = machine__findnew_thread(machine, sample->pid);
  if (sample->sched_in)
    leader->parallelism++;
  sample->parallelism = leader->parallelism;
  if (sample->sched_out)
    leader->parallelism--;

This is more precise b/c we don't consider a thread running for
2*period after it stopped running.
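
To make this concrete, here is a minimal self-contained sketch of the two
passes in plain C, assuming the samples are already sorted by time in one
array.  The ex_* types, the MAX_CPUS/MAX_PIDS bounds and the quadratic
per-CPU scan are illustration-only assumptions, not perf-tool internals:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stddef.h>

  #define MAX_CPUS 4096
  #define MAX_PIDS 65536          /* pids assumed to be < MAX_PIDS */

  struct ex_sample {
      uint64_t time;
      int cpu;                    /* < MAX_CPUS */
      int pid;                    /* process (leader) id, >= 0 */
      bool sched_in;
      bool sched_out;
      int parallelism;
  };

  struct ex_cpu_state {
      int pid;                    /* -1 means the CPU is considered idle */
      struct ex_sample *last;     /* last sample seen on this CPU */
  };

  /* Pass 1: infer scheduling points from CPU occupancy and idle gaps. */
  static void mark_sched_points(struct ex_sample *s, size_t n, uint64_t period)
  {
      static struct ex_cpu_state cpus[MAX_CPUS];

      for (int c = 0; c < MAX_CPUS; c++)
          cpus[c] = (struct ex_cpu_state){ .pid = -1 };

      for (size_t i = 0; i < n; i++) {
          /* A CPU silent for more than 2*period is treated as gone idle. */
          for (int c = 0; c < MAX_CPUS; c++) {
              if (cpus[c].pid != -1 &&
                  cpus[c].last->time + 2 * period < s[i].time) {
                  cpus[c].last->sched_out = true;
                  cpus[c].pid = -1;
              }
          }

          struct ex_cpu_state *cur = &cpus[s[i].cpu];

          if (cur->pid != s[i].pid) {
              if (cur->pid != -1)
                  cur->last->sched_out = true;
              s[i].sched_in = true;
          }
          cur->pid = s[i].pid;
          cur->last = &s[i];
      }

      /* Whatever still looks running at the end of the trace is switched
       * out at its last sample. */
      for (int c = 0; c < MAX_CPUS; c++)
          if (cpus[c].pid != -1)
              cpus[c].last->sched_out = true;
  }

  /* Pass 2: turn the sched_in/out marks into per-process parallelism. */
  static void compute_parallelism(struct ex_sample *s, size_t n)
  {
      static int parallelism[MAX_PIDS];

      for (size_t i = 0; i < n; i++) {
          if (s[i].sched_in)
              parallelism[s[i].pid]++;
          s[i].parallelism = parallelism[s[i].pid];
          if (s[i].sched_out)
              parallelism[s[i].pid]--;
      }
  }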

A more precise approach would probably be to consider the thread
running for 0.5*period after the last sample (and similarly for
0.5*period before the first sample), but it would require injecting
sched_in/out events into the trace at these points.
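
As a rough sketch of that variant (again with made-up types rather than a
concrete proposal for the tool), the fabricated switch points could be
derived from the first/last sample of each scheduling quantum and merged
into the time-ordered stream before the parallelism pass:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical synthetic switch record, fabricated by the tool purely
   * for accounting; it is never emitted by the kernel. */
  struct ex_switch {
      uint64_t time;
      int pid;
      bool in;                    /* true = sched in, false = sched out */
  };

  /* First sample of a quantum at t_first: pretend the thread was
   * scheduled in half a period earlier. */
  static struct ex_switch synth_sched_in(uint64_t t_first, int pid, uint64_t period)
  {
      return (struct ex_switch){ .time = t_first - period / 2,
                                 .pid = pid, .in = true };
  }

  /* Last sample of a quantum at t_last: pretend the thread was switched
   * out half a period later. */
  static struct ex_switch synth_sched_out(uint64_t t_last, int pid, uint64_t period)
  {
      return (struct ex_switch){ .time = t_last + period / 2,
                                 .pid = pid, .in = false };
  }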

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-30  5:50                                   ` Dmitry Vyukov
@ 2025-05-30 22:05                                     ` Namhyung Kim
  2025-05-31  6:31                                       ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2025-05-30 22:05 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Fri, May 30, 2025 at 07:50:45AM +0200, Dmitry Vyukov wrote:
> On Wed, 28 May 2025 at 20:38, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > On Tue, May 27, 2025 at 09:14:34AM +0200, Dmitry Vyukov wrote:
> > > On Wed, 21 May 2025 at 09:30, Dmitry Vyukov <dvyukov@google.com> wrote:
> > > >
> > > > > Maybe we can use this
> > > > > only for the frequency mode which means user didn't use -c option or
> > > > > similar in the event description.
> > >
> > >
> > > All-in-all I think the best option for now is using CPU IDs to track
> > > parallelism as you suggested, but be more precise with idle detection.
> > > 2 passes over the trace may be fine to detect idle points. I see the
> > > most time now spent in hist_entry__cmp, which accesses other entries
> > > and is effectively part of the O(N*logN) processing, so a simple O(N) pass
> > > shouldn't slow it down much.
> > > That's what I would try. But I would also try to assess the precision
> > > of this approach by comparing with results of using explicit switch
> > > events.
> >
> > It's not clear to me how you want to maintain the idle info in the 2
> > pass approach.  Please feel free to propose something based on this
> > work.
> 
> 
> What part of it is unclear?
> 
> Basically, in the first pass we only mark events as sched_out/in.
> When we don't see samples on a CPU for 2*period, we mark the previous
> sample on the CPU as sched_out:
> 
>   // Assuming the period is stable across time and CPUs.
>   for_each_cpu(cpu) {
>       if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
>           if (current[cpu]->thread != idle)
>               current[cpu]->last_sample->sched_out = true;
>       }
>   }
> 
>   leader = machine__findnew_thread(machine, sample->pid);
>   if (current[sample->cpu]->thread != leader) {
>     current[sample->cpu]->last_sample->sched_out = true;
>     sample->sched_in = true;
>   }
>   current[sample->cpu]->thread = leader;
>   current[sample->cpu]->last_sample = sample;
>   current[sample->cpu]->last_timestamp = sample->timestamp;

Oh, you wanted to save the info in the sample.  But I'm afraid it won't
work since the sample is stack-allocated for one-time use in
perf_session__deliver_event().

> 
> 
> On the second pass we use the precomputed sched_in/out to calculate parallelism:
> 
>   leader = machine__findnew_thread(machine, sample->pid);
>   if (sample->sched_in)
>     leader->parallelism++;
>   sample->parallelism = leader->parallelism;
>   if (sample->sched_out)
>     leader->parallelism--;
> 
> This is more precise b/c we don't consider a thread running for
> 2*period after it stopped running.

IIUC it can make some samples have less parallelism right before
they go to idle.

 
> A more precise approach would probably be to consider the thread
> running for 0.5*period after the last sample (and similarly for
> 0.5*period before the first sample), but it would require injecting
> sched_in/out events into the trace at these points.

Yep, that will fix the issue.  But then how to inject the events is the
problem.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode
  2025-05-30 22:05                                     ` Namhyung Kim
@ 2025-05-31  6:31                                       ` Dmitry Vyukov
  0 siblings, 0 replies; 29+ messages in thread
From: Dmitry Vyukov @ 2025-05-31  6:31 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ian Rogers, Kan Liang, Jiri Olsa,
	Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Andi Kleen

On Sat, 31 May 2025 at 00:05, Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Fri, May 30, 2025 at 07:50:45AM +0200, Dmitry Vyukov wrote:
> > On Wed, 28 May 2025 at 20:38, Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Tue, May 27, 2025 at 09:14:34AM +0200, Dmitry Vyukov wrote:
> > > > On Wed, 21 May 2025 at 09:30, Dmitry Vyukov <dvyukov@google.com> wrote:
> > > > >
> > > > > > Maybe we can use this
> > > > > > only for the frequency mode which means user didn't use -c option or
> > > > > > similar in the event description.
> > > >
> > > >
> > > > All-in-all I think the best option for now is using CPU IDs to track
> > > > parallelism as you suggested, but be more precise with idle detection.
> > > > 2 passes over the trace may be fine to detect idle points. I see the
> > > > most time now spent in hist_entry__cmp, which accesses other entries
> > > > and is effectively part of the O(N*logN) processing, so a simple O(N) pass
> > > > shouldn't slow it down much.
> > > > That's what I would try. But I would also try to assess the precision
> > > > of this approach by comparing with results of using explicit switch
> > > > events.
> > >
> > > It's not clear to me how you want to maintain the idle info in the 2
> > > pass approach.  Please feel free to propose something based on this
> > > work.
> >
> >
> > What part of it is unclear?
> >
> > Basically, in the first pass we only mark events as sched_out/in.
> > When we don't see samples on a CPU for 2*period, we mark the previous
> > sample on the CPU as sched_out:
> >
> >   // Assuming the period is stable across time and CPUs.
> >   for_each_cpu(cpu) {
> >       if (current[cpu]->last_timestamp + 2*period < sample->timestamp) {
> >           if (current[cpu]->thread != idle)
> >               current[cpu]->last_sample->sched_out = true;
> >       }
> >   }
> >
> >   leader = machine__findnew_thread(machine, sample->pid);
> >   if (current[sample->cpu]->thread != leader) {
> >     current[sample->cpu]->last_sample->sched_out = true;
> >     sample->sched_in = true;
> >   }
> >   current[sample->cpu]->thread = leader;
> >   current[sample->cpu]->last_sample = sample;
> >   current[sample->cpu]->last_timestamp = sample->timestamp;
>
> Oh, you wanted to save the info in the sample.  But I'm afraid it won't
> work since the sample is stack-allocated for one-time use in
> perf_session__deliver_event().

No, I just showed the algorithm. I don't know perf well enough to say
how to implement it.

> > On the second pass we use the precomputed sched_in/out to calculate parallelism:
> >
> >   leader = machine__findnew_thread(machine, sample->pid);
> >   if (sample->sched_in)
> >     leader->parallelism++;
> >   sample->parallelism = leader->parallelism;
> >   if (sample->sched_out)
> >     leader->parallelism--;
> >
> > This is more precise b/c we don't consider a thread running for
> > 2*period after it stopped running.
>
> IIUC it can make some samples have less parallelism right before
> they go to idle.
>
> > A more precise approach would probably be to consider the thread
> > running for 0.5*period after the last sample (and similarly for
> > 0.5*period before the first sample), but it would require injecting
> > sched_in/out events into the trace at these points.
>
> Yep, that will fix the issue.  But then how to inject the events is the
> problem.

Yes, but incorrect data is an incomparably worse problem than writing a
bit of code.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-05-31  6:31 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-03  0:36 [RFC/PATCH] perf report: Support latency profiling in system-wide mode Namhyung Kim
2025-05-04  8:22 ` Ingo Molnar
2025-05-04 19:52   ` Namhyung Kim
2025-05-05  8:03     ` Dmitry Vyukov
2025-05-06 14:57       ` Arnaldo Carvalho de Melo
2025-05-06 16:58         ` Arnaldo Carvalho de Melo
2025-05-07  9:58           ` Dmitry Vyukov
2025-05-07 15:47             ` Arnaldo Carvalho de Melo
2025-05-07 23:51               ` Namhyung Kim
2025-05-05  8:08 ` Dmitry Vyukov
2025-05-06  5:30   ` Namhyung Kim
2025-05-06  5:55     ` Dmitry Vyukov
2025-05-06  6:43       ` Namhyung Kim
2025-05-06  6:46         ` Dmitry Vyukov
2025-05-06  7:09           ` Namhyung Kim
2025-05-06  7:40             ` Dmitry Vyukov
2025-05-07 23:43               ` Namhyung Kim
2025-05-08 12:24                 ` Dmitry Vyukov
2025-05-16 16:33                   ` Namhyung Kim
2025-05-19  6:00                     ` Dmitry Vyukov
2025-05-20  1:43                       ` Namhyung Kim
2025-05-20  6:45                         ` Dmitry Vyukov
2025-05-20 22:50                           ` Namhyung Kim
2025-05-21  7:30                             ` Dmitry Vyukov
2025-05-27  7:14                               ` Dmitry Vyukov
2025-05-28 18:38                                 ` Namhyung Kim
2025-05-30  5:50                                   ` Dmitry Vyukov
2025-05-30 22:05                                     ` Namhyung Kim
2025-05-31  6:31                                       ` Dmitry Vyukov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).