[PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly
@ 2024-09-27 20:27 Howard Chu
  2024-09-27 20:27 ` [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide Howard Chu
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Changes in v6:
 - Make patches bisectable

Changes in v5:
 - Delete unnecessary copy in BPF program
 - Remove sample_embed from perf header, hard code off-cpu stuff instead
 - Move evsel__is_offcpu_event() to evsel.h
 - Minor changes to the test
 - Edit some comments

Changes in v4:
 - Minimize the size of data output by perf_event_output()
 - Keep only one off-cpu event
 - Change off-cpu threshold's unit to microseconds
 - Set a default off-cpu threshold
 - Print the correct error message for the field 'embed' in perf data header

Changes in v3:
 - Add off-cpu-thresh argument
 - Process direct off-cpu samples in post

Changes in v2:
 - Remove unnecessary comments.
 - Rename function off_cpu_change_type to off_cpu_prepare_parse

v1:

As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323

Currently, off-cpu samples are dumped when perf record is exiting. This
results in off-cpu samples being after the regular samples. This patch
series makes possible dumping off-cpu samples on-the-fly, directly into
perf ring buffer. And it dispatches those samples to the correct format
for perf.data consumers.

Before:
```
     migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
            perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
            perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
            perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
     migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
     migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])

sshd  708098 [000] 18446744069.414584:     286392 offcpu-time: 
	    79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
	    585690935cca [unknown] (/usr/bin/sshd)
```

After:
```
            perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
            perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
         swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
         swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
    blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])


    blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])


         swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
     dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
```

Howard Chu (8):
  perf evsel: Set off-cpu BPF output to system-wide
  perf record --off-cpu: Add --off-cpu-thresh
  perf record --off-cpu: Parse offcpu-time event
  perf record off-cpu: Dump direct off-cpu samples in BPF
  perf record --off-cpu: Dump total off-cpu time at the end.
  perf evsel: Delete unnecessary = 0
  perf record --off-cpu: Parse BPF output embedded data
  perf test: Add direct off-cpu dumping test

 tools/perf/builtin-record.c             |  26 ++++
 tools/perf/builtin-script.c             |   4 +-
 tools/perf/tests/builtin-test.c         |   1 +
 tools/perf/tests/shell/record_offcpu.sh |  29 ++++
 tools/perf/tests/tests.h                |   1 +
 tools/perf/tests/workloads/Build        |   1 +
 tools/perf/tests/workloads/offcpu.c     |  16 +++
 tools/perf/util/bpf_off_cpu.c           | 174 +++++++++++++++---------
 tools/perf/util/bpf_skel/off_cpu.bpf.c  | 123 +++++++++++++++++
 tools/perf/util/evsel.c                 |  47 ++++---
 tools/perf/util/evsel.h                 |   6 +
 tools/perf/util/off_cpu.h               |  10 +-
 tools/perf/util/record.h                |   1 +
 tools/perf/util/session.c               |  12 +-
 14 files changed, 359 insertions(+), 92 deletions(-)
 create mode 100644 tools/perf/tests/workloads/offcpu.c

-- 
2.43.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-30  5:37   ` Namhyung Kim
  2024-09-27 20:27 ` [PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

pid = -1 for off-cpu's bpf-output event.

This makes 'perf record -p <PID> --off-cpu', and 'perf record --off-cpu
<workload>' work. Otherwise bpf-output cannot be collected.

The reason (conjecture): say if we open perf_event on pid = 11451, then
in BPF, we call bpf_perf_event_output() when a direct sample is ready to
be dumped. But currently the perf_event of pid 11451 is not __fully__
sched_in yet, so in kernel/trace/bpf_trace.c's
__bpf_perf_event_output(), there will be event->oncpu != cpu, thus
return -EOPNOTSUPP, direct off-cpu sample output failed.

if (unlikely(event->oncpu != cpu))
		return -EOPNOTSUPP;

So I'm making it pid = -1, everybody can do bpf_perf_event_output()

P.S. In perf trace this is not necessary, because it uses syscall
tracepoints, instead of sched_switch.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/evsel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index edfb376f0611..500ca62669cb 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2368,6 +2368,9 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,

 			test_attr__ready();

+			if (evsel__is_offcpu_event(evsel))
+				pid = -1;
+
 			/* Debug message used by test scripts */
 			pr_debug2_peo("sys_perf_event_open: pid %d  cpu %d  group_fd %d  flags %#lx",
 				pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide
  2024-09-27 20:27 ` [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide Howard Chu
@ 2024-09-30  5:37   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2024-09-30  5:37 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel

On Fri, Sep 27, 2024 at 01:27:29PM -0700, Howard Chu wrote:
> pid = -1 for off-cpu's bpf-output event.
> 
> This makes 'perf record -p <PID> --off-cpu', and 'perf record --off-cpu
> <workload>' work. Otherwise bpf-output cannot be collected.
> 
> The reason (conjecture): say if we open perf_event on pid = 11451, then
> in BPF, we call bpf_perf_event_output() when a direct sample is ready to
> be dumped. But currently the perf_event of pid 11451 is not __fully__
> sched_in yet, so in kernel/trace/bpf_trace.c's
> __bpf_perf_event_output(), there will be event->oncpu != cpu, thus
> return -EOPNOTSUPP, direct off-cpu sample output failed.
> 
> if (unlikely(event->oncpu != cpu))
> 		return -EOPNOTSUPP;
> 
> So I'm making it pid = -1, everybody can do bpf_perf_event_output()
> 
> P.S. In perf trace this is not necessary, because it uses syscall
> tracepoints, instead of sched_switch.
> 
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/util/evsel.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index edfb376f0611..500ca62669cb 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2368,6 +2368,9 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>  
>  			test_attr__ready();
>  
> +			if (evsel__is_offcpu_event(evsel))
> +				pid = -1;
> +

This looks hacky.  I think you'll end up having two copies of offcpu
events if there are two target tasks.  Maybe you can replace the thread
map of the offcpu event to have a single entry (-1) for any thread after
creating the event.

Thanks,
Namhyung


>  			/* Debug message used by test scripts */
>  			pr_debug2_peo("sys_perf_event_open: pid %d  cpu %d  group_fd %d  flags %#lx",
>  				pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
  2024-09-27 20:27 ` [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-30  5:40   ` Namhyung Kim
  2024-09-27 20:27 ` [PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event Howard Chu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Add the --off-cpu-thresh argument to specify the off-cpu time threshold.
If the off-cpu time exceeds this threshold, dump the off-cpu data
directly.

Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/builtin-record.c            | 26 ++++++++++++++++++++++++++
 tools/perf/util/bpf_off_cpu.c          |  2 ++
 tools/perf/util/bpf_skel/off_cpu.bpf.c |  2 ++
 tools/perf/util/off_cpu.h              |  2 ++
 tools/perf/util/record.h               |  1 +
 5 files changed, 33 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index adbaf80b398c..bd53fb3c98ec 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3149,6 +3149,28 @@ static int record__parse_mmap_pages(const struct option *opt,
 	return ret;
 }
 
+static int record__parse_off_cpu_thresh(const struct option *opt,
+					const char *str,
+					int unset __maybe_unused)
+{
+	struct record_opts *opts = opt->value;
+	char *endptr;
+	u64 off_cpu_thresh;
+
+	if (!str)
+		return -EINVAL;
+
+	off_cpu_thresh = strtoul(str, &endptr, 10);
+
+	/* threshold isn't string "0", yet strtoull() returns 0, parsing failed */
+	if (*endptr || (off_cpu_thresh == 0 && strcmp(str, "0")))
+		return -EINVAL;
+	else
+		opts->off_cpu_thresh = off_cpu_thresh;
+
+	return 0;
+}
+
 void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
 {
 }
@@ -3342,6 +3364,7 @@ static struct record record = {
 		.ctl_fd              = -1,
 		.ctl_fd_ack          = -1,
 		.synth               = PERF_SYNTH_ALL,
+		.off_cpu_thresh      = OFF_CPU_THRESH_DEFAULT,
 	},
 };
 
@@ -3564,6 +3587,9 @@ static struct option __record_options[] = {
 	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
 	OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
 		   "BPF filter action"),
+	OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "us",
+		     "Dump off-cpu samples if off-cpu time reaches this threshold. The unit is microsecond (default: 500000)",
+		     record__parse_off_cpu_thresh),
 	OPT_END()
 };
 
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index a590a8ac1f9d..eaef643f50e3 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -272,6 +272,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		}
 	}
 
+	skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000;
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu BPF skeleton\n");
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index c152116df72f..5ea320aa9a53 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -97,6 +97,8 @@ const volatile bool uses_cgroup_v1 = false;
 
 int perf_subsys_id = -1;
 
+__u64 sample_id, sample_type, offcpu_thresh;
+
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
  * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2dd67c60f211..357231cb1c38 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -10,6 +10,8 @@ struct record_opts;
 
 #define OFFCPU_EVENT  "offcpu-time"
 
+#define OFF_CPU_THRESH_DEFAULT 500000ull
+
 #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
 			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
 			      PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..3c11416e6627 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -79,6 +79,7 @@ struct record_opts {
 	int	      synth;
 	int	      threads_spec;
 	const char    *threads_user_spec;
+	u64	      off_cpu_thresh;
 };
 
 extern const char * const *record_usage;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh
  2024-09-27 20:27 ` [PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
@ 2024-09-30  5:40   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2024-09-30  5:40 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel

On Fri, Sep 27, 2024 at 01:27:30PM -0700, Howard Chu wrote:
> Add the --off-cpu-thresh argument to specify the off-cpu time threshold.
> If the off-cpu time exceeds this threshold, dump the off-cpu data
> directly.
> 
> Suggested-by: Ian Rogers <irogers@google.com>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/builtin-record.c            | 26 ++++++++++++++++++++++++++
>  tools/perf/util/bpf_off_cpu.c          |  2 ++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c |  2 ++
>  tools/perf/util/off_cpu.h              |  2 ++
>  tools/perf/util/record.h               |  1 +
>  5 files changed, 33 insertions(+)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index adbaf80b398c..bd53fb3c98ec 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -3149,6 +3149,28 @@ static int record__parse_mmap_pages(const struct option *opt,
>  	return ret;
>  }
>  
> +static int record__parse_off_cpu_thresh(const struct option *opt,
> +					const char *str,
> +					int unset __maybe_unused)
> +{
> +	struct record_opts *opts = opt->value;
> +	char *endptr;
> +	u64 off_cpu_thresh;
> +
> +	if (!str)
> +		return -EINVAL;
> +
> +	off_cpu_thresh = strtoul(str, &endptr, 10);

Do you mean strtoull() ?


> +
> +	/* threshold isn't string "0", yet strtoull() returns 0, parsing failed */
> +	if (*endptr || (off_cpu_thresh == 0 && strcmp(str, "0")))
> +		return -EINVAL;
> +	else
> +		opts->off_cpu_thresh = off_cpu_thresh;
> +
> +	return 0;
> +}
> +
>  void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
>  {
>  }
> @@ -3342,6 +3364,7 @@ static struct record record = {
>  		.ctl_fd              = -1,
>  		.ctl_fd_ack          = -1,
>  		.synth               = PERF_SYNTH_ALL,
> +		.off_cpu_thresh      = OFF_CPU_THRESH_DEFAULT,
>  	},
>  };
>  
> @@ -3564,6 +3587,9 @@ static struct option __record_options[] = {
>  	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
>  	OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
>  		   "BPF filter action"),
> +	OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "us",
> +		     "Dump off-cpu samples if off-cpu time reaches this threshold. The unit is microsecond (default: 500000)",
> +		     record__parse_off_cpu_thresh),
>  	OPT_END()
>  };
>  
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index a590a8ac1f9d..eaef643f50e3 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -272,6 +272,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>  		}
>  	}
>  
> +	skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000;
> +
>  	err = off_cpu_bpf__attach(skel);
>  	if (err) {
>  		pr_err("Failed to attach off-cpu BPF skeleton\n");
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index c152116df72f..5ea320aa9a53 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -97,6 +97,8 @@ const volatile bool uses_cgroup_v1 = false;
>  
>  int perf_subsys_id = -1;
>  
> +__u64 sample_id, sample_type, offcpu_thresh;

The sample_id and sample_type aren't used in the patch.

Thanks,
Namhyung

> +
>  /*
>   * Old kernel used to call it task_struct->state and now it's '__state'.
>   * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> index 2dd67c60f211..357231cb1c38 100644
> --- a/tools/perf/util/off_cpu.h
> +++ b/tools/perf/util/off_cpu.h
> @@ -10,6 +10,8 @@ struct record_opts;
>  
>  #define OFFCPU_EVENT  "offcpu-time"
>  
> +#define OFF_CPU_THRESH_DEFAULT 500000ull
> +
>  #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
>  			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
>  			      PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index a6566134e09e..3c11416e6627 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -79,6 +79,7 @@ struct record_opts {
>  	int	      synth;
>  	int	      threads_spec;
>  	const char    *threads_user_spec;
> +	u64	      off_cpu_thresh;
>  };
>  
>  extern const char * const *record_usage;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
  2024-09-27 20:27 ` [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide Howard Chu
  2024-09-27 20:27 ` [PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-30  5:51   ` Namhyung Kim
  2024-09-27 20:27 ` [PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Parse offcpu-time event using parse_event, in off_cpu_start(), write
evlist fds got from evlist__open() to perf_event_array BPF map.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/bpf_off_cpu.c          | 56 +++++++++++++++-----------
 tools/perf/util/bpf_skel/off_cpu.bpf.c |  9 +++++
 2 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index eaef643f50e3..f7233a09ec77 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -13,6 +13,7 @@
 #include "util/cgroup.h"
 #include "util/strlist.h"
 #include <bpf/bpf.h>
+#include <internal/xyarray.h>
 
 #include "bpf_skel/off_cpu.skel.h"
 
@@ -36,41 +37,27 @@ union off_cpu_data {
 	u64 array[1024 / sizeof(u64)];
 };
 
+u64 off_cpu_raw_data[1024 / sizeof(u64)];
+
 static int off_cpu_config(struct evlist *evlist)
 {
-	struct evsel *evsel;
-	struct perf_event_attr attr = {
-		.type	= PERF_TYPE_SOFTWARE,
-		.config = PERF_COUNT_SW_BPF_OUTPUT,
-		.size	= sizeof(attr), /* to capture ABI version */
-	};
-	char *evname = strdup(OFFCPU_EVENT);
-
-	if (evname == NULL)
-		return -ENOMEM;
+	char off_cpu_event[64];
 
-	evsel = evsel__new(&attr);
-	if (!evsel) {
-		free(evname);
-		return -ENOMEM;
+	scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT);
+	if (parse_event(evlist, off_cpu_event)) {
+		pr_err("Failed to open off-cpu event\n");
+		return -1;
 	}
 
-	evsel->core.attr.freq = 1;
-	evsel->core.attr.sample_period = 1;
-	/* off-cpu analysis depends on stack trace */
-	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
-
-	evlist__add(evlist, evsel);
-
-	free(evsel->name);
-	evsel->name = evname;
-
 	return 0;
 }
 
 static void off_cpu_start(void *arg)
 {
 	struct evlist *evlist = arg;
+	struct evsel *evsel;
+	struct perf_cpu pcpu;
+	int i, err;
 
 	/* update task filter for the given workload */
 	if (skel->rodata->has_task && skel->rodata->uses_tgid &&
@@ -84,6 +71,27 @@ static void off_cpu_start(void *arg)
 		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
 	}
 
+	/* sample id and fds in BPF's perf_event_array can only be set after record__open() */
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT);
+		return;
+	}
+
+	if (evsel->core.id)
+		skel->bss->sample_id = evsel->core.id[0];
+
+	perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
+		err = bpf_map__update_elem(skel->maps.offcpu_output,
+					   &pcpu.cpu, sizeof(__u32),
+					   xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
+					   sizeof(__u32), BPF_ANY);
+		if (err) {
+			pr_err("Failed to update perf event map for direct off-cpu dumping\n");
+			return;
+		}
+	}
+
 	skel->bss->enabled = 1;
 }
 
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 5ea320aa9a53..e2a887228fd9 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -18,6 +18,8 @@
 #define MAX_STACKS   32
 #define MAX_ENTRIES  102400
 
+#define MAX_CPUS  4096
+
 struct tstamp_data {
 	__u32 stack_id;
 	__u32 state;
@@ -39,6 +41,13 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } stacks SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, MAX_CPUS);
+} offcpu_output SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event
  2024-09-27 20:27 ` [PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event Howard Chu
@ 2024-09-30  5:51   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2024-09-30  5:51 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel

On Fri, Sep 27, 2024 at 01:27:31PM -0700, Howard Chu wrote:
> Parse offcpu-time event using parse_event, in off_cpu_start(), write
> evlist fds got from evlist__open() to perf_event_array BPF map.
> 
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/util/bpf_off_cpu.c          | 56 +++++++++++++++-----------
>  tools/perf/util/bpf_skel/off_cpu.bpf.c |  9 +++++
>  2 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index eaef643f50e3..f7233a09ec77 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -13,6 +13,7 @@
>  #include "util/cgroup.h"
>  #include "util/strlist.h"
>  #include <bpf/bpf.h>
> +#include <internal/xyarray.h>
>  
>  #include "bpf_skel/off_cpu.skel.h"
>  
> @@ -36,41 +37,27 @@ union off_cpu_data {
>  	u64 array[1024 / sizeof(u64)];
>  };
>  
> +u64 off_cpu_raw_data[1024 / sizeof(u64)];
> +
>  static int off_cpu_config(struct evlist *evlist)
>  {
> -	struct evsel *evsel;
> -	struct perf_event_attr attr = {
> -		.type	= PERF_TYPE_SOFTWARE,
> -		.config = PERF_COUNT_SW_BPF_OUTPUT,
> -		.size	= sizeof(attr), /* to capture ABI version */
> -	};
> -	char *evname = strdup(OFFCPU_EVENT);
> -
> -	if (evname == NULL)
> -		return -ENOMEM;
> +	char off_cpu_event[64];
>  
> -	evsel = evsel__new(&attr);
> -	if (!evsel) {
> -		free(evname);
> -		return -ENOMEM;
> +	scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT);

Is the 'no-inherit=1' really needed?  I guess that's the default and
you won't need it if you open it as a per-cpu event.


> +	if (parse_event(evlist, off_cpu_event)) {
> +		pr_err("Failed to open off-cpu event\n");
> +		return -1;
>  	}
>  
> -	evsel->core.attr.freq = 1;
> -	evsel->core.attr.sample_period = 1;
> -	/* off-cpu analysis depends on stack trace */
> -	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> -
> -	evlist__add(evlist, evsel);
> -
> -	free(evsel->name);
> -	evsel->name = evname;
> -
>  	return 0;
>  }
>  
>  static void off_cpu_start(void *arg)
>  {
>  	struct evlist *evlist = arg;
> +	struct evsel *evsel;
> +	struct perf_cpu pcpu;
> +	int i, err;
>  
>  	/* update task filter for the given workload */
>  	if (skel->rodata->has_task && skel->rodata->uses_tgid &&
> @@ -84,6 +71,27 @@ static void off_cpu_start(void *arg)
>  		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
>  	}
>  
> +	/* sample id and fds in BPF's perf_event_array can only be set after record__open() */
> +	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
> +	if (evsel == NULL) {
> +		pr_err("%s evsel not found\n", OFFCPU_EVENT);
> +		return;
> +	}
> +
> +	if (evsel->core.id)
> +		skel->bss->sample_id = evsel->core.id[0];

Where's it used?

> +
> +	perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
> +		err = bpf_map__update_elem(skel->maps.offcpu_output,
> +					   &pcpu.cpu, sizeof(__u32),

I understand you used pcpu.cpu as a key to the offcpu_output map because
it's a basically per-cpu array.


> +					   xyarray__entry(evsel->core.fd, pcpu.cpu, 0),

But I suspect you should use 'i' instead of pcpu.cpu as an index to the
xyarray.

Thanks,
Namhyung


> +					   sizeof(__u32), BPF_ANY);
> +		if (err) {
> +			pr_err("Failed to update perf event map for direct off-cpu dumping\n");
> +			return;
> +		}
> +	}
> +
>  	skel->bss->enabled = 1;
>  }
>  
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index 5ea320aa9a53..e2a887228fd9 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -18,6 +18,8 @@
>  #define MAX_STACKS   32
>  #define MAX_ENTRIES  102400
>  
> +#define MAX_CPUS  4096
> +
>  struct tstamp_data {
>  	__u32 stack_id;
>  	__u32 state;
> @@ -39,6 +41,13 @@ struct {
>  	__uint(max_entries, MAX_ENTRIES);
>  } stacks SEC(".maps");
>  
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(__u32));
> +	__uint(max_entries, MAX_CPUS);
> +} offcpu_output SEC(".maps");
> +
>  struct {
>  	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
>  	__uint(map_flags, BPF_F_NO_PREALLOC);
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (2 preceding siblings ...)
  2024-09-27 20:27 ` [PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-30  6:23   ` Namhyung Kim
  2024-09-27 20:27 ` [PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Add perf_event_array map for dumping direct off-cpu samples, but keep
the at-the-end approach.

Tons of checking before access, to pass the BPF verifier.

If off-cpu time (represented as delta) exceeds the off-cpu threshold, do
output.

Output PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD, PERF_SAMPLE_CALLCHAIN, and
PERF_SAMPLE_CGROUP in bpf_perf_event_output().

Ideally, we should only output
PERF_SAMPLE_PERIOD (off-cpu time) and PERF_SAMPLE_CALLCHAIN (sched_in
process's callchain). One only needs to set PERF_SAMPLE_TID and
PERF_SAMPLE_CGROUP, and perf_event will do everything for us.

But in reality, that's not the case. Setting PERF_SAMPLE_TID will mostly
give us TID of 0. We might get the correct TID for offcpu-time event
from time to time, but it is really rare.
         swapper       0 [000] offcpu-time:  /
        :1321819 1321819 [002] offcpu-time:  /user.slice/user-1000.slice/session-2.scope
         swapper       0 [001] offcpu-time:  /
         swapper       0 [003] offcpu-time:  /

And setting PERF_SAMPLE_CGROUP doesn't work properly either.
    tmux: server    3701 [003] offcpu-time:  /
    blueman-tray    1064 [001] offcpu-time:  /
            bash 1350867 [001] offcpu-time:  /
            bash 1350844 [000] offcpu-time:  /

We need to retrieve PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD,
PERF_SAMPLE_CALLCHAIN, and PERF_SAMPLE_CGROUP using BPF and output these
four fields.

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 112 +++++++++++++++++++++++++
 tools/perf/util/off_cpu.h              |   8 +-
 2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index e2a887228fd9..c42d0e2d91d8 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -19,6 +19,7 @@
 #define MAX_ENTRIES  102400
 
 #define MAX_CPUS  4096
+#define MAX_OFFCPU_LEN 128
 
 struct tstamp_data {
 	__u32 stack_id;
@@ -34,6 +35,7 @@ struct offcpu_key {
 	__u64 cgroup_id;
 };
 
+/* for dumping at the end */
 struct {
 	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
 	__uint(key_size, sizeof(__u32));
@@ -41,6 +43,14 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } stacks SEC(".maps");
 
+struct offcpu_data {
+	u64 array[MAX_OFFCPU_LEN];
+};
+
+struct stack_data {
+	u64 array[MAX_STACKS];
+};
+
 struct {
 	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
 	__uint(key_size, sizeof(__u32));
@@ -48,6 +58,22 @@ struct {
 	__uint(max_entries, MAX_CPUS);
 } offcpu_output SEC(".maps");
 
+/* temporary offcpu sample */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct offcpu_data));
+	__uint(max_entries, 1);
+} offcpu_payload SEC(".maps");
+
+/* cached stack per task storage */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct stack_data);
+} stack_cache SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
@@ -194,12 +220,75 @@ static inline int can_record(struct task_struct *t, int state)
 	return 1;
 }
 
+static inline bool check_bounds(int index)
+{
+	if (index < 0 || index >= MAX_OFFCPU_LEN)
+		return false;
+
+	return true;
+}
+
+static inline int copy_stack(struct stack_data *from,
+			     struct offcpu_data *to, int n)
+{
+	int max_stacks = MAX_STACKS, len = 0;
+
+	if (!from)
+		return len;
+
+	for (int i = 0; i < max_stacks && from->array[i]; ++i) {
+		if (check_bounds(n + 2 + i)) {
+			to->array[n + 2 + i] = from->array[i];
+			++len;
+		}
+	}
+	return len;
+}
+
+static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
+			struct stack_data *stack_p, __u64 delta, __u64 timestamp)
+{
+	int size, n = 0, ip_pos = -1, len = 0;
+
+	if (sample_type & PERF_SAMPLE_TID && check_bounds(n))
+		data->array[n++] = (u64)key->tgid << 32 | key->pid;
+	if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
+		data->array[n++] = delta;
+	if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
+		/* data->array[n] is callchain->nr (updated later) */
+		data->array[n + 1] = PERF_CONTEXT_USER;
+		data->array[n + 2] = 0;
+
+		len = copy_stack(stack_p, data, n);
+
+		/* update length of callchain */
+		data->array[n] = len + 1;
+
+		/* update sample ip with the first callchain entry */
+		if (ip_pos >= 0)
+			data->array[ip_pos] = data->array[n + 2];
+
+		/* calculate sample callchain data->array length */
+		n += len + 2;
+	}
+	if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
+		data->array[n++] = key->cgroup_id;
+
+	size = n * sizeof(u64);
+	if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
+		bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, size);
+
+	return 0;
+}
+
 static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 			struct task_struct *next, int state)
 {
 	__u64 ts;
 	__u32 stack_id;
 	struct tstamp_data *pelem;
+	struct stack_data *stack_p;
+	int zero = 0;
 
 	ts = bpf_ktime_get_ns();
 
@@ -209,6 +298,21 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 	stack_id = bpf_get_stackid(ctx, &stacks,
 				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
 
+	/*
+	 * if stacks are successfully collected, cache them to task_storage, they are then
+	 * dumped if the off-cpu time hits the threshold.
+	 */
+	if (stack_id > 0) {
+		stack_p = bpf_task_storage_get(&stack_cache, prev, NULL,
+					       BPF_LOCAL_STORAGE_GET_F_CREATE);
+		if (stack_p) {
+			/* to pass the clang result unused warning */
+			int __attribute__((unused)) len;
+			len = bpf_get_stack(ctx, stack_p->array, MAX_STACKS * sizeof(u64),
+					    BPF_F_USER_STACK) / sizeof(u64);
+		}
+	}
+
 	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
 				     BPF_LOCAL_STORAGE_GET_F_CREATE);
 	if (!pelem)
@@ -238,6 +342,14 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 		else
 			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
 
+		if (delta >= offcpu_thresh) {
+			struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
+
+			stack_p = bpf_task_storage_get(&stack_cache, next, NULL, 0);
+			if (data && stack_p)
+				off_cpu_dump(ctx, data, &key, stack_p, delta, pelem->timestamp);
+		}
+
 		/* prevent to reuse the timestamp later */
 		pelem->timestamp = 0;
 	}
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 357231cb1c38..eaf7be92472d 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -15,9 +15,15 @@ struct record_opts;
 #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
 			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
 			      PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
-			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
+			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
 			      PERF_SAMPLE_CGROUP)
 
+/*
+ * for embedded data to overwrite the original sample, duplicated sample types
+ * must be set in the original OFFCPU_SAMPLE_TYPES, except for callchain.
+ */
+#define OFFCPU_EMBEDDED_SAMPLE_TYPES  (PERF_SAMPLE_TID | PERF_SAMPLE_PERIOD | \
+				       PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_CGROUP)
 
 #ifdef HAVE_BPF_SKEL
 int off_cpu_prepare(struct evlist *evlist, struct target *target,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF
  2024-09-27 20:27 ` [PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
@ 2024-09-30  6:23   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2024-09-30  6:23 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel

On Fri, Sep 27, 2024 at 01:27:32PM -0700, Howard Chu wrote:
> Add perf_event_array map for dumping direct off-cpu samples, but keep
> the at-the-end approach.
> 
> Tons of checking before access, to pass the BPF verifier.
> 
> If off-cpu time (represented as delta) exceeds the off-cpu threshold, do
> output.
> 
> Output PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD, PERF_SAMPLE_CALLCHAIN, and
> PERF_SAMPLE_CGROUP in bpf_perf_event_output().
> 
> Ideally, we should only output
> PERF_SAMPLE_PERIOD (off-cpu time) and PERF_SAMPLE_CALLCHAIN (sched_in
> process's callchain). One only needs to set PERF_SAMPLE_TID and
> PERF_SAMPLE_CGROUP, and perf_event will do everything for us.
> 
> But in reality, that's not the case. Setting PERF_SAMPLE_TID will mostly
> give us TID of 0. We might get the correct TID for offcpu-time event
> from time to time, but it is really rare.
>          swapper       0 [000] offcpu-time:  /
>         :1321819 1321819 [002] offcpu-time:  /user.slice/user-1000.slice/session-2.scope
>          swapper       0 [001] offcpu-time:  /
>          swapper       0 [003] offcpu-time:  /
> 
> And setting PERF_SAMPLE_CGROUP doesn't work properly either.
>     tmux: server    3701 [003] offcpu-time:  /
>     blueman-tray    1064 [001] offcpu-time:  /
>             bash 1350867 [001] offcpu-time:  /
>             bash 1350844 [000] offcpu-time:  /
> 
> We need to retrieve PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD,
> PERF_SAMPLE_CALLCHAIN, and PERF_SAMPLE_CGROUP using BPF and output these
> four fields.
> 
> Suggested-by: Namhyung Kim <namhyung@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 112 +++++++++++++++++++++++++
>  tools/perf/util/off_cpu.h              |   8 +-
>  2 files changed, 119 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index e2a887228fd9..c42d0e2d91d8 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -19,6 +19,7 @@
>  #define MAX_ENTRIES  102400
>  
>  #define MAX_CPUS  4096
> +#define MAX_OFFCPU_LEN 128

I guess it's too big as you only collect stack and a few more data.

>  
>  struct tstamp_data {
>  	__u32 stack_id;
> @@ -34,6 +35,7 @@ struct offcpu_key {
>  	__u64 cgroup_id;
>  };
>  
> +/* for dumping at the end */
>  struct {
>  	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
>  	__uint(key_size, sizeof(__u32));
> @@ -41,6 +43,14 @@ struct {
>  	__uint(max_entries, MAX_ENTRIES);
>  } stacks SEC(".maps");
>  
> +struct offcpu_data {
> +	u64 array[MAX_OFFCPU_LEN];
> +};
> +
> +struct stack_data {
> +	u64 array[MAX_STACKS];
> +};
> +
>  struct {
>  	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
>  	__uint(key_size, sizeof(__u32));
> @@ -48,6 +58,22 @@ struct {
>  	__uint(max_entries, MAX_CPUS);
>  } offcpu_output SEC(".maps");
>  
> +/* temporary offcpu sample */
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(struct offcpu_data));
> +	__uint(max_entries, 1);
> +} offcpu_payload SEC(".maps");
> +
> +/* cached stack per task storage */
> +struct {
> +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, struct stack_data);
> +} stack_cache SEC(".maps");
> +
>  struct {
>  	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
>  	__uint(map_flags, BPF_F_NO_PREALLOC);
> @@ -194,12 +220,75 @@ static inline int can_record(struct task_struct *t, int state)
>  	return 1;
>  }
>  
> +static inline bool check_bounds(int index)
> +{
> +	if (index < 0 || index >= MAX_OFFCPU_LEN)
> +		return false;
> +
> +	return true;
> +}

Likewise, we may get rid of the bound check entirely as it's always
guaranteed.

> +
> +static inline int copy_stack(struct stack_data *from,
> +			     struct offcpu_data *to, int n)
> +{
> +	int max_stacks = MAX_STACKS, len = 0;
> +
> +	if (!from)
> +		return len;
> +
> +	for (int i = 0; i < max_stacks && from->array[i]; ++i) {
> +		if (check_bounds(n + 2 + i)) {
> +			to->array[n + 2 + i] = from->array[i];
> +			++len;
> +		}
> +	}
> +	return len;
> +}
> +
> +static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
> +			struct stack_data *stack_p, __u64 delta, __u64 timestamp)
> +{
> +	int size, n = 0, ip_pos = -1, len = 0;
> +
> +	if (sample_type & PERF_SAMPLE_TID && check_bounds(n))

You didn't set the sample_type yet.  Is it intentional?


> +		data->array[n++] = (u64)key->tgid << 32 | key->pid;
> +	if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
> +		data->array[n++] = delta;
> +	if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
> +		/* data->array[n] is callchain->nr (updated later) */
> +		data->array[n + 1] = PERF_CONTEXT_USER;
> +		data->array[n + 2] = 0;
> +
> +		len = copy_stack(stack_p, data, n);
> +
> +		/* update length of callchain */
> +		data->array[n] = len + 1;
> +
> +		/* update sample ip with the first callchain entry */
> +		if (ip_pos >= 0)

You don't set the ip_pos, just remove this part and let userspace handle
it later.


> +			data->array[ip_pos] = data->array[n + 2];
> +
> +		/* calculate sample callchain data->array length */
> +		n += len + 2;
> +	}
> +	if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
> +		data->array[n++] = key->cgroup_id;
> +
> +	size = n * sizeof(u64);
> +	if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
> +		bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, size);
> +
> +	return 0;
> +}
> +
>  static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
>  			struct task_struct *next, int state)
>  {
>  	__u64 ts;
>  	__u32 stack_id;
>  	struct tstamp_data *pelem;
> +	struct stack_data *stack_p;
> +	int zero = 0;
>  
>  	ts = bpf_ktime_get_ns();
>  
> @@ -209,6 +298,21 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
>  	stack_id = bpf_get_stackid(ctx, &stacks,
>  				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
>  
> +	/*
> +	 * if stacks are successfully collected, cache them to task_storage, they are then
> +	 * dumped if the off-cpu time hits the threshold.
> +	 */
> +	if (stack_id > 0) {
> +		stack_p = bpf_task_storage_get(&stack_cache, prev, NULL,
> +					       BPF_LOCAL_STORAGE_GET_F_CREATE);

Can you just add a new field in the tstamp map instead?


> +		if (stack_p) {
> +			/* to pass the clang result unused warning */
> +			int __attribute__((unused)) len;
> +			len = bpf_get_stack(ctx, stack_p->array, MAX_STACKS * sizeof(u64),
> +					    BPF_F_USER_STACK) / sizeof(u64);

I think you can move to the next task if it failed to get the stack
trace.


> +		}
> +	}
> +
>  	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
>  				     BPF_LOCAL_STORAGE_GET_F_CREATE);
>  	if (!pelem)
> @@ -238,6 +342,14 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
>  		else
>  			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
>  
> +		if (delta >= offcpu_thresh) {
> +			struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
> +
> +			stack_p = bpf_task_storage_get(&stack_cache, next, NULL, 0);
> +			if (data && stack_p)
> +				off_cpu_dump(ctx, data, &key, stack_p, delta, pelem->timestamp);
> +		}

I think you should either dump the data directly or save it in the
off_cpu map.  Otherwise the time is accounted twice.

Thanks,
Namhyung

> +
>  		/* prevent to reuse the timestamp later */
>  		pelem->timestamp = 0;
>  	}
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> index 357231cb1c38..eaf7be92472d 100644
> --- a/tools/perf/util/off_cpu.h
> +++ b/tools/perf/util/off_cpu.h
> @@ -15,9 +15,15 @@ struct record_opts;
>  #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
>  			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
>  			      PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
> -			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
> +			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
>  			      PERF_SAMPLE_CGROUP)
>  
> +/*
> + * for embedded data to overwrite the original sample, duplicated sample types
> + * must be set in the original OFFCPU_SAMPLE_TYPES, except for callchain.
> + */
> +#define OFFCPU_EMBEDDED_SAMPLE_TYPES  (PERF_SAMPLE_TID | PERF_SAMPLE_PERIOD | \
> +				       PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_CGROUP)
>  
>  #ifdef HAVE_BPF_SKEL
>  int off_cpu_prepare(struct evlist *evlist, struct target *target,
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end.
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (3 preceding siblings ...)
  2024-09-27 20:27 ` [PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-30  6:39   ` Namhyung Kim
  2024-09-27 20:27 ` [PATCH v6 6/8] perf evsel: Delete unnecessary = 0 Howard Chu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

By setting a placeholder sample_type and then writing real data into
raw_data, we mimic the direct sample method to write data at the end.

Note that some data serve only as placeholders and will be overwritten
by the data in raw_data. Additionally, since the IP will be updated in
evsel__parse_sample(), there is no need to handle it in off_cpu_write().

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/bpf_off_cpu.c | 116 +++++++++++++++++++++-------------
 1 file changed, 72 insertions(+), 44 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index f7233a09ec77..2a1cfd7e0b09 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -138,12 +138,19 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 	int ncpus = 1, ntasks = 1, ncgrps = 1;
 	struct strlist *pid_slist = NULL;
 	struct str_node *pos;
+	struct evsel *evsel;
 
 	if (off_cpu_config(evlist) < 0) {
 		pr_err("Failed to config off-cpu BPF event\n");
 		return -1;
 	}
 
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT);
+		return -1 ;
+	}
+
 	skel = off_cpu_bpf__open();
 	if (!skel) {
 		pr_err("Failed to open off-cpu BPF skeleton\n");
@@ -259,7 +266,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 	}
 
 	if (evlist__first(evlist)->cgrp) {
-		struct evsel *evsel;
 		u8 val = 1;
 
 		fd = bpf_map__fd(skel->maps.cgroup_filter);
@@ -280,6 +286,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		}
 	}
 
+	skel->bss->sample_type   = OFFCPU_EMBEDDED_SAMPLE_TYPES;
 	skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000;
 
 	err = off_cpu_bpf__attach(skel);
@@ -305,7 +312,8 @@ int off_cpu_write(struct perf_session *session)
 {
 	int bytes = 0, size;
 	int fd, stack;
-	u64 sample_type, val, sid = 0;
+	u32 raw_size;
+	u64 sample_type_off_cpu, sample_type_bpf_output, val, sid = 0, tstamp = OFF_CPU_TIMESTAMP;
 	struct evsel *evsel;
 	struct perf_data_file *file = &session->data->file;
 	struct off_cpu_key prev, key;
@@ -315,7 +323,6 @@ int off_cpu_write(struct perf_session *session)
 			.misc = PERF_RECORD_MISC_USER,
 		},
 	};
-	u64 tstamp = OFF_CPU_TIMESTAMP;
 
 	skel->bss->enabled = 0;
 
@@ -325,15 +332,10 @@ int off_cpu_write(struct perf_session *session)
 		return 0;
 	}
 
-	sample_type = evsel->core.attr.sample_type;
-
-	if (sample_type & ~OFFCPU_SAMPLE_TYPES) {
-		pr_err("not supported sample type: %llx\n",
-		       (unsigned long long)sample_type);
-		return -1;
-	}
+	sample_type_off_cpu    = OFFCPU_EMBEDDED_SAMPLE_TYPES;
+	sample_type_bpf_output = evsel->core.attr.sample_type;
 
-	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
+	if (sample_type_bpf_output & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
 		if (evsel->core.id)
 			sid = evsel->core.id[0];
 	}
@@ -344,49 +346,75 @@ int off_cpu_write(struct perf_session *session)
 
 	while (!bpf_map_get_next_key(fd, &prev, &key)) {
 		int n = 1;  /* start from perf_event_header */
-		int ip_pos = -1;
+		int i = 0; /* raw data index */
 
 		bpf_map_lookup_elem(fd, &key, &val);
 
-		if (sample_type & PERF_SAMPLE_IDENTIFIER)
+		/*
+		 * Zero-fill some of these fields first, they will be overwritten by the dummy
+		 * embedded data (in raw_data) below, when parsing the samples. And because embedded
+		 * data is in BPF output, perf script -F without bpf-output field will not work
+		 * properly.
+		 */
+		if (sample_type_bpf_output & PERF_SAMPLE_IDENTIFIER)
 			data.array[n++] = sid;
-		if (sample_type & PERF_SAMPLE_IP) {
-			ip_pos = n;
-			data.array[n++] = 0;  /* will be updated */
-		}
-		if (sample_type & PERF_SAMPLE_TID)
-			data.array[n++] = (u64)key.pid << 32 | key.tgid;
-		if (sample_type & PERF_SAMPLE_TIME)
-			data.array[n++] = tstamp;
-		if (sample_type & PERF_SAMPLE_ID)
-			data.array[n++] = sid;
-		if (sample_type & PERF_SAMPLE_CPU)
+		if (sample_type_bpf_output & PERF_SAMPLE_IP)
 			data.array[n++] = 0;
-		if (sample_type & PERF_SAMPLE_PERIOD)
-			data.array[n++] = val;
-		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
-			int len = 0;
-
-			/* data.array[n] is callchain->nr (updated later) */
-			data.array[n + 1] = PERF_CONTEXT_USER;
-			data.array[n + 2] = 0;
+		if (sample_type_bpf_output & PERF_SAMPLE_TID)
+			data.array[n++] = 0;
+		if (sample_type_bpf_output & PERF_SAMPLE_TIME)
+			data.array[n++] = tstamp; /* we won't overwrite time */
+		if (sample_type_bpf_output & PERF_SAMPLE_CPU)
+			data.array[n++] = 0;
+		if (sample_type_bpf_output & PERF_SAMPLE_PERIOD)
+			data.array[n++] = 0;
+		if (sample_type_bpf_output & PERF_SAMPLE_RAW) {
+			/*
+			 * the format of raw data is as follows:
+			 *
+			 *  [ size ][ data ]
+			 *  [     data     ]
+			 *  [     data     ]
+			 *  [     data     ]
+			 *  [ data ][ empty]
+			 *
+			 */
+			if (sample_type_off_cpu & PERF_SAMPLE_TID)
+				off_cpu_raw_data[i++] = (u64)key.pid << 32 | key.tgid;
+			if (sample_type_off_cpu & PERF_SAMPLE_PERIOD)
+				off_cpu_raw_data[i++] = val;
+			if (sample_type_off_cpu & PERF_SAMPLE_CALLCHAIN) {
+				int len = 0;
+
+				/* off_cpu_raw_data[n] is callchain->nr (updated later) */
+				off_cpu_raw_data[i + 1] = PERF_CONTEXT_USER;
+				off_cpu_raw_data[i + 2] = 0;
+
+				bpf_map_lookup_elem(stack, &key.stack_id, &off_cpu_raw_data[i + 2]);
+				while (off_cpu_raw_data[i + 2 + len])
+					len++;
+
+				/* update length of callchain */
+				off_cpu_raw_data[i] = len + 1;
+
+				/* calculate sample callchain off_cpu_raw_data length */
+				i += len + 2;
+			}
+			if (sample_type_off_cpu & PERF_SAMPLE_CGROUP)
+				off_cpu_raw_data[i++] = key.cgroup_id;
 
-			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
-			while (data.array[n + 2 + len])
-				len++;
+			raw_size = i * sizeof(u64) + sizeof(u32); /* 4 empty bytes for alignment */
 
-			/* update length of callchain */
-			data.array[n] = len + 1;
+			/* raw_size */
+			memcpy((void *)data.array + n * sizeof(u64), &raw_size, sizeof(raw_size));
 
-			/* update sample ip with the first callchain entry */
-			if (ip_pos >= 0)
-				data.array[ip_pos] = data.array[n + 2];
+			/* raw_data */
+			memcpy((void *)data.array + n * sizeof(u64) + sizeof(u32), off_cpu_raw_data, i * sizeof(u64));
 
-			/* calculate sample callchain data array length */
-			n += len + 2;
+			n += i + 1;
 		}
-		if (sample_type & PERF_SAMPLE_CGROUP)
-			data.array[n++] = key.cgroup_id;
+		if (sample_type_bpf_output & PERF_SAMPLE_CGROUP)
+			data.array[n++] = 0;
 
 		size = n * sizeof(u64);
 		data.hdr.size = size;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end.
  2024-09-27 20:27 ` [PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
@ 2024-09-30  6:39   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2024-09-30  6:39 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel

On Fri, Sep 27, 2024 at 01:27:33PM -0700, Howard Chu wrote:
> By setting a placeholder sample_type and then writing real data into
> raw_data, we mimic the direct sample method to write data at the end.
> 
> Note that some data serve only as placeholders and will be overwritten
> by the data in raw_data. Additionally, since the IP will be updated in
> evsel__parse_sample(), there is no need to handle it in off_cpu_write().
> 
> Suggested-by: Namhyung Kim <namhyung@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/util/bpf_off_cpu.c | 116 +++++++++++++++++++++-------------
>  1 file changed, 72 insertions(+), 44 deletions(-)
> 
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index f7233a09ec77..2a1cfd7e0b09 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -138,12 +138,19 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>  	int ncpus = 1, ntasks = 1, ncgrps = 1;
>  	struct strlist *pid_slist = NULL;
>  	struct str_node *pos;
> +	struct evsel *evsel;
>  
>  	if (off_cpu_config(evlist) < 0) {
>  		pr_err("Failed to config off-cpu BPF event\n");
>  		return -1;
>  	}
>  
> +	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
> +	if (evsel == NULL) {
> +		pr_err("%s evsel not found\n", OFFCPU_EVENT);
> +		return -1 ;
> +	}
> +
>  	skel = off_cpu_bpf__open();
>  	if (!skel) {
>  		pr_err("Failed to open off-cpu BPF skeleton\n");
> @@ -259,7 +266,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>  	}
>  
>  	if (evlist__first(evlist)->cgrp) {
> -		struct evsel *evsel;
>  		u8 val = 1;
>  
>  		fd = bpf_map__fd(skel->maps.cgroup_filter);
> @@ -280,6 +286,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>  		}
>  	}
>  
> +	skel->bss->sample_type   = OFFCPU_EMBEDDED_SAMPLE_TYPES;

Hmm.. you set the sample_type unconditionally which means the embedded
data in the raw area has the fixed format.  Then I think you don't need
the sample_type variable at all.


>  	skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000;
>  
>  	err = off_cpu_bpf__attach(skel);
> @@ -305,7 +312,8 @@ int off_cpu_write(struct perf_session *session)
>  {
>  	int bytes = 0, size;
>  	int fd, stack;
> -	u64 sample_type, val, sid = 0;
> +	u32 raw_size;
> +	u64 sample_type_off_cpu, sample_type_bpf_output, val, sid = 0, tstamp = OFF_CPU_TIMESTAMP;
>  	struct evsel *evsel;
>  	struct perf_data_file *file = &session->data->file;
>  	struct off_cpu_key prev, key;
> @@ -315,7 +323,6 @@ int off_cpu_write(struct perf_session *session)
>  			.misc = PERF_RECORD_MISC_USER,
>  		},
>  	};
> -	u64 tstamp = OFF_CPU_TIMESTAMP;
>  
>  	skel->bss->enabled = 0;
>  
> @@ -325,15 +332,10 @@ int off_cpu_write(struct perf_session *session)
>  		return 0;
>  	}
>  
> -	sample_type = evsel->core.attr.sample_type;
> -
> -	if (sample_type & ~OFFCPU_SAMPLE_TYPES) {
> -		pr_err("not supported sample type: %llx\n",
> -		       (unsigned long long)sample_type);
> -		return -1;
> -	}
> +	sample_type_off_cpu    = OFFCPU_EMBEDDED_SAMPLE_TYPES;
> +	sample_type_bpf_output = evsel->core.attr.sample_type;
>  
> -	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
> +	if (sample_type_bpf_output & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
>  		if (evsel->core.id)
>  			sid = evsel->core.id[0];
>  	}
> @@ -344,49 +346,75 @@ int off_cpu_write(struct perf_session *session)
>  
>  	while (!bpf_map_get_next_key(fd, &prev, &key)) {
>  		int n = 1;  /* start from perf_event_header */
> -		int ip_pos = -1;
> +		int i = 0; /* raw data index */
>  
>  		bpf_map_lookup_elem(fd, &key, &val);
>  
> -		if (sample_type & PERF_SAMPLE_IDENTIFIER)
> +		/*
> +		 * Zero-fill some of these fields first, they will be overwritten by the dummy
> +		 * embedded data (in raw_data) below, when parsing the samples. And because embedded
> +		 * data is in BPF output, perf script -F without bpf-output field will not work
> +		 * properly.
> +		 */
> +		if (sample_type_bpf_output & PERF_SAMPLE_IDENTIFIER)
>  			data.array[n++] = sid;
> -		if (sample_type & PERF_SAMPLE_IP) {
> -			ip_pos = n;
> -			data.array[n++] = 0;  /* will be updated */
> -		}
> -		if (sample_type & PERF_SAMPLE_TID)
> -			data.array[n++] = (u64)key.pid << 32 | key.tgid;
> -		if (sample_type & PERF_SAMPLE_TIME)
> -			data.array[n++] = tstamp;
> -		if (sample_type & PERF_SAMPLE_ID)
> -			data.array[n++] = sid;
> -		if (sample_type & PERF_SAMPLE_CPU)
> +		if (sample_type_bpf_output & PERF_SAMPLE_IP)
>  			data.array[n++] = 0;
> -		if (sample_type & PERF_SAMPLE_PERIOD)
> -			data.array[n++] = val;
> -		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> -			int len = 0;
> -
> -			/* data.array[n] is callchain->nr (updated later) */
> -			data.array[n + 1] = PERF_CONTEXT_USER;
> -			data.array[n + 2] = 0;
> +		if (sample_type_bpf_output & PERF_SAMPLE_TID)
> +			data.array[n++] = 0;

I'm not sure if it works correctly.  You haven't set the callchain
length and 'n' is not updated.

> +		if (sample_type_bpf_output & PERF_SAMPLE_TIME)
> +			data.array[n++] = tstamp; /* we won't overwrite time */
> +		if (sample_type_bpf_output & PERF_SAMPLE_CPU)
> +			data.array[n++] = 0;
> +		if (sample_type_bpf_output & PERF_SAMPLE_PERIOD)
> +			data.array[n++] = 0;
> +		if (sample_type_bpf_output & PERF_SAMPLE_RAW) {
> +			/*
> +			 * the format of raw data is as follows:
> +			 *
> +			 *  [ size ][ data ]
> +			 *  [     data     ]
> +			 *  [     data     ]
> +			 *  [     data     ]
> +			 *  [ data ][ empty]
> +			 *
> +			 */
> +			if (sample_type_off_cpu & PERF_SAMPLE_TID)

As I said, you can get rid of the sample_type_off_cpu check if it's
always true.

Thanks,
Namhyung


> +				off_cpu_raw_data[i++] = (u64)key.pid << 32 | key.tgid;
> +			if (sample_type_off_cpu & PERF_SAMPLE_PERIOD)
> +				off_cpu_raw_data[i++] = val;
> +			if (sample_type_off_cpu & PERF_SAMPLE_CALLCHAIN) {
> +				int len = 0;
> +
> +				/* off_cpu_raw_data[n] is callchain->nr (updated later) */
> +				off_cpu_raw_data[i + 1] = PERF_CONTEXT_USER;
> +				off_cpu_raw_data[i + 2] = 0;
> +
> +				bpf_map_lookup_elem(stack, &key.stack_id, &off_cpu_raw_data[i + 2]);
> +				while (off_cpu_raw_data[i + 2 + len])
> +					len++;
> +
> +				/* update length of callchain */
> +				off_cpu_raw_data[i] = len + 1;
> +
> +				/* calculate sample callchain off_cpu_raw_data length */
> +				i += len + 2;
> +			}
> +			if (sample_type_off_cpu & PERF_SAMPLE_CGROUP)
> +				off_cpu_raw_data[i++] = key.cgroup_id;
>  
> -			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
> -			while (data.array[n + 2 + len])
> -				len++;
> +			raw_size = i * sizeof(u64) + sizeof(u32); /* 4 empty bytes for alignment */
>  
> -			/* update length of callchain */
> -			data.array[n] = len + 1;
> +			/* raw_size */
> +			memcpy((void *)data.array + n * sizeof(u64), &raw_size, sizeof(raw_size));
>  
> -			/* update sample ip with the first callchain entry */
> -			if (ip_pos >= 0)
> -				data.array[ip_pos] = data.array[n + 2];
> +			/* raw_data */
> +			memcpy((void *)data.array + n * sizeof(u64) + sizeof(u32), off_cpu_raw_data, i * sizeof(u64));
>  
> -			/* calculate sample callchain data array length */
> -			n += len + 2;
> +			n += i + 1;
>  		}
> -		if (sample_type & PERF_SAMPLE_CGROUP)
> -			data.array[n++] = key.cgroup_id;
> +		if (sample_type_bpf_output & PERF_SAMPLE_CGROUP)
> +			data.array[n++] = 0;
>  
>  		size = n * sizeof(u64);
>  		data.hdr.size = size;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 6/8] perf evsel: Delete unnecessary = 0
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (4 preceding siblings ...)
  2024-09-27 20:27 ` [PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-27 20:27 ` [PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data Howard Chu
  2024-09-27 20:27 ` [PATCH v6 8/8] perf test: Add direct off-cpu dumping test Howard Chu
  7 siblings, 0 replies; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Delete unnecessary zero-initializations because they are already set to
zero at the top of evsel__parse_sample(). If we don't remove them, it
becomes troublesome to overwrite the sample using data from raw_data.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/evsel.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 500ca62669cb..32196e4f0637 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2965,7 +2965,6 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		array++;
 	}
 
-	data->intr_regs.abi = PERF_SAMPLE_REGS_ABI_NONE;
 	if (type & PERF_SAMPLE_REGS_INTR) {
 		OVERFLOW_CHECK_u64(array);
 		data->intr_regs.abi = *array;
@@ -2982,25 +2981,21 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		}
 	}
 
-	data->phys_addr = 0;
 	if (type & PERF_SAMPLE_PHYS_ADDR) {
 		data->phys_addr = *array;
 		array++;
 	}
 
-	data->cgroup = 0;
 	if (type & PERF_SAMPLE_CGROUP) {
 		data->cgroup = *array;
 		array++;
 	}
 
-	data->data_page_size = 0;
 	if (type & PERF_SAMPLE_DATA_PAGE_SIZE) {
 		data->data_page_size = *array;
 		array++;
 	}
 
-	data->code_page_size = 0;
 	if (type & PERF_SAMPLE_CODE_PAGE_SIZE) {
 		data->code_page_size = *array;
 		array++;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (5 preceding siblings ...)
  2024-09-27 20:27 ` [PATCH v6 6/8] perf evsel: Delete unnecessary = 0 Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  2024-09-30  6:51   ` Namhyung Kim
  2024-09-27 20:27 ` [PATCH v6 8/8] perf test: Add direct off-cpu dumping test Howard Chu
  7 siblings, 1 reply; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Move evsel__is_offcpu_event() to evsel.h

Add a sample_type_embed member to the struct evsel, along with a couple
of helper functions.

In session.c, we parse BPF output embedded samples in a two-step
process.

Initial Parsing: Treat the sample as a regular BPF-output event.

Secondary Parsing: Extract data from raw_data and parse it according to
the sample_type_embed specification. Since the second step relies on the
raw_data obtained in the first step, we must avoid zero-initializing the
sample data after the first step.

Suggested-by: Ian Rogers <irogers@google.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/builtin-script.c |  4 ++--
 tools/perf/util/evsel.c     | 39 +++++++++++++++++++++++--------------
 tools/perf/util/evsel.h     |  6 ++++++
 tools/perf/util/session.c   | 12 +++++++++++-
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index a644787fa9e1..9719ffae45d5 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -662,7 +662,7 @@ static int perf_session__check_output_opt(struct perf_session *session)
 
 		evlist__for_each_entry(session->evlist, evsel) {
 			not_pipe = true;
-			if (evsel__has_callchain(evsel)) {
+			if (evsel__has_callchain(evsel) || evsel__is_offcpu_event(evsel)) {
 				use_callchain = true;
 				break;
 			}
@@ -2352,7 +2352,7 @@ static void process_event(struct perf_script *script,
 	else if (PRINT_FIELD(BRSTACKOFF))
 		perf_sample__fprintf_brstackoff(sample, thread, attr, fp);
 
-	if (evsel__is_bpf_output(evsel) && PRINT_FIELD(BPF_OUTPUT))
+	if (evsel__is_bpf_output(evsel) && !evsel__is_offcpu_event(evsel) && PRINT_FIELD(BPF_OUTPUT))
 		perf_sample__fprintf_bpf_output(sample, fp);
 	perf_sample__fprintf_insn(sample, evsel, attr, thread, machine, fp, al);
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 32196e4f0637..4199a1e409f7 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1105,11 +1105,6 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
 	}
 }
 
-static bool evsel__is_offcpu_event(struct evsel *evsel)
-{
-	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
-}
-
 /*
  * The enable_on_exec/disabled value strategy:
  *
@@ -2677,6 +2672,7 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 	u16 max_size = event->header.size;
 	const void *endp = (void *)event + max_size;
 	u64 sz;
+	bool ip_in_callchain = false;
 
 	/*
 	 * used for cross-endian analysis. See git commit 65014ab3
@@ -2684,14 +2680,25 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 	 */
 	union u64_swap u;
 
-	memset(data, 0, sizeof(*data));
-	data->cpu = data->pid = data->tid = -1;
-	data->stream_id = data->id = data->time = -1ULL;
-	data->period = evsel->core.attr.sample_period;
-	data->cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
-	data->misc    = event->header.misc;
-	data->data_src = PERF_MEM_DATA_SRC_NONE;
-	data->vcpu = -1;
+	/*
+	 * For sample data embedded in BPF output, don't clear the sample we read in the first pass,
+	 * and read the embedded data from raw_data in the second pass.
+	 */
+	if (evsel__is_offcpu_event(evsel) && data->raw_data) {
+		type = OFFCPU_EMBEDDED_SAMPLE_TYPES;
+		array = data->raw_data;
+		ip_in_callchain = true;
+	} else { /* for normal samples, clear to zero before reading */
+		array = event->sample.array;
+		memset(data, 0, sizeof(*data));
+		data->cpu = data->pid = data->tid = -1;
+		data->stream_id = data->id = data->time = -1ULL;
+		data->period = evsel->core.attr.sample_period;
+		data->cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+		data->misc    = event->header.misc;
+		data->data_src = PERF_MEM_DATA_SRC_NONE;
+		data->vcpu = -1;
+	}
 
 	if (event->header.type != PERF_RECORD_SAMPLE) {
 		if (!evsel->core.attr.sample_id_all)
@@ -2699,8 +2706,6 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		return perf_evsel__parse_id_sample(evsel, event, data);
 	}
 
-	array = event->sample.array;
-
 	if (perf_event__check_size(event, evsel->sample_size))
 		return -EFAULT;
 
@@ -2822,6 +2827,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		data->callchain = (struct ip_callchain *)array++;
 		if (data->callchain->nr > max_callchain_nr)
 			return -EFAULT;
+
+		if (ip_in_callchain && data->callchain->nr > 1)
+			data->ip = data->callchain->ips[1];
+
 		sz = data->callchain->nr * sizeof(u64);
 		OVERFLOW_CHECK(array, sz, max_size);
 		array = (void *)array + sz;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 3e751ea769ac..6fbf5d4219d1 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -11,6 +11,7 @@
 #include <perf/evsel.h>
 #include "symbol_conf.h"
 #include "pmus.h"
+#include "off_cpu.h"
 
 struct bpf_object;
 struct cgroup;
@@ -580,4 +581,9 @@ u64 evsel__bitfield_swap_branch_flags(u64 value);
 void evsel__set_config_if_unset(struct perf_pmu *pmu, struct evsel *evsel,
 				const char *config_name, u64 val);
 
+static inline bool evsel__is_offcpu_event(struct evsel *evsel)
+{
+	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
+}
+
 #endif /* __PERF_EVSEL_H */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index dbaf07bf6c5f..d481bc466131 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1229,6 +1229,16 @@ static int evlist__deliver_sample(struct evlist *evlist, const struct perf_tool
 	u64 sample_type = evsel->core.attr.sample_type;
 	u64 read_format = evsel->core.attr.read_format;
 
+	/* parse sample the second time to get embedded data from raw_data */
+	if (evsel__is_offcpu_event(evsel) && sample->raw_data) {
+		int err = evsel__parse_sample(evsel, event, sample);
+
+		if (err) {
+			pr_err("Failed to parse BPF ouput embedded data, err = %d\n", err);
+			return err;
+		}
+	}
+
 	/* Standard sample delivery. */
 	if (!(sample_type & PERF_SAMPLE_READ))
 		return tool->sample(tool, event, sample, evsel, machine);
@@ -1339,7 +1349,7 @@ static int perf_session__deliver_event(struct perf_session *session,
 				       u64 file_offset,
 				       const char *file_path)
 {
-	struct perf_sample sample;
+	struct perf_sample sample = { .raw_data = NULL }; /* avoid accidental read of embedded data */
 	int ret = evlist__parse_sample(session->evlist, event, &sample);
 
 	if (ret) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data
  2024-09-27 20:27 ` [PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data Howard Chu
@ 2024-09-30  6:51   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2024-09-30  6:51 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel

On Fri, Sep 27, 2024 at 01:27:35PM -0700, Howard Chu wrote:
> Move evsel__is_offcpu_event() to evsel.h
> 
> Add a sample_type_embed member to the struct evsel, along with a couple
> of helper functions.
> 
> In session.c, we parse BPF output embedded samples in a two-step
> process.
> 
> Initial Parsing: Treat the sample as a regular BPF-output event.
> 
> Secondary Parsing: Extract data from raw_data and parse it according to
> the sample_type_embed specification. Since the second step relies on the
> raw_data obtained in the first step, we must avoid zero-initializing the
> sample data after the first step.
> 
> Suggested-by: Ian Rogers <irogers@google.com>
> Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/builtin-script.c |  4 ++--
>  tools/perf/util/evsel.c     | 39 +++++++++++++++++++++++--------------
>  tools/perf/util/evsel.h     |  6 ++++++
>  tools/perf/util/session.c   | 12 +++++++++++-
>  4 files changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
> index a644787fa9e1..9719ffae45d5 100644
> --- a/tools/perf/builtin-script.c
> +++ b/tools/perf/builtin-script.c
> @@ -662,7 +662,7 @@ static int perf_session__check_output_opt(struct perf_session *session)
>  
>  		evlist__for_each_entry(session->evlist, evsel) {
>  			not_pipe = true;
> -			if (evsel__has_callchain(evsel)) {
> +			if (evsel__has_callchain(evsel) || evsel__is_offcpu_event(evsel)) {
>  				use_callchain = true;
>  				break;
>  			}
> @@ -2352,7 +2352,7 @@ static void process_event(struct perf_script *script,
>  	else if (PRINT_FIELD(BRSTACKOFF))
>  		perf_sample__fprintf_brstackoff(sample, thread, attr, fp);
>  
> -	if (evsel__is_bpf_output(evsel) && PRINT_FIELD(BPF_OUTPUT))
> +	if (evsel__is_bpf_output(evsel) && !evsel__is_offcpu_event(evsel) && PRINT_FIELD(BPF_OUTPUT))
>  		perf_sample__fprintf_bpf_output(sample, fp);
>  	perf_sample__fprintf_insn(sample, evsel, attr, thread, machine, fp, al);
>  
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 32196e4f0637..4199a1e409f7 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1105,11 +1105,6 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
>  	}
>  }
>  
> -static bool evsel__is_offcpu_event(struct evsel *evsel)
> -{
> -	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
> -}
> -
>  /*
>   * The enable_on_exec/disabled value strategy:
>   *
> @@ -2677,6 +2672,7 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>  	u16 max_size = event->header.size;
>  	const void *endp = (void *)event + max_size;
>  	u64 sz;
> +	bool ip_in_callchain = false;
>  
>  	/*
>  	 * used for cross-endian analysis. See git commit 65014ab3
> @@ -2684,14 +2680,25 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>  	 */
>  	union u64_swap u;
>  
> -	memset(data, 0, sizeof(*data));
> -	data->cpu = data->pid = data->tid = -1;
> -	data->stream_id = data->id = data->time = -1ULL;
> -	data->period = evsel->core.attr.sample_period;
> -	data->cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> -	data->misc    = event->header.misc;
> -	data->data_src = PERF_MEM_DATA_SRC_NONE;
> -	data->vcpu = -1;
> +	/*
> +	 * For sample data embedded in BPF output, don't clear the sample we read in the first pass,
> +	 * and read the embedded data from raw_data in the second pass.
> +	 */

I found this two-pass sample parsing is confusing.  I think you can just
use the existing parsing code and only update relevant fields when you
parse the raw data.  IIUC the only subtle part is CGROUP as it come
later than RAW in the sample format.  Maybe you can save it somewhere
and update at the end.

Thanks,
Namhyung


> +	if (evsel__is_offcpu_event(evsel) && data->raw_data) {
> +		type = OFFCPU_EMBEDDED_SAMPLE_TYPES;
> +		array = data->raw_data;
> +		ip_in_callchain = true;
> +	} else { /* for normal samples, clear to zero before reading */
> +		array = event->sample.array;
> +		memset(data, 0, sizeof(*data));
> +		data->cpu = data->pid = data->tid = -1;
> +		data->stream_id = data->id = data->time = -1ULL;
> +		data->period = evsel->core.attr.sample_period;
> +		data->cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> +		data->misc    = event->header.misc;
> +		data->data_src = PERF_MEM_DATA_SRC_NONE;
> +		data->vcpu = -1;
> +	}
>  
>  	if (event->header.type != PERF_RECORD_SAMPLE) {
>  		if (!evsel->core.attr.sample_id_all)
> @@ -2699,8 +2706,6 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>  		return perf_evsel__parse_id_sample(evsel, event, data);
>  	}
>  
> -	array = event->sample.array;
> -
>  	if (perf_event__check_size(event, evsel->sample_size))
>  		return -EFAULT;
>  
> @@ -2822,6 +2827,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>  		data->callchain = (struct ip_callchain *)array++;
>  		if (data->callchain->nr > max_callchain_nr)
>  			return -EFAULT;
> +
> +		if (ip_in_callchain && data->callchain->nr > 1)
> +			data->ip = data->callchain->ips[1];
> +
>  		sz = data->callchain->nr * sizeof(u64);
>  		OVERFLOW_CHECK(array, sz, max_size);
>  		array = (void *)array + sz;
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 3e751ea769ac..6fbf5d4219d1 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -11,6 +11,7 @@
>  #include <perf/evsel.h>
>  #include "symbol_conf.h"
>  #include "pmus.h"
> +#include "off_cpu.h"
>  
>  struct bpf_object;
>  struct cgroup;
> @@ -580,4 +581,9 @@ u64 evsel__bitfield_swap_branch_flags(u64 value);
>  void evsel__set_config_if_unset(struct perf_pmu *pmu, struct evsel *evsel,
>  				const char *config_name, u64 val);
>  
> +static inline bool evsel__is_offcpu_event(struct evsel *evsel)
> +{
> +	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
> +}
> +
>  #endif /* __PERF_EVSEL_H */
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index dbaf07bf6c5f..d481bc466131 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1229,6 +1229,16 @@ static int evlist__deliver_sample(struct evlist *evlist, const struct perf_tool
>  	u64 sample_type = evsel->core.attr.sample_type;
>  	u64 read_format = evsel->core.attr.read_format;
>  
> +	/* parse sample the second time to get embedded data from raw_data */
> +	if (evsel__is_offcpu_event(evsel) && sample->raw_data) {
> +		int err = evsel__parse_sample(evsel, event, sample);
> +
> +		if (err) {
> +			pr_err("Failed to parse BPF ouput embedded data, err = %d\n", err);
> +			return err;
> +		}
> +	}
> +
>  	/* Standard sample delivery. */
>  	if (!(sample_type & PERF_SAMPLE_READ))
>  		return tool->sample(tool, event, sample, evsel, machine);
> @@ -1339,7 +1349,7 @@ static int perf_session__deliver_event(struct perf_session *session,
>  				       u64 file_offset,
>  				       const char *file_path)
>  {
> -	struct perf_sample sample;
> +	struct perf_sample sample = { .raw_data = NULL }; /* avoid accidental read of embedded data */
>  	int ret = evlist__parse_sample(session->evlist, event, &sample);
>  
>  	if (ret) {
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 8/8] perf test: Add direct off-cpu dumping test
  2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (6 preceding siblings ...)
  2024-09-27 20:27 ` [PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data Howard Chu
@ 2024-09-27 20:27 ` Howard Chu
  7 siblings, 0 replies; 15+ messages in thread
From: Howard Chu @ 2024-09-27 20:27 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu

Add a simple workload(offcpu.c) to create the scenario for direct
off-cpu dumping.

Please run this test with 'perf test offcpu'

Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/tests/builtin-test.c         |  1 +
 tools/perf/tests/shell/record_offcpu.sh | 29 +++++++++++++++++++++++++
 tools/perf/tests/tests.h                |  1 +
 tools/perf/tests/workloads/Build        |  1 +
 tools/perf/tests/workloads/offcpu.c     | 16 ++++++++++++++
 5 files changed, 48 insertions(+)
 create mode 100644 tools/perf/tests/workloads/offcpu.c

diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 470a9709427d..aa33beaf58c8 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -153,6 +153,7 @@ static struct test_workload *workloads[] = {
 	&workload__brstack,
 	&workload__datasym,
 	&workload__landlock,
+	&workload__offcpu,
 };
 
 static int num_subtests(const struct test_suite *t)
diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
index 67c925f3a15a..69a9324ad7f5 100755
--- a/tools/perf/tests/shell/record_offcpu.sh
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -6,6 +6,10 @@ set -e
 
 err=0
 perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
+TEST_PROGRAM="perf test -w offcpu"
+
+ts=$(printf "%u" $((~0 << 32))) # same as the OFF_CPU_TIMESTAMP
+dummy_timestamp=${ts%???} # remove the last 3 digits, like perf script
 
 cleanup() {
   rm -f ${perfdata}
@@ -88,6 +92,27 @@ test_offcpu_child() {
   echo "Child task off-cpu test [Success]"
 }
 
+test_offcpu_direct() {
+  echo "Direct off-cpu test"
+
+  # dump off-cpu samples for task blocked for more than 1.999999s
+  # -D for initial delay, to enable evlist
+  if ! perf record -e dummy -D 500 --off-cpu --off-cpu-thresh 1999999 -o ${perfdata} ${TEST_PROGRAM} 2> /dev/null
+  then
+    echo "Direct off-cpu test [Failed record]"
+    err=1
+    return
+  fi
+  # Direct sample's timestamp should be lower than the dummy_timestamp of the at-the-end sample.
+  if ! perf script -i ${perfdata} -F time,period | sed "s/[\.:]//g" | \
+       awk "{ if (\$1 < ${dummy_timestamp} && \$2 > 1999999999) exit 0; else exit 1; }"
+  then
+    echo "Direct off-cpu test [Failed missing direct sample]"
+    err=1
+    return
+  fi
+  echo "Direct off-cpu test [Success]"
+}
 
 test_offcpu_priv
 
@@ -99,5 +124,9 @@ if [ $err = 0 ]; then
   test_offcpu_child
 fi
 
+if [ $err = 0 ]; then
+  test_offcpu_direct
+fi
+
 cleanup
 exit $err
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 6ea2be86b7bf..c7a5e27c4567 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -206,6 +206,7 @@ DECLARE_WORKLOAD(sqrtloop);
 DECLARE_WORKLOAD(brstack);
 DECLARE_WORKLOAD(datasym);
 DECLARE_WORKLOAD(landlock);
+DECLARE_WORKLOAD(offcpu);
 
 extern const char *dso_to_test;
 extern const char *test_objdump_path;
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index 5af17206f04d..0e78fd01eaf1 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -7,6 +7,7 @@ perf-test-y += sqrtloop.o
 perf-test-y += brstack.o
 perf-test-y += datasym.o
 perf-test-y += landlock.o
+perf-test-y += offcpu.o
 
 CFLAGS_sqrtloop.o         = -g -O0 -fno-inline -U_FORTIFY_SOURCE
 CFLAGS_leafloop.o         = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/offcpu.c b/tools/perf/tests/workloads/offcpu.c
new file mode 100644
index 000000000000..57cee201a4c3
--- /dev/null
+++ b/tools/perf/tests/workloads/offcpu.c
@@ -0,0 +1,16 @@
+#include <linux/compiler.h>
+#include <unistd.h>
+#include "../tests.h"
+
+static int offcpu(int argc __maybe_unused, const char **argv __maybe_unused)
+{
+	/* get past the initial delay */
+	sleep(1);
+
+	/* what we want to collect as a direct sample */
+	sleep(2);
+
+	return 0;
+}
+
+DEFINE_WORKLOAD(offcpu);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-09-30  6:51 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-27 20:27 [PATCH v6 0/8] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2024-09-27 20:27 ` [PATCH v6 1/8] perf evsel: Set off-cpu BPF output to system-wide Howard Chu
2024-09-30  5:37   ` Namhyung Kim
2024-09-27 20:27 ` [PATCH v6 2/8] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
2024-09-30  5:40   ` Namhyung Kim
2024-09-27 20:27 ` [PATCH v6 3/8] perf record --off-cpu: Parse offcpu-time event Howard Chu
2024-09-30  5:51   ` Namhyung Kim
2024-09-27 20:27 ` [PATCH v6 4/8] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
2024-09-30  6:23   ` Namhyung Kim
2024-09-27 20:27 ` [PATCH v6 5/8] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
2024-09-30  6:39   ` Namhyung Kim
2024-09-27 20:27 ` [PATCH v6 6/8] perf evsel: Delete unnecessary = 0 Howard Chu
2024-09-27 20:27 ` [PATCH v6 7/8] perf record --off-cpu: Parse BPF output embedded data Howard Chu
2024-09-30  6:51   ` Namhyung Kim
2024-09-27 20:27 ` [PATCH v6 8/8] perf test: Add direct off-cpu dumping test Howard Chu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).