* [PATCH v3 0/5] Dump off-cpu samples directly
@ 2024-07-26 10:28 Howard Chu
  2024-07-26 10:28 ` [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event Howard Chu
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Howard Chu @ 2024-07-26 10:28 UTC (permalink / raw)
  To: namhyung
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
Currently, off-cpu samples are dumped when perf record is exiting. This
results in off-cpu samples being after the regular samples. This patch
series makes possible dumping off-cpu samples on-the-fly, directly into
perf ring buffer. And it dispatches those samples to the correct format
for perf.data consumers.
Changes in v3:
 - Add off-cpu-thresh argument
 - Process direct off-cpu samples in post
Changes in v2:
 - Remove unnecessary comments.
 - Rename function off_cpu_change_type to off_cpu_prepare_parse
Before:
```
     migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
            perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
            perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
            perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
     migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
     migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
sshd  708098 [000] 18446744069.414584:     286392 offcpu-time: 
	    79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
	    585690935cca [unknown] (/usr/bin/sshd)
```
After:
```
            perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
            perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
         swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
         swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
    blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])
    blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])
         swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
     dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
```
Howard Chu (5):
  perf record off-cpu: Add direct off-cpu event
  perf record off-cpu: Dumping samples in BPF
  perf record off-cpu: processing of embedded sample
  perf record off-cpu: save embedded sample type
  perf record off-cpu: Add direct off-cpu test
 tools/perf/builtin-record.c             |   2 +
 tools/perf/builtin-script.c             |   2 +-
 tools/perf/tests/builtin-test.c         |   1 +
 tools/perf/tests/shell/record_offcpu.sh |  29 +++++
 tools/perf/tests/tests.h                |   1 +
 tools/perf/tests/workloads/Build        |   1 +
 tools/perf/tests/workloads/offcpu.c     |  16 +++
 tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
 tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
 tools/perf/util/evsel.c                 |  16 ++-
 tools/perf/util/evsel.h                 |  13 +++
 tools/perf/util/header.c                |  12 ++
 tools/perf/util/off_cpu.h               |   1 +
 tools/perf/util/record.h                |   1 +
 tools/perf/util/session.c               |  23 +++-
 15 files changed, 309 insertions(+), 5 deletions(-)
 create mode 100644 tools/perf/tests/workloads/offcpu.c
-- 
2.45.2
^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
@ 2024-07-26 10:28 ` Howard Chu
  2024-07-26 23:48   ` Ian Rogers
  2024-07-26 10:28 ` [PATCH v3 2/5] perf record off-cpu: Dumping samples in BPF Howard Chu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Howard Chu @ 2024-07-26 10:28 UTC (permalink / raw)
  To: namhyung
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Add direct off-cpu event called "offcpu-time-direct". Add a threshold to
dump direct off-cpu samples, "--off-cpu-thresh". Default value of
--off-cpu-thresh is UULONG_MAX(no direct off-cpu samples), and
--off-cpu-thresh's unit is milliseconds.
Bind fds and sample_id in off_cpu_start()
Note that we add "offcpu-time-direct" event using parse_event(), because we
need to make it no-inherit, otherwise perf_event_open() will fail.
Introduce sample_type_embed, indicating the sample_type of a sample
embedded in BPF output. More discussions in later patches.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Suggested-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-record.c   |  2 ++
 tools/perf/util/bpf_off_cpu.c | 53 ++++++++++++++++++++++++++++++++++-
 tools/perf/util/off_cpu.h     |  1 +
 tools/perf/util/record.h      |  1 +
 4 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a94516e8c522..708d48d309d6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3325,6 +3325,7 @@ static struct record record = {
 		.ctl_fd              = -1,
 		.ctl_fd_ack          = -1,
 		.synth               = PERF_SYNTH_ALL,
+		.off_cpu_thresh      = ULLONG_MAX,
 	},
 	.tool = {
 		.sample		= process_sample_event,
@@ -3557,6 +3558,7 @@ static struct option __record_options[] = {
 			    "write collected trace data into several data files using parallel threads",
 			    record__parse_threads),
 	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
+	OPT_U64(0, "off-cpu-thresh", &record.opts.off_cpu_thresh, "time threshold(in ms) for dumping off-cpu events"),
 	OPT_END()
 };
 
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 6af36142dc5a..905a11c96c5b 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -13,6 +13,7 @@
 #include "util/cgroup.h"
 #include "util/strlist.h"
 #include <bpf/bpf.h>
+#include <internal/xyarray.h>
 
 #include "bpf_skel/off_cpu.skel.h"
 
@@ -45,10 +46,12 @@ static int off_cpu_config(struct evlist *evlist)
 		.size	= sizeof(attr), /* to capture ABI version */
 	};
 	char *evname = strdup(OFFCPU_EVENT);
+	char off_cpu_direct_event[64];
 
 	if (evname == NULL)
 		return -ENOMEM;
 
+	/* off-cpu event in the end */
 	evsel = evsel__new(&attr);
 	if (!evsel) {
 		free(evname);
@@ -65,12 +68,22 @@ static int off_cpu_config(struct evlist *evlist)
 	free(evsel->name);
 	evsel->name = evname;
 
+	/* direct off-cpu event */
+	snprintf(off_cpu_direct_event, sizeof(off_cpu_direct_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT_DIRECT);
+	if (parse_event(evlist, off_cpu_direct_event)) {
+		pr_err("Failed to open off-cpu event\n");
+		return -1;
+	}
+
 	return 0;
 }
 
 static void off_cpu_start(void *arg)
 {
 	struct evlist *evlist = arg;
+	struct evsel *evsel;
+	struct perf_cpu pcpu;
+	int i, err;
 
 	/* update task filter for the given workload */
 	if (!skel->bss->has_cpu && !skel->bss->has_task &&
@@ -86,6 +99,27 @@ static void off_cpu_start(void *arg)
 		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
 	}
 
+	/* sample id and fds in BPF's perf_event_array can only be set after record__open() */
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT_DIRECT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT_DIRECT);
+		return;
+	}
+
+	if (evsel->core.id)
+		skel->bss->sample_id = evsel->core.id[0];
+
+	perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
+		err = bpf_map__update_elem(skel->maps.offcpu_output,
+					   &pcpu.cpu, sizeof(__u32),
+					   xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
+					   sizeof(__u32), BPF_ANY);
+		if (err) {
+			pr_err("Failed to update perf event map for direct off-cpu dumping\n");
+			return;
+		}
+	}
+
 	skel->bss->enabled = 1;
 }
 
@@ -130,14 +164,24 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 {
 	int err, fd, i;
 	int ncpus = 1, ntasks = 1, ncgrps = 1;
+	__u64 offcpu_thresh;
 	struct strlist *pid_slist = NULL;
 	struct str_node *pos;
+	struct evsel *evsel;
 
 	if (off_cpu_config(evlist) < 0) {
 		pr_err("Failed to config off-cpu BPF event\n");
 		return -1;
 	}
 
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT_DIRECT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT_DIRECT);
+		return -1 ;
+	}
+
+	evsel->sample_type_embed = OFFCPU_SAMPLE_TYPES;
+
 	skel = off_cpu_bpf__open();
 	if (!skel) {
 		pr_err("Failed to open off-cpu BPF skeleton\n");
@@ -250,7 +294,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 	}
 
 	if (evlist__first(evlist)->cgrp) {
-		struct evsel *evsel;
 		u8 val = 1;
 
 		skel->bss->has_cgroup = 1;
@@ -272,6 +315,14 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		}
 	}
 
+	offcpu_thresh = opts->off_cpu_thresh;
+
+	if (opts->off_cpu_thresh != ULLONG_MAX)
+		offcpu_thresh = opts->off_cpu_thresh * 1000000; /* off-cpu-thresh is in ms */
+
+	skel->bss->offcpu_thresh = offcpu_thresh;
+	skel->bss->sample_type   = OFFCPU_SAMPLE_TYPES;
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu BPF skeleton\n");
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2dd67c60f211..a349f8e300e0 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -9,6 +9,7 @@ struct perf_session;
 struct record_opts;
 
 #define OFFCPU_EVENT  "offcpu-time"
+#define OFFCPU_EVENT_DIRECT  "offcpu-time-direct"
 
 #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
 			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..3c11416e6627 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -79,6 +79,7 @@ struct record_opts {
 	int	      synth;
 	int	      threads_spec;
 	const char    *threads_user_spec;
+	u64	      off_cpu_thresh;
 };
 
 extern const char * const *record_usage;
-- 
2.45.2
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v3 2/5] perf record off-cpu: Dumping samples in BPF
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
  2024-07-26 10:28 ` [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event Howard Chu
@ 2024-07-26 10:28 ` Howard Chu
  2024-07-26 10:28 ` [PATCH v3 3/5] perf record off-cpu: processing of embedded sample Howard Chu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2024-07-26 10:28 UTC (permalink / raw)
  To: namhyung
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Add perf_event_array map for dumping direct off-cpu samples, but keep
the in-the-end approach.
Tons of checking before access to pass the BPF verifier.
If off-cpu time (represented as delta) is greater than the output
threshold, do the output.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Suggested-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 143 +++++++++++++++++++++++++
 1 file changed, 143 insertions(+)
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index d877a0a9731f..4b0412a7aa5c 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -18,6 +18,9 @@
 #define MAX_STACKS   32
 #define MAX_ENTRIES  102400
 
+#define MAX_CPUS  4096
+#define MAX_OFFCPU_LEN 128
+
 struct tstamp_data {
 	__u32 stack_id;
 	__u32 state;
@@ -32,6 +35,7 @@ struct offcpu_key {
 	__u64 cgroup_id;
 };
 
+/* for dumping at the end */
 struct {
 	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
 	__uint(key_size, sizeof(__u32));
@@ -39,6 +43,45 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } stacks SEC(".maps");
 
+struct offcpu_data {
+	u64 array[MAX_OFFCPU_LEN];
+};
+
+struct stack_data {
+	u64 array[MAX_STACKS];
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, MAX_CPUS);
+} offcpu_output SEC(".maps");
+
+/* temporary offcpu sample */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct offcpu_data));
+	__uint(max_entries, 1);
+} offcpu_payload SEC(".maps");
+
+/* temporary stack data */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct stack_data));
+	__uint(max_entries, 1);
+} stack_tmp SEC(".maps");
+
+/* cached stack per task storage */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct stack_data);
+} stack_cache SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
@@ -96,6 +139,8 @@ const volatile bool uses_cgroup_v1 = false;
 
 int perf_subsys_id = -1;
 
+__u64 sample_id, sample_type, offcpu_thresh;
+
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
  * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
@@ -182,12 +227,87 @@ static inline int can_record(struct task_struct *t, int state)
 	return 1;
 }
 
+static inline bool check_bounds(int index)
+{
+	if (index >= 0 && index < MAX_OFFCPU_LEN)
+		return true;
+
+	return false;
+}
+
+static inline int copy_stack(struct stack_data *from,
+			     struct offcpu_data *to, int n)
+{
+	int max_stacks = MAX_STACKS, len = 0;
+
+	if (!from)
+		return len;
+
+	for (int i = 0; i < max_stacks && from->array[i]; ++i) {
+		if (check_bounds(n + 2 + i)) {
+			to->array[n + 2 + i] = from->array[i];
+			++len;
+		}
+	}
+	return len;
+}
+
+static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
+			struct stack_data *stack_p, __u64 delta, __u64 timestamp)
+{
+	int size, n = 0, ip_pos = -1, len = 0;
+
+	if (sample_type & PERF_SAMPLE_IDENTIFIER && check_bounds(n))
+		data->array[n++] = sample_id;
+	if (sample_type & PERF_SAMPLE_IP && check_bounds(n)) {
+		ip_pos = n;
+		data->array[n++] = 0;  /* will be updated */
+	}
+	if (sample_type & PERF_SAMPLE_TID && check_bounds(n))
+		data->array[n++] = (u64)key->pid << 32 | key->tgid;
+	if (sample_type & PERF_SAMPLE_TIME && check_bounds(n))
+		data->array[n++] = timestamp;
+	if (sample_type & PERF_SAMPLE_ID && check_bounds(n))
+		data->array[n++] = sample_id;
+	if (sample_type & PERF_SAMPLE_CPU && check_bounds(n))
+		data->array[n++] = 0;
+	if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
+		data->array[n++] = delta;
+	if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
+		/* data->array[n] is callchain->nr (updated later) */
+		data->array[n + 1] = PERF_CONTEXT_USER;
+		data->array[n + 2] = 0;
+
+		len = copy_stack(stack_p, data, n);
+
+		/* update length of callchain */
+		data->array[n] = len + 1;
+
+		/* update sample ip with the first callchain entry */
+		if (ip_pos >= 0)
+			data->array[ip_pos] = data->array[n + 2];
+
+		/* calculate sample callchain data->array length */
+		n += len + 2;
+	}
+	if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
+		data->array[n++] = key->cgroup_id;
+
+	size = n * sizeof(u64);
+	if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
+		bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, size);
+
+	return 0;
+}
+
 static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 			struct task_struct *next, int state)
 {
 	__u64 ts;
 	__u32 stack_id;
 	struct tstamp_data *pelem;
+	struct stack_data *stack_tmp_p, *stack_p;
+	int zero = 0, len = 0;
 
 	ts = bpf_ktime_get_ns();
 
@@ -197,6 +317,22 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 	stack_id = bpf_get_stackid(ctx, &stacks,
 				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
 
+	/* temporary stack data */
+	stack_tmp_p = bpf_map_lookup_elem(&stack_tmp, &zero);
+	if (stack_tmp_p)
+		len = bpf_get_stack(ctx, stack_tmp_p->array, MAX_STACKS * sizeof(u64),
+				    BPF_F_USER_STACK) / sizeof(u64);
+
+	/* save stacks if collectable */
+	if (len > 0) {
+		stack_p = bpf_task_storage_get(&stack_cache, prev, NULL,
+					       BPF_LOCAL_STORAGE_GET_F_CREATE);
+		if (stack_p) {
+			for (int i = 0; i < len && i < MAX_STACKS; ++i)
+				stack_p->array[i] = stack_tmp_p->array[i];
+		}
+	}
+
 	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
 				     BPF_LOCAL_STORAGE_GET_F_CREATE);
 	if (!pelem)
@@ -226,6 +362,13 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 		else
 			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
 
+		if (delta >= offcpu_thresh) {
+			struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
+			stack_p = bpf_task_storage_get(&stack_cache, next, NULL, 0);
+			if (data && stack_p)
+				off_cpu_dump(ctx, data, &key, stack_p, delta, pelem->timestamp);
+		}
+
 		/* prevent to reuse the timestamp later */
 		pelem->timestamp = 0;
 	}
-- 
2.45.2
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v3 3/5] perf record off-cpu: processing of embedded sample
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
  2024-07-26 10:28 ` [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event Howard Chu
  2024-07-26 10:28 ` [PATCH v3 2/5] perf record off-cpu: Dumping samples in BPF Howard Chu
@ 2024-07-26 10:28 ` Howard Chu
  2024-07-26 10:28 ` [PATCH v3 4/5] perf record off-cpu: save embedded sample type Howard Chu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2024-07-26 10:28 UTC (permalink / raw)
  To: namhyung
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
In the first evsel__parse_sample, we parse bpf output as raw samples.
After getting the raw_data, parse it the second time to get the embedded
samples.
Also, because we rely on evlist to dump direct off-cpu samples, if we
evsel__open() a bpf output event on a specific pid, it will fail. So pid = -1 for
every bpf output event.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Suggested-by: Ian Rogers <irogers@google.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
---
 tools/perf/builtin-script.c |  2 +-
 tools/perf/util/evsel.c     | 16 ++++++++++++++--
 tools/perf/util/evsel.h     | 13 +++++++++++++
 tools/perf/util/session.c   | 23 ++++++++++++++++++++++-
 4 files changed, 50 insertions(+), 4 deletions(-)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index c16224b1fef3..c142fb44817f 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -653,7 +653,7 @@ static int perf_session__check_output_opt(struct perf_session *session)
 
 		evlist__for_each_entry(session->evlist, evsel) {
 			not_pipe = true;
-			if (evsel__has_callchain(evsel)) {
+			if (evsel__has_callchain(evsel) || evsel__embed_has_callchain(evsel)) {
 				use_callchain = true;
 				break;
 			}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index bc603193c477..6c64ec475b80 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -298,6 +298,7 @@ void evsel__init(struct evsel *evsel,
 	evsel->pmu_name      = NULL;
 	evsel->group_pmu_name = NULL;
 	evsel->skippable     = false;
+	evsel->sample_type_embed = 0;
 }
 
 struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
@@ -440,6 +441,7 @@ struct evsel *evsel__clone(struct evsel *orig)
 	evsel->weak_group = orig->weak_group;
 	evsel->use_config_name = orig->use_config_name;
 	evsel->pmu = orig->pmu;
+	evsel->sample_type_embed = orig->sample_type_embed;
 
 	if (evsel__copy_config_terms(evsel, orig) < 0)
 		goto out_err;
@@ -2282,6 +2284,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 
 			test_attr__ready();
 
+			/* BPF output event can only be system-wide, off-cpu filters tasks in BPF */
+			if (evsel__is_bpf_output(evsel))
+				pid = -1;
+
 			/* Debug message used by test scripts */
 			pr_debug2_peo("sys_perf_event_open: pid %d  cpu %d  group_fd %d  flags %#lx",
 				pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
@@ -2592,6 +2598,14 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 	 */
 	union u64_swap u;
 
+	array = event->sample.array;
+
+	/* use raw_data passed in to read embedded data */
+	if (evsel__has_embed(evsel) && evsel__is_bpf_output(evsel) && data->raw_data) {
+		array = data->raw_data;
+		type = evsel->sample_type_embed;
+	}
+
 	memset(data, 0, sizeof(*data));
 	data->cpu = data->pid = data->tid = -1;
 	data->stream_id = data->id = data->time = -1ULL;
@@ -2607,8 +2621,6 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		return perf_evsel__parse_id_sample(evsel, event, data);
 	}
 
-	array = event->sample.array;
-
 	if (perf_event__check_size(event, evsel->sample_size))
 		return -EFAULT;
 
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 80b5f6dd868e..0d25e82c6154 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -184,6 +184,9 @@ struct evsel {
 	};
 	/* Is the tool's fd for /proc/pid/stat or /proc/stat. */
 	bool pid_stat;
+
+	/* for samples embedded in BPF output */
+	__u64 sample_type_embed;
 };
 
 struct perf_missing_features {
@@ -469,6 +472,11 @@ static inline bool evsel__is_bpf_output(struct evsel *evsel)
 	return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT);
 }
 
+static inline bool evsel__has_embed(struct evsel *evsel)
+{
+	return evsel->sample_type_embed != 0;
+}
+
 static inline bool evsel__is_clock(const struct evsel *evsel)
 {
 	return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK) ||
@@ -525,6 +533,11 @@ static inline bool evsel__has_callchain(const struct evsel *evsel)
 	       evsel->synth_sample_type & PERF_SAMPLE_CALLCHAIN;
 }
 
+static inline bool evsel__embed_has_callchain(const struct evsel *evsel)
+{
+	return evsel->sample_type_embed & PERF_SAMPLE_CALLCHAIN;
+}
+
 static inline bool evsel__has_br_stack(const struct evsel *evsel)
 {
 	/*
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 5596bed1b8c8..d87d85a3e21e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1528,6 +1528,23 @@ static int evlist__deliver_sample(struct evlist *evlist, struct perf_tool *tool,
 	/* We know evsel != NULL. */
 	u64 sample_type = evsel->core.attr.sample_type;
 	u64 read_format = evsel->core.attr.read_format;
+	struct perf_sample sample_embed;
+
+	if (evsel__is_bpf_output(evsel) && evsel__has_embed(evsel) &&
+	    sample->raw_data && sample->raw_size - sizeof(__u32) > sizeof(struct perf_event_header)) {
+		int err;
+
+		sample_embed.raw_data = sample->raw_data;
+
+		err = evsel__parse_sample(evsel, event, &sample_embed);
+		if (err) {
+			pr_err("Can't parse BPF-embedded sample, err = %d\n", err);
+			return err;
+		}
+
+		sample_type = evsel->sample_type_embed;
+		sample = &sample_embed;
+	}
 
 	/* Standard sample delivery. */
 	if (!(sample_type & PERF_SAMPLE_READ))
@@ -1639,8 +1656,12 @@ static int perf_session__deliver_event(struct perf_session *session,
 				       const char *file_path)
 {
 	struct perf_sample sample;
-	int ret = evlist__parse_sample(session->evlist, event, &sample);
+	int ret;
+
+	/* set to NULL so we don't accidentally parse BPF-embedded sample */
+	sample.raw_data = NULL;
 
+	ret = evlist__parse_sample(session->evlist, event, &sample);
 	if (ret) {
 		pr_err("Can't parse sample, err = %d\n", ret);
 		return ret;
-- 
2.45.2
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v3 4/5] perf record off-cpu: save embedded sample type
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
                   ` (2 preceding siblings ...)
  2024-07-26 10:28 ` [PATCH v3 3/5] perf record off-cpu: processing of embedded sample Howard Chu
@ 2024-07-26 10:28 ` Howard Chu
  2024-07-27  0:49   ` Ian Rogers
  2024-07-26 10:28 ` [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test Howard Chu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Howard Chu @ 2024-07-26 10:28 UTC (permalink / raw)
  To: namhyung
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
We have to save the embedded sample type for it to be consumed correctly
by perf script or perf report.
This is a bad approach because it most definitely will break some
perf.data convertor. Another approach is to add this sample_type_embed
to perf_event_attr, but changing perf api is above my pay grade, so
please give me your suggestions!
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
  | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 55e9553861d0..ccb5493dc515 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -80,6 +80,7 @@ const char perf_version_string[] = PERF_VERSION;
 
 struct perf_file_attr {
 	struct perf_event_attr	attr;
+	__u64 embed;
 	struct perf_file_section	ids;
 };
 
@@ -3713,6 +3714,7 @@ static int perf_session__do_write_header(struct perf_session *session,
 		}
 		f_attr = (struct perf_file_attr){
 			.attr = evsel->core.attr,
+			.embed = evsel->sample_type_embed,
 			.ids  = {
 				.offset = evsel->id_offset,
 				.size   = evsel->core.ids * sizeof(u64),
@@ -4147,6 +4149,14 @@ static int read_attr(int fd, struct perf_header *ph,
 
 		ret = readn(fd, ptr, left);
 	}
+
+	ret = readn(fd, &f_attr->embed, sizeof(f_attr->embed));
+	if (ret <= 0) {
+		pr_debug("cannot read %d bytes of embedded sample type\n",
+			 PERF_ATTR_SIZE_VER0);
+		return -1;
+	}
+
 	/* read perf_file_section, ids are read in caller */
 	ret = readn(fd, &f_attr->ids, sizeof(f_attr->ids));
 
@@ -4272,6 +4282,8 @@ int perf_session__read_header(struct perf_session *session, int repipe_fd)
 		tmp = lseek(fd, 0, SEEK_CUR);
 		evsel = evsel__new(&f_attr.attr);
 
+		evsel->sample_type_embed = f_attr.embed;
+
 		if (evsel == NULL)
 			goto out_delete_evlist;
 
-- 
2.45.2
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
                   ` (3 preceding siblings ...)
  2024-07-26 10:28 ` [PATCH v3 4/5] perf record off-cpu: save embedded sample type Howard Chu
@ 2024-07-26 10:28 ` Howard Chu
  2024-07-27  0:54   ` Ian Rogers
  2024-07-27  1:06 ` [PATCH v3 0/5] Dump off-cpu samples directly Ian Rogers
  2024-07-29  1:21 ` Namhyung Kim
  6 siblings, 1 reply; 16+ messages in thread
From: Howard Chu @ 2024-07-26 10:28 UTC (permalink / raw)
  To: namhyung
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Add a simple workload(offcpu.c) to create the scenario for direct
off-cpu dumping.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Suggested-by: Ian Rogers <irogers@google.com>
---
 tools/perf/tests/builtin-test.c         |  1 +
 tools/perf/tests/shell/record_offcpu.sh | 29 +++++++++++++++++++++++++
 tools/perf/tests/tests.h                |  1 +
 tools/perf/tests/workloads/Build        |  1 +
 tools/perf/tests/workloads/offcpu.c     | 16 ++++++++++++++
 5 files changed, 48 insertions(+)
 create mode 100644 tools/perf/tests/workloads/offcpu.c
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index c3d84b67ca8e..5062058ad17d 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -152,6 +152,7 @@ static struct test_workload *workloads[] = {
 	&workload__sqrtloop,
 	&workload__brstack,
 	&workload__datasym,
+	&workload__offcpu,
 };
 
 static int num_subtests(const struct test_suite *t)
diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
index 67c925f3a15a..1ea0a44336e2 100755
--- a/tools/perf/tests/shell/record_offcpu.sh
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -6,6 +6,7 @@ set -e
 
 err=0
 perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
+TEST_PROGRAM="perf test -w offcpu"
 
 cleanup() {
   rm -f ${perfdata}
@@ -88,6 +89,30 @@ test_offcpu_child() {
   echo "Child task off-cpu test [Success]"
 }
 
+test_offcpu_direct() {
+  echo "Direct off-cpu test"
+  # dump off-cpu samples for tasks blocked for more than 1999ms (1.9s)
+  # -D for initial delay, which is necessary if we want to enable evlist
+  if ! perf record -F 1 -D 999 --off-cpu --off-cpu-thresh 1999 -o ${perfdata} ${TEST_PROGRAM} 2> /dev/null
+  then
+    echo "Direct off-cpu test [Failed record]"
+    err=1
+    return
+  fi
+  if ! perf evlist -i ${perfdata} | grep -q "offcpu-time-direct"
+  then
+    echo "Direct off-cpu test [Failed no event]"
+    err=1
+    return
+  fi
+  if ! perf script -i ${perfdata} | grep -q -E ".*2[0-9]{9}[ ]*offcpu-time-direct" # 2 seconds (2,000,000,000)
+  then
+    echo "Direct off-cpu test [Failed missing output]"
+    err=1
+    return
+  fi
+  echo "Direct off-cpu test [Success]"
+}
 
 test_offcpu_priv
 
@@ -99,5 +124,9 @@ if [ $err = 0 ]; then
   test_offcpu_child
 fi
 
+if [ $err = 0 ]; then
+  test_offcpu_direct
+fi
+
 cleanup
 exit $err
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 3aa7701ee0e9..84ab15683269 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -205,6 +205,7 @@ DECLARE_WORKLOAD(leafloop);
 DECLARE_WORKLOAD(sqrtloop);
 DECLARE_WORKLOAD(brstack);
 DECLARE_WORKLOAD(datasym);
+DECLARE_WORKLOAD(offcpu);
 
 extern const char *dso_to_test;
 extern const char *test_objdump_path;
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index 48bf0d3b0f3d..f37e9be8b142 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -6,6 +6,7 @@ perf-test-y += leafloop.o
 perf-test-y += sqrtloop.o
 perf-test-y += brstack.o
 perf-test-y += datasym.o
+perf-test-y += offcpu.o
 
 CFLAGS_sqrtloop.o         = -g -O0 -fno-inline -U_FORTIFY_SOURCE
 CFLAGS_leafloop.o         = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/offcpu.c b/tools/perf/tests/workloads/offcpu.c
new file mode 100644
index 000000000000..02be3d05b06d
--- /dev/null
+++ b/tools/perf/tests/workloads/offcpu.c
@@ -0,0 +1,16 @@
+#include <linux/compiler.h>
+#include <unistd.h>
+#include "../tests.h"
+
+static int offcpu(int argc __maybe_unused, const char **argv __maybe_unused)
+{
+	/* get pass initial delay */
+	sleep(1);
+
+	/* what we want to collect as a direct sample */
+	sleep(2);
+
+	return 0;
+}
+
+DEFINE_WORKLOAD(offcpu);
-- 
2.45.2
^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event
  2024-07-26 10:28 ` [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event Howard Chu
@ 2024-07-26 23:48   ` Ian Rogers
  2024-07-29 13:36     ` Howard Chu
  0 siblings, 1 reply; 16+ messages in thread
From: Ian Rogers @ 2024-07-26 23:48 UTC (permalink / raw)
  To: Howard Chu
  Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
>
> Add direct off-cpu event called "offcpu-time-direct". Add a threshold to
> dump direct off-cpu samples, "--off-cpu-thresh". Default value of
> --off-cpu-thresh is UULONG_MAX(no direct off-cpu samples), and
> --off-cpu-thresh's unit is milliseconds.
>
> Bind fds and sample_id in off_cpu_start()
>
> Note that we add "offcpu-time-direct" event using parse_event(), because we
> need to make it no-inherit, otherwise perf_event_open() will fail.
>
> Introduce sample_type_embed, indicating the sample_type of a sample
> embedded in BPF output. More discussions in later patches.
>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> Suggested-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-record.c   |  2 ++
>  tools/perf/util/bpf_off_cpu.c | 53 ++++++++++++++++++++++++++++++++++-
>  tools/perf/util/off_cpu.h     |  1 +
>  tools/perf/util/record.h      |  1 +
>  4 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index a94516e8c522..708d48d309d6 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -3325,6 +3325,7 @@ static struct record record = {
>                 .ctl_fd              = -1,
>                 .ctl_fd_ack          = -1,
>                 .synth               = PERF_SYNTH_ALL,
> +               .off_cpu_thresh      = ULLONG_MAX,
>         },
>         .tool = {
>                 .sample         = process_sample_event,
> @@ -3557,6 +3558,7 @@ static struct option __record_options[] = {
>                             "write collected trace data into several data files using parallel threads",
>                             record__parse_threads),
>         OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> +       OPT_U64(0, "off-cpu-thresh", &record.opts.off_cpu_thresh, "time threshold(in ms) for dumping off-cpu events"),
>         OPT_END()
>  };
>
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index 6af36142dc5a..905a11c96c5b 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -13,6 +13,7 @@
>  #include "util/cgroup.h"
>  #include "util/strlist.h"
>  #include <bpf/bpf.h>
> +#include <internal/xyarray.h>
>
>  #include "bpf_skel/off_cpu.skel.h"
>
> @@ -45,10 +46,12 @@ static int off_cpu_config(struct evlist *evlist)
>                 .size   = sizeof(attr), /* to capture ABI version */
>         };
>         char *evname = strdup(OFFCPU_EVENT);
> +       char off_cpu_direct_event[64];
>
>         if (evname == NULL)
>                 return -ENOMEM;
>
> +       /* off-cpu event in the end */
>         evsel = evsel__new(&attr);
>         if (!evsel) {
>                 free(evname);
> @@ -65,12 +68,22 @@ static int off_cpu_config(struct evlist *evlist)
>         free(evsel->name);
>         evsel->name = evname;
>
> +       /* direct off-cpu event */
> +       snprintf(off_cpu_direct_event, sizeof(off_cpu_direct_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT_DIRECT);
> +       if (parse_event(evlist, off_cpu_direct_event)) {
> +               pr_err("Failed to open off-cpu event\n");
> +               return -1;
> +       }
> +
>         return 0;
>  }
>
>  static void off_cpu_start(void *arg)
>  {
>         struct evlist *evlist = arg;
> +       struct evsel *evsel;
> +       struct perf_cpu pcpu;
> +       int i, err;
>
>         /* update task filter for the given workload */
>         if (!skel->bss->has_cpu && !skel->bss->has_task &&
> @@ -86,6 +99,27 @@ static void off_cpu_start(void *arg)
>                 bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
>         }
>
> +       /* sample id and fds in BPF's perf_event_array can only be set after record__open() */
> +       evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT_DIRECT);
> +       if (evsel == NULL) {
> +               pr_err("%s evsel not found\n", OFFCPU_EVENT_DIRECT);
> +               return;
> +       }
> +
> +       if (evsel->core.id)
> +               skel->bss->sample_id = evsel->core.id[0];
> +
> +       perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
> +               err = bpf_map__update_elem(skel->maps.offcpu_output,
> +                                          &pcpu.cpu, sizeof(__u32),
> +                                          xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
> +                                          sizeof(__u32), BPF_ANY);
> +               if (err) {
> +                       pr_err("Failed to update perf event map for direct off-cpu dumping\n");
> +                       return;
> +               }
> +       }
> +
>         skel->bss->enabled = 1;
>  }
>
> @@ -130,14 +164,24 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>  {
>         int err, fd, i;
>         int ncpus = 1, ntasks = 1, ncgrps = 1;
> +       __u64 offcpu_thresh;
>         struct strlist *pid_slist = NULL;
>         struct str_node *pos;
> +       struct evsel *evsel;
>
>         if (off_cpu_config(evlist) < 0) {
>                 pr_err("Failed to config off-cpu BPF event\n");
>                 return -1;
>         }
>
> +       evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT_DIRECT);
> +       if (evsel == NULL) {
> +               pr_err("%s evsel not found\n", OFFCPU_EVENT_DIRECT);
> +               return -1 ;
> +       }
> +
> +       evsel->sample_type_embed = OFFCPU_SAMPLE_TYPES;
> +
>         skel = off_cpu_bpf__open();
>         if (!skel) {
>                 pr_err("Failed to open off-cpu BPF skeleton\n");
> @@ -250,7 +294,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>         }
>
>         if (evlist__first(evlist)->cgrp) {
> -               struct evsel *evsel;
>                 u8 val = 1;
>
>                 skel->bss->has_cgroup = 1;
> @@ -272,6 +315,14 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
>                 }
>         }
>
> +       offcpu_thresh = opts->off_cpu_thresh;
> +
> +       if (opts->off_cpu_thresh != ULLONG_MAX)
> +               offcpu_thresh = opts->off_cpu_thresh * 1000000; /* off-cpu-thresh is in ms */
nit: In this comment, it's not clear if you are referring to the
option or the variable. In modern languages it is usual to have some
kind of "duration" type. As we're using u64s I'd be tempted to add a
"_ms" suffix, just to make clear what the units for the time are. I
think that'd make this:
offcpu_thresh_ms = opts->off_cpu_thresh_ns * 1000000
Thanks,
Ian
> +
> +       skel->bss->offcpu_thresh = offcpu_thresh;
> +       skel->bss->sample_type   = OFFCPU_SAMPLE_TYPES;
> +
>         err = off_cpu_bpf__attach(skel);
>         if (err) {
>                 pr_err("Failed to attach off-cpu BPF skeleton\n");
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> index 2dd67c60f211..a349f8e300e0 100644
> --- a/tools/perf/util/off_cpu.h
> +++ b/tools/perf/util/off_cpu.h
> @@ -9,6 +9,7 @@ struct perf_session;
>  struct record_opts;
>
>  #define OFFCPU_EVENT  "offcpu-time"
> +#define OFFCPU_EVENT_DIRECT  "offcpu-time-direct"
>
>  #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
>                               PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index a6566134e09e..3c11416e6627 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -79,6 +79,7 @@ struct record_opts {
>         int           synth;
>         int           threads_spec;
>         const char    *threads_user_spec;
> +       u64           off_cpu_thresh;
>  };
>
>  extern const char * const *record_usage;
> --
> 2.45.2
>
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/5] perf record off-cpu: save embedded sample type
  2024-07-26 10:28 ` [PATCH v3 4/5] perf record off-cpu: save embedded sample type Howard Chu
@ 2024-07-27  0:49   ` Ian Rogers
  0 siblings, 0 replies; 16+ messages in thread
From: Ian Rogers @ 2024-07-27  0:49 UTC (permalink / raw)
  To: Howard Chu
  Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
>
> We have to save the embedded sample type for it to be consumed correctly
> by perf script or perf report.
>
> This is a bad approach because it most definitely will break some
> perf.data convertor. Another approach is to add this sample_type_embed
> to perf_event_attr, but changing perf api is above my pay grade, so
> please give me your suggestions!
>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  tools/perf/util/header.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
> index 55e9553861d0..ccb5493dc515 100644
> --- a/tools/perf/util/header.c
> +++ b/tools/perf/util/header.c
> @@ -80,6 +80,7 @@ const char perf_version_string[] = PERF_VERSION;
>
>  struct perf_file_attr {
>         struct perf_event_attr  attr;
> +       __u64 embed;
>         struct perf_file_section        ids;
>  };
>
> @@ -3713,6 +3714,7 @@ static int perf_session__do_write_header(struct perf_session *session,
>                 }
>                 f_attr = (struct perf_file_attr){
>                         .attr = evsel->core.attr,
> +                       .embed = evsel->sample_type_embed,
>                         .ids  = {
>                                 .offset = evsel->id_offset,
>                                 .size   = evsel->core.ids * sizeof(u64),
> @@ -4147,6 +4149,14 @@ static int read_attr(int fd, struct perf_header *ph,
>
>                 ret = readn(fd, ptr, left);
>         }
> +
> +       ret = readn(fd, &f_attr->embed, sizeof(f_attr->embed));
> +       if (ret <= 0) {
> +               pr_debug("cannot read %d bytes of embedded sample type\n",
> +                        PERF_ATTR_SIZE_VER0);
nit: sizeof will likely be larger than this, so prefer to this
constant? Maybe errno will be more useful.
Thanks,
Ian
> +               return -1;
> +       }
> +
>         /* read perf_file_section, ids are read in caller */
>         ret = readn(fd, &f_attr->ids, sizeof(f_attr->ids));
>
> @@ -4272,6 +4282,8 @@ int perf_session__read_header(struct perf_session *session, int repipe_fd)
>                 tmp = lseek(fd, 0, SEEK_CUR);
>                 evsel = evsel__new(&f_attr.attr);
>
> +               evsel->sample_type_embed = f_attr.embed;
> +
>                 if (evsel == NULL)
>                         goto out_delete_evlist;
>
> --
> 2.45.2
>
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test
  2024-07-26 10:28 ` [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test Howard Chu
@ 2024-07-27  0:54   ` Ian Rogers
  2024-07-29 13:29     ` Howard Chu
  0 siblings, 1 reply; 16+ messages in thread
From: Ian Rogers @ 2024-07-27  0:54 UTC (permalink / raw)
  To: Howard Chu
  Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
>
> Add a simple workload(offcpu.c) to create the scenario for direct
> off-cpu dumping.
>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> Suggested-by: Ian Rogers <irogers@google.com>
I tried with and without --off-cpu-thresh=1 but I only see 1
offcpu-time sample and no offcpu-time-direct.
Thanks,
Ian
> ---
>  tools/perf/tests/builtin-test.c         |  1 +
>  tools/perf/tests/shell/record_offcpu.sh | 29 +++++++++++++++++++++++++
>  tools/perf/tests/tests.h                |  1 +
>  tools/perf/tests/workloads/Build        |  1 +
>  tools/perf/tests/workloads/offcpu.c     | 16 ++++++++++++++
>  5 files changed, 48 insertions(+)
>  create mode 100644 tools/perf/tests/workloads/offcpu.c
>
> diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
> index c3d84b67ca8e..5062058ad17d 100644
> --- a/tools/perf/tests/builtin-test.c
> +++ b/tools/perf/tests/builtin-test.c
> @@ -152,6 +152,7 @@ static struct test_workload *workloads[] = {
>         &workload__sqrtloop,
>         &workload__brstack,
>         &workload__datasym,
> +       &workload__offcpu,
>  };
>
>  static int num_subtests(const struct test_suite *t)
> diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
> index 67c925f3a15a..1ea0a44336e2 100755
> --- a/tools/perf/tests/shell/record_offcpu.sh
> +++ b/tools/perf/tests/shell/record_offcpu.sh
> @@ -6,6 +6,7 @@ set -e
>
>  err=0
>  perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
> +TEST_PROGRAM="perf test -w offcpu"
>
>  cleanup() {
>    rm -f ${perfdata}
> @@ -88,6 +89,30 @@ test_offcpu_child() {
>    echo "Child task off-cpu test [Success]"
>  }
>
> +test_offcpu_direct() {
> +  echo "Direct off-cpu test"
> +  # dump off-cpu samples for tasks blocked for more than 1999ms (1.9s)
> +  # -D for initial delay, which is necessary if we want to enable evlist
> +  if ! perf record -F 1 -D 999 --off-cpu --off-cpu-thresh 1999 -o ${perfdata} ${TEST_PROGRAM} 2> /dev/null
> +  then
> +    echo "Direct off-cpu test [Failed record]"
> +    err=1
> +    return
> +  fi
> +  if ! perf evlist -i ${perfdata} | grep -q "offcpu-time-direct"
> +  then
> +    echo "Direct off-cpu test [Failed no event]"
> +    err=1
> +    return
> +  fi
> +  if ! perf script -i ${perfdata} | grep -q -E ".*2[0-9]{9}[ ]*offcpu-time-direct" # 2 seconds (2,000,000,000)
> +  then
> +    echo "Direct off-cpu test [Failed missing output]"
> +    err=1
> +    return
> +  fi
> +  echo "Direct off-cpu test [Success]"
> +}
>
>  test_offcpu_priv
>
> @@ -99,5 +124,9 @@ if [ $err = 0 ]; then
>    test_offcpu_child
>  fi
>
> +if [ $err = 0 ]; then
> +  test_offcpu_direct
> +fi
> +
>  cleanup
>  exit $err
> diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
> index 3aa7701ee0e9..84ab15683269 100644
> --- a/tools/perf/tests/tests.h
> +++ b/tools/perf/tests/tests.h
> @@ -205,6 +205,7 @@ DECLARE_WORKLOAD(leafloop);
>  DECLARE_WORKLOAD(sqrtloop);
>  DECLARE_WORKLOAD(brstack);
>  DECLARE_WORKLOAD(datasym);
> +DECLARE_WORKLOAD(offcpu);
>
>  extern const char *dso_to_test;
>  extern const char *test_objdump_path;
> diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
> index 48bf0d3b0f3d..f37e9be8b142 100644
> --- a/tools/perf/tests/workloads/Build
> +++ b/tools/perf/tests/workloads/Build
> @@ -6,6 +6,7 @@ perf-test-y += leafloop.o
>  perf-test-y += sqrtloop.o
>  perf-test-y += brstack.o
>  perf-test-y += datasym.o
> +perf-test-y += offcpu.o
>
>  CFLAGS_sqrtloop.o         = -g -O0 -fno-inline -U_FORTIFY_SOURCE
>  CFLAGS_leafloop.o         = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
> diff --git a/tools/perf/tests/workloads/offcpu.c b/tools/perf/tests/workloads/offcpu.c
> new file mode 100644
> index 000000000000..02be3d05b06d
> --- /dev/null
> +++ b/tools/perf/tests/workloads/offcpu.c
> @@ -0,0 +1,16 @@
> +#include <linux/compiler.h>
> +#include <unistd.h>
> +#include "../tests.h"
> +
> +static int offcpu(int argc __maybe_unused, const char **argv __maybe_unused)
> +{
> +       /* get pass initial delay */
> +       sleep(1);
> +
> +       /* what we want to collect as a direct sample */
> +       sleep(2);
> +
> +       return 0;
> +}
> +
> +DEFINE_WORKLOAD(offcpu);
> --
> 2.45.2
>
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 0/5] Dump off-cpu samples directly
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
                   ` (4 preceding siblings ...)
  2024-07-26 10:28 ` [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test Howard Chu
@ 2024-07-27  1:06 ` Ian Rogers
  2024-07-29  1:21 ` Namhyung Kim
  6 siblings, 0 replies; 16+ messages in thread
From: Ian Rogers @ 2024-07-27  1:06 UTC (permalink / raw)
  To: Howard Chu
  Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
>
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
>
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples being after the regular samples. This patch
> series makes possible dumping off-cpu samples on-the-fly, directly into
> perf ring buffer. And it dispatches those samples to the correct format
> for perf.data consumers.
>
> Changes in v3:
>  - Add off-cpu-thresh argument
>  - Process direct off-cpu samples in post
>
> Changes in v2:
>  - Remove unnecessary comments.
>  - Rename function off_cpu_change_type to off_cpu_prepare_parse
>
> Before:
> ```
>      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
>             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
>      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
>      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
>
> sshd  708098 [000] 18446744069.414584:     286392 offcpu-time:
>             79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
>             585690935cca [unknown] (/usr/bin/sshd)
> ```
>
> After:
> ```
>             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
>             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
>          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct:
>             7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
>             7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
>             7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
>             7fff24e862d8 [unknown] ([unknown])
>
>
>     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct:
>             7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
>             7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
>             7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
>             7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
>             7fff24e862d8 [unknown] ([unknown])
>
>
>          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> ```
>
> Howard Chu (5):
>   perf record off-cpu: Add direct off-cpu event
>   perf record off-cpu: Dumping samples in BPF
>   perf record off-cpu: processing of embedded sample
>   perf record off-cpu: save embedded sample type
>   perf record off-cpu: Add direct off-cpu test
This worked nicely on some tests with me. I'm not sure about
separating offcpu-time and offcpu-time-direct rather than the just 1
event. I also immediately set --off-cpu-thresh=1 so maybe milliseconds
should be microseconds, and the default shouldn't be to not use the
direct samples.
Thanks,
Ian
>  tools/perf/builtin-record.c             |   2 +
>  tools/perf/builtin-script.c             |   2 +-
>  tools/perf/tests/builtin-test.c         |   1 +
>  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
>  tools/perf/tests/tests.h                |   1 +
>  tools/perf/tests/workloads/Build        |   1 +
>  tools/perf/tests/workloads/offcpu.c     |  16 +++
>  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
>  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
>  tools/perf/util/evsel.c                 |  16 ++-
>  tools/perf/util/evsel.h                 |  13 +++
>  tools/perf/util/header.c                |  12 ++
>  tools/perf/util/off_cpu.h               |   1 +
>  tools/perf/util/record.h                |   1 +
>  tools/perf/util/session.c               |  23 +++-
>  15 files changed, 309 insertions(+), 5 deletions(-)
>  create mode 100644 tools/perf/tests/workloads/offcpu.c
>
> --
> 2.45.2
>
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 0/5] Dump off-cpu samples directly
  2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
                   ` (5 preceding siblings ...)
  2024-07-27  1:06 ` [PATCH v3 0/5] Dump off-cpu samples directly Ian Rogers
@ 2024-07-29  1:21 ` Namhyung Kim
  2024-07-29 15:24   ` Howard Chu
  6 siblings, 1 reply; 16+ messages in thread
From: Namhyung Kim @ 2024-07-29  1:21 UTC (permalink / raw)
  To: Howard Chu
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Hello Howard,
On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> 
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples being after the regular samples. This patch
> series makes possible dumping off-cpu samples on-the-fly, directly into
> perf ring buffer. And it dispatches those samples to the correct format
> for perf.data consumers.
Thanks for your work!
But I'm not sure we need a separate event for offcpu-time-direct.  If we
fix the format for the direct event, we can adjust the format of offcpu-
time when it dumps at the end.
Anyway, as far as I can see you don't need to fill the sample info in
the offcpu-time-direct manually in your BPF program.  Because the
bpf_perf_event_output() will call perf_event_output() which fills all
the sample information according to the sample_type flags.
Well.. it'll set IP to the schedule function, but it should be ok.
(updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
not absoluately necessary, probably I can get rid of it..  Let's go with
simple for now.)
So I think what you need is to ensure it has the uncessary flags.  And
the only info it needs to fill is the time between the previous schedule
and this can be added to the raw data.
Thanks,
Namhyung
> 
> Changes in v3:
>  - Add off-cpu-thresh argument
>  - Process direct off-cpu samples in post
> 
> Changes in v2:
>  - Remove unnecessary comments.
>  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> 
> Before:
> ```
>      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
>             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
>             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
>      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
>      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> 
> sshd  708098 [000] 18446744069.414584:     286392 offcpu-time: 
> 	    79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> 	    585690935cca [unknown] (/usr/bin/sshd)
> ```
> 
> After:
> ```
>             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
>             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
>          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct: 
> 	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 	    7fff24e862d8 [unknown] ([unknown])
> 
> 
>     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct: 
> 	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 	    7fff24e862d8 [unknown] ([unknown])
> 
> 
>          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
>      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> ```
> 
> Howard Chu (5):
>   perf record off-cpu: Add direct off-cpu event
>   perf record off-cpu: Dumping samples in BPF
>   perf record off-cpu: processing of embedded sample
>   perf record off-cpu: save embedded sample type
>   perf record off-cpu: Add direct off-cpu test
> 
>  tools/perf/builtin-record.c             |   2 +
>  tools/perf/builtin-script.c             |   2 +-
>  tools/perf/tests/builtin-test.c         |   1 +
>  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
>  tools/perf/tests/tests.h                |   1 +
>  tools/perf/tests/workloads/Build        |   1 +
>  tools/perf/tests/workloads/offcpu.c     |  16 +++
>  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
>  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
>  tools/perf/util/evsel.c                 |  16 ++-
>  tools/perf/util/evsel.h                 |  13 +++
>  tools/perf/util/header.c                |  12 ++
>  tools/perf/util/off_cpu.h               |   1 +
>  tools/perf/util/record.h                |   1 +
>  tools/perf/util/session.c               |  23 +++-
>  15 files changed, 309 insertions(+), 5 deletions(-)
>  create mode 100644 tools/perf/tests/workloads/offcpu.c
> 
> -- 
> 2.45.2
> 
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test
  2024-07-27  0:54   ` Ian Rogers
@ 2024-07-29 13:29     ` Howard Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2024-07-29 13:29 UTC (permalink / raw)
  To: Ian Rogers
  Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Hello Ian,
Thanks for reviewing this patch.
On Sat, Jul 27, 2024 at 8:54 AM Ian Rogers <irogers@google.com> wrote:
>
> On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Add a simple workload(offcpu.c) to create the scenario for direct
> > off-cpu dumping.
> >
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > Suggested-by: Ian Rogers <irogers@google.com>
>
> I tried with and without --off-cpu-thresh=1 but I only see 1
> offcpu-time sample and no offcpu-time-direct.
TL;DR:
Please use 'perf record -F 1 -D 999 --off-cpu --off-cpu-thresh 1999
perf test -w offcpu' :)
We need to give it an initial delay, using '-D 999'.
That's because when using "perf record <workload>", evlist is not
enabled if you don't give it an initial delay or specify cpu. There
are two occurrences of evlist__enable() in builtin-record.c:
if (!target__none(&opts->target) && !opts->target.initial_delay)
    evlist__enable(rec->evlist);
Here, if you perf record a workload, say "perf record --off-cpu
--off-cpu-thresh=1 perf test -w offcpu", the
"!target__none(&opts->target)" will always be 0, so no enabling.
if (opts->target.initial_delay) {
    pr_info(EVLIST_DISABLED_MSG);
    if (opts->target.initial_delay > 0) {
        usleep(opts->target.initial_delay * USEC_PER_MSEC);
        evlist__enable(rec->evlist);
        pr_info(EVLIST_ENABLED_MSG);
    }
}
But we can do '-D 999' to give it an initial delay, so the evlist does
get enabled, now BPF can perf_event_output() a direct off-cpu sample
to the evlist. This is also the approach I used in the
'tests/shell/record_offcpu.sh'.
```
  if ! perf record -F 1 -D 999 --off-cpu --off-cpu-thresh 1999 -o
${perfdata} ${TEST_PROGRAM} 2> /dev/null
```
Thanks,
Howard
>
> Thanks,
> Ian
>
> > ---
> >  tools/perf/tests/builtin-test.c         |  1 +
> >  tools/perf/tests/shell/record_offcpu.sh | 29 +++++++++++++++++++++++++
> >  tools/perf/tests/tests.h                |  1 +
> >  tools/perf/tests/workloads/Build        |  1 +
> >  tools/perf/tests/workloads/offcpu.c     | 16 ++++++++++++++
> >  5 files changed, 48 insertions(+)
> >  create mode 100644 tools/perf/tests/workloads/offcpu.c
> >
> > diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
> > index c3d84b67ca8e..5062058ad17d 100644
> > --- a/tools/perf/tests/builtin-test.c
> > +++ b/tools/perf/tests/builtin-test.c
> > @@ -152,6 +152,7 @@ static struct test_workload *workloads[] = {
> >         &workload__sqrtloop,
> >         &workload__brstack,
> >         &workload__datasym,
> > +       &workload__offcpu,
> >  };
> >
> >  static int num_subtests(const struct test_suite *t)
> > diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
> > index 67c925f3a15a..1ea0a44336e2 100755
> > --- a/tools/perf/tests/shell/record_offcpu.sh
> > +++ b/tools/perf/tests/shell/record_offcpu.sh
> > @@ -6,6 +6,7 @@ set -e
> >
> >  err=0
> >  perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
> > +TEST_PROGRAM="perf test -w offcpu"
> >
> >  cleanup() {
> >    rm -f ${perfdata}
> > @@ -88,6 +89,30 @@ test_offcpu_child() {
> >    echo "Child task off-cpu test [Success]"
> >  }
> >
> > +test_offcpu_direct() {
> > +  echo "Direct off-cpu test"
> > +  # dump off-cpu samples for tasks blocked for more than 1999ms (1.9s)
> > +  # -D for initial delay, which is necessary if we want to enable evlist
> > +  if ! perf record -F 1 -D 999 --off-cpu --off-cpu-thresh 1999 -o ${perfdata} ${TEST_PROGRAM} 2> /dev/null
> > +  then
> > +    echo "Direct off-cpu test [Failed record]"
> > +    err=1
> > +    return
> > +  fi
> > +  if ! perf evlist -i ${perfdata} | grep -q "offcpu-time-direct"
> > +  then
> > +    echo "Direct off-cpu test [Failed no event]"
> > +    err=1
> > +    return
> > +  fi
> > +  if ! perf script -i ${perfdata} | grep -q -E ".*2[0-9]{9}[ ]*offcpu-time-direct" # 2 seconds (2,000,000,000)
> > +  then
> > +    echo "Direct off-cpu test [Failed missing output]"
> > +    err=1
> > +    return
> > +  fi
> > +  echo "Direct off-cpu test [Success]"
> > +}
> >
> >  test_offcpu_priv
> >
> > @@ -99,5 +124,9 @@ if [ $err = 0 ]; then
> >    test_offcpu_child
> >  fi
> >
> > +if [ $err = 0 ]; then
> > +  test_offcpu_direct
> > +fi
> > +
> >  cleanup
> >  exit $err
> > diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
> > index 3aa7701ee0e9..84ab15683269 100644
> > --- a/tools/perf/tests/tests.h
> > +++ b/tools/perf/tests/tests.h
> > @@ -205,6 +205,7 @@ DECLARE_WORKLOAD(leafloop);
> >  DECLARE_WORKLOAD(sqrtloop);
> >  DECLARE_WORKLOAD(brstack);
> >  DECLARE_WORKLOAD(datasym);
> > +DECLARE_WORKLOAD(offcpu);
> >
> >  extern const char *dso_to_test;
> >  extern const char *test_objdump_path;
> > diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
> > index 48bf0d3b0f3d..f37e9be8b142 100644
> > --- a/tools/perf/tests/workloads/Build
> > +++ b/tools/perf/tests/workloads/Build
> > @@ -6,6 +6,7 @@ perf-test-y += leafloop.o
> >  perf-test-y += sqrtloop.o
> >  perf-test-y += brstack.o
> >  perf-test-y += datasym.o
> > +perf-test-y += offcpu.o
> >
> >  CFLAGS_sqrtloop.o         = -g -O0 -fno-inline -U_FORTIFY_SOURCE
> >  CFLAGS_leafloop.o         = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
> > diff --git a/tools/perf/tests/workloads/offcpu.c b/tools/perf/tests/workloads/offcpu.c
> > new file mode 100644
> > index 000000000000..02be3d05b06d
> > --- /dev/null
> > +++ b/tools/perf/tests/workloads/offcpu.c
> > @@ -0,0 +1,16 @@
> > +#include <linux/compiler.h>
> > +#include <unistd.h>
> > +#include "../tests.h"
> > +
> > +static int offcpu(int argc __maybe_unused, const char **argv __maybe_unused)
> > +{
> > +       /* get pass initial delay */
> > +       sleep(1);
> > +
> > +       /* what we want to collect as a direct sample */
> > +       sleep(2);
> > +
> > +       return 0;
> > +}
> > +
> > +DEFINE_WORKLOAD(offcpu);
> > --
> > 2.45.2
> >
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event
  2024-07-26 23:48   ` Ian Rogers
@ 2024-07-29 13:36     ` Howard Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2024-07-29 13:36 UTC (permalink / raw)
  To: Ian Rogers
  Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Hello Ian,
On Sat, Jul 27, 2024 at 7:49 AM Ian Rogers <irogers@google.com> wrote:
>
> On Fri, Jul 26, 2024 at 3:28 AM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Add direct off-cpu event called "offcpu-time-direct". Add a threshold to
> > dump direct off-cpu samples, "--off-cpu-thresh". Default value of
> > --off-cpu-thresh is UULONG_MAX(no direct off-cpu samples), and
> > --off-cpu-thresh's unit is milliseconds.
> >
> > Bind fds and sample_id in off_cpu_start()
> >
> > Note that we add "offcpu-time-direct" event using parse_event(), because we
> > need to make it no-inherit, otherwise perf_event_open() will fail.
> >
> > Introduce sample_type_embed, indicating the sample_type of a sample
> > embedded in BPF output. More discussions in later patches.
> >
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > Suggested-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/builtin-record.c   |  2 ++
> >  tools/perf/util/bpf_off_cpu.c | 53 ++++++++++++++++++++++++++++++++++-
> >  tools/perf/util/off_cpu.h     |  1 +
> >  tools/perf/util/record.h      |  1 +
> >  4 files changed, 56 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index a94516e8c522..708d48d309d6 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -3325,6 +3325,7 @@ static struct record record = {
> >                 .ctl_fd              = -1,
> >                 .ctl_fd_ack          = -1,
> >                 .synth               = PERF_SYNTH_ALL,
> > +               .off_cpu_thresh      = ULLONG_MAX,
> >         },
> >         .tool = {
> >                 .sample         = process_sample_event,
> > @@ -3557,6 +3558,7 @@ static struct option __record_options[] = {
> >                             "write collected trace data into several data files using parallel threads",
> >                             record__parse_threads),
> >         OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> > +       OPT_U64(0, "off-cpu-thresh", &record.opts.off_cpu_thresh, "time threshold(in ms) for dumping off-cpu events"),
> >         OPT_END()
> >  };
> >
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > index 6af36142dc5a..905a11c96c5b 100644
> > --- a/tools/perf/util/bpf_off_cpu.c
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -13,6 +13,7 @@
> >  #include "util/cgroup.h"
> >  #include "util/strlist.h"
> >  #include <bpf/bpf.h>
> > +#include <internal/xyarray.h>
> >
> >  #include "bpf_skel/off_cpu.skel.h"
> >
> > @@ -45,10 +46,12 @@ static int off_cpu_config(struct evlist *evlist)
> >                 .size   = sizeof(attr), /* to capture ABI version */
> >         };
> >         char *evname = strdup(OFFCPU_EVENT);
> > +       char off_cpu_direct_event[64];
> >
> >         if (evname == NULL)
> >                 return -ENOMEM;
> >
> > +       /* off-cpu event in the end */
> >         evsel = evsel__new(&attr);
> >         if (!evsel) {
> >                 free(evname);
> > @@ -65,12 +68,22 @@ static int off_cpu_config(struct evlist *evlist)
> >         free(evsel->name);
> >         evsel->name = evname;
> >
> > +       /* direct off-cpu event */
> > +       snprintf(off_cpu_direct_event, sizeof(off_cpu_direct_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT_DIRECT);
> > +       if (parse_event(evlist, off_cpu_direct_event)) {
> > +               pr_err("Failed to open off-cpu event\n");
> > +               return -1;
> > +       }
> > +
> >         return 0;
> >  }
> >
> >  static void off_cpu_start(void *arg)
> >  {
> >         struct evlist *evlist = arg;
> > +       struct evsel *evsel;
> > +       struct perf_cpu pcpu;
> > +       int i, err;
> >
> >         /* update task filter for the given workload */
> >         if (!skel->bss->has_cpu && !skel->bss->has_task &&
> > @@ -86,6 +99,27 @@ static void off_cpu_start(void *arg)
> >                 bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> >         }
> >
> > +       /* sample id and fds in BPF's perf_event_array can only be set after record__open() */
> > +       evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT_DIRECT);
> > +       if (evsel == NULL) {
> > +               pr_err("%s evsel not found\n", OFFCPU_EVENT_DIRECT);
> > +               return;
> > +       }
> > +
> > +       if (evsel->core.id)
> > +               skel->bss->sample_id = evsel->core.id[0];
> > +
> > +       perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
> > +               err = bpf_map__update_elem(skel->maps.offcpu_output,
> > +                                          &pcpu.cpu, sizeof(__u32),
> > +                                          xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
> > +                                          sizeof(__u32), BPF_ANY);
> > +               if (err) {
> > +                       pr_err("Failed to update perf event map for direct off-cpu dumping\n");
> > +                       return;
> > +               }
> > +       }
> > +
> >         skel->bss->enabled = 1;
> >  }
> >
> > @@ -130,14 +164,24 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
> >  {
> >         int err, fd, i;
> >         int ncpus = 1, ntasks = 1, ncgrps = 1;
> > +       __u64 offcpu_thresh;
> >         struct strlist *pid_slist = NULL;
> >         struct str_node *pos;
> > +       struct evsel *evsel;
> >
> >         if (off_cpu_config(evlist) < 0) {
> >                 pr_err("Failed to config off-cpu BPF event\n");
> >                 return -1;
> >         }
> >
> > +       evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT_DIRECT);
> > +       if (evsel == NULL) {
> > +               pr_err("%s evsel not found\n", OFFCPU_EVENT_DIRECT);
> > +               return -1 ;
> > +       }
> > +
> > +       evsel->sample_type_embed = OFFCPU_SAMPLE_TYPES;
> > +
> >         skel = off_cpu_bpf__open();
> >         if (!skel) {
> >                 pr_err("Failed to open off-cpu BPF skeleton\n");
> > @@ -250,7 +294,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
> >         }
> >
> >         if (evlist__first(evlist)->cgrp) {
> > -               struct evsel *evsel;
> >                 u8 val = 1;
> >
> >                 skel->bss->has_cgroup = 1;
> > @@ -272,6 +315,14 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
> >                 }
> >         }
> >
> > +       offcpu_thresh = opts->off_cpu_thresh;
> > +
> > +       if (opts->off_cpu_thresh != ULLONG_MAX)
> > +               offcpu_thresh = opts->off_cpu_thresh * 1000000; /* off-cpu-thresh is in ms */
>
> nit: In this comment, it's not clear if you are referring to the
> option or the variable. In modern languages it is usual to have some
> kind of "duration" type. As we're using u64s I'd be tempted to add a
> "_ms" suffix, just to make clear what the units for the time are. I
> think that'd make this:
> offcpu_thresh_ms = opts->off_cpu_thresh_ns * 1000000
I'll do this, but in microseconds as you requested in another review.
Thanks,
Howard
>
> Thanks,
> Ian
>
> > +
> > +       skel->bss->offcpu_thresh = offcpu_thresh;
> > +       skel->bss->sample_type   = OFFCPU_SAMPLE_TYPES;
> > +
> >         err = off_cpu_bpf__attach(skel);
> >         if (err) {
> >                 pr_err("Failed to attach off-cpu BPF skeleton\n");
> > diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> > index 2dd67c60f211..a349f8e300e0 100644
> > --- a/tools/perf/util/off_cpu.h
> > +++ b/tools/perf/util/off_cpu.h
> > @@ -9,6 +9,7 @@ struct perf_session;
> >  struct record_opts;
> >
> >  #define OFFCPU_EVENT  "offcpu-time"
> > +#define OFFCPU_EVENT_DIRECT  "offcpu-time-direct"
> >
> >  #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
> >                               PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
> > diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> > index a6566134e09e..3c11416e6627 100644
> > --- a/tools/perf/util/record.h
> > +++ b/tools/perf/util/record.h
> > @@ -79,6 +79,7 @@ struct record_opts {
> >         int           synth;
> >         int           threads_spec;
> >         const char    *threads_user_spec;
> > +       u64           off_cpu_thresh;
> >  };
> >
> >  extern const char * const *record_usage;
> > --
> > 2.45.2
> >
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 0/5] Dump off-cpu samples directly
  2024-07-29  1:21 ` Namhyung Kim
@ 2024-07-29 15:24   ` Howard Chu
  2024-07-31 17:46     ` Namhyung Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Howard Chu @ 2024-07-29 15:24 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
Hello Namhyung,
Thanks for reviewing this patch.
On Mon, Jul 29, 2024 at 9:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello Howard,
>
> On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record is exiting. This
> > results in off-cpu samples being after the regular samples. This patch
> > series makes possible dumping off-cpu samples on-the-fly, directly into
> > perf ring buffer. And it dispatches those samples to the correct format
> > for perf.data consumers.
>
> Thanks for your work!
>
> But I'm not sure we need a separate event for offcpu-time-direct.  If we
> fix the format for the direct event, we can adjust the format of offcpu-
> time when it dumps at the end.
Thank you and Ian for this advice, I'll do that.
>
> Anyway, as far as I can see you don't need to fill the sample info in
> the offcpu-time-direct manually in your BPF program.  Because the
> bpf_perf_event_output() will call perf_event_output() which fills all
> the sample information according to the sample_type flags.
>
> Well.. it'll set IP to the schedule function, but it should be ok.
> (updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
> not absoluately necessary, probably I can get rid of it..  Let's go with
> simple for now.)
>
> So I think what you need is to ensure it has the uncessary flags.  And
> the only info it needs to fill is the time between the previous schedule
> and this can be added to the raw data.
Sure thing, thank you. Other than the off-cpu duration, do you think
we should collect the callchain as well?
This idea is great because it minimizes the data we need to transfer
from kernel space to user space. But with this, the whole
"full-fledged sample in BPF output" is not the case anymore, because
we only take a couple of things(such as the time between the
schedules) from the raw data. This idea is more like an extension of
the sample data, I think it's interesting, will try to implement it.
Thanks,
Howard
>
> Thanks,
> Namhyung
>
> >
> > Changes in v3:
> >  - Add off-cpu-thresh argument
> >  - Process direct off-cpu samples in post
> >
> > Changes in v2:
> >  - Remove unnecessary comments.
> >  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> >
> > Before:
> > ```
> >      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> >             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> >             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> >             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> >      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> >      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> >
> > sshd  708098 [000] 18446744069.414584:     286392 offcpu-time:
> >           79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> >           585690935cca [unknown] (/usr/bin/sshd)
> > ```
> >
> > After:
> > ```
> >             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> >             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> >          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> >          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> >     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct:
> >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> >           7fff24e862d8 [unknown] ([unknown])
> >
> >
> >     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct:
> >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> >           7fff24e862d8 [unknown] ([unknown])
> >
> >
> >          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> >      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > ```
> >
> > Howard Chu (5):
> >   perf record off-cpu: Add direct off-cpu event
> >   perf record off-cpu: Dumping samples in BPF
> >   perf record off-cpu: processing of embedded sample
> >   perf record off-cpu: save embedded sample type
> >   perf record off-cpu: Add direct off-cpu test
> >
> >  tools/perf/builtin-record.c             |   2 +
> >  tools/perf/builtin-script.c             |   2 +-
> >  tools/perf/tests/builtin-test.c         |   1 +
> >  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
> >  tools/perf/tests/tests.h                |   1 +
> >  tools/perf/tests/workloads/Build        |   1 +
> >  tools/perf/tests/workloads/offcpu.c     |  16 +++
> >  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
> >  tools/perf/util/evsel.c                 |  16 ++-
> >  tools/perf/util/evsel.h                 |  13 +++
> >  tools/perf/util/header.c                |  12 ++
> >  tools/perf/util/off_cpu.h               |   1 +
> >  tools/perf/util/record.h                |   1 +
> >  tools/perf/util/session.c               |  23 +++-
> >  15 files changed, 309 insertions(+), 5 deletions(-)
> >  create mode 100644 tools/perf/tests/workloads/offcpu.c
> >
> > --
> > 2.45.2
> >
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 0/5] Dump off-cpu samples directly
  2024-07-29 15:24   ` Howard Chu
@ 2024-07-31 17:46     ` Namhyung Kim
  2024-07-31 18:23       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 16+ messages in thread
From: Namhyung Kim @ 2024-07-31 17:46 UTC (permalink / raw)
  To: Howard Chu
  Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
	linux-kernel
On Mon, Jul 29, 2024 at 11:24:47PM +0800, Howard Chu wrote:
> Hello Namhyung,
> 
> Thanks for reviewing this patch.
> 
> On Mon, Jul 29, 2024 at 9:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello Howard,
> >
> > On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > >
> > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > results in off-cpu samples being after the regular samples. This patch
> > > series makes possible dumping off-cpu samples on-the-fly, directly into
> > > perf ring buffer. And it dispatches those samples to the correct format
> > > for perf.data consumers.
> >
> > Thanks for your work!
> >
> > But I'm not sure we need a separate event for offcpu-time-direct.  If we
> > fix the format for the direct event, we can adjust the format of offcpu-
> > time when it dumps at the end.
> 
> Thank you and Ian for this advice, I'll do that.
> 
> >
> > Anyway, as far as I can see you don't need to fill the sample info in
> > the offcpu-time-direct manually in your BPF program.  Because the
> > bpf_perf_event_output() will call perf_event_output() which fills all
> > the sample information according to the sample_type flags.
> >
> > Well.. it'll set IP to the schedule function, but it should be ok.
> > (updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
> > not absoluately necessary, probably I can get rid of it..  Let's go with
> > simple for now.)
> >
> > So I think what you need is to ensure it has the uncessary flags.  And
> > the only info it needs to fill is the time between the previous schedule
> > and this can be added to the raw data.
> 
> Sure thing, thank you. Other than the off-cpu duration, do you think
> we should collect the callchain as well?
I think the kernel will do that for you once you set the
SAMPLE_CALLCHAIN flag in the event attr.
Thanks,
Namhyung
> 
> This idea is great because it minimizes the data we need to transfer
> from kernel space to user space. But with this, the whole
> "full-fledged sample in BPF output" is not the case anymore, because
> we only take a couple of things(such as the time between the
> schedules) from the raw data. This idea is more like an extension of
> the sample data, I think it's interesting, will try to implement it.
> 
> Thanks,
> Howard
> >
> > Thanks,
> > Namhyung
> >
> > >
> > > Changes in v3:
> > >  - Add off-cpu-thresh argument
> > >  - Process direct off-cpu samples in post
> > >
> > > Changes in v2:
> > >  - Remove unnecessary comments.
> > >  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> > >
> > > Before:
> > > ```
> > >      migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> > >             perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> > >             perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> > >             perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> > >      migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> > >      migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> > >
> > > sshd  708098 [000] 18446744069.414584:     286392 offcpu-time:
> > >           79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> > >           585690935cca [unknown] (/usr/bin/sshd)
> > > ```
> > >
> > > After:
> > > ```
> > >             perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> > >             perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> > >          swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > >          swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > >     blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time-direct:
> > >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > >           7fff24e862d8 [unknown] ([unknown])
> > >
> > >
> > >     blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time-direct:
> > >           7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > >           7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > >           7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > >           7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > >           7fff24e862d8 [unknown] ([unknown])
> > >
> > >
> > >          swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > >      dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > > ```
> > >
> > > Howard Chu (5):
> > >   perf record off-cpu: Add direct off-cpu event
> > >   perf record off-cpu: Dumping samples in BPF
> > >   perf record off-cpu: processing of embedded sample
> > >   perf record off-cpu: save embedded sample type
> > >   perf record off-cpu: Add direct off-cpu test
> > >
> > >  tools/perf/builtin-record.c             |   2 +
> > >  tools/perf/builtin-script.c             |   2 +-
> > >  tools/perf/tests/builtin-test.c         |   1 +
> > >  tools/perf/tests/shell/record_offcpu.sh |  29 +++++
> > >  tools/perf/tests/tests.h                |   1 +
> > >  tools/perf/tests/workloads/Build        |   1 +
> > >  tools/perf/tests/workloads/offcpu.c     |  16 +++
> > >  tools/perf/util/bpf_off_cpu.c           |  53 ++++++++-
> > >  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 143 ++++++++++++++++++++++++
> > >  tools/perf/util/evsel.c                 |  16 ++-
> > >  tools/perf/util/evsel.h                 |  13 +++
> > >  tools/perf/util/header.c                |  12 ++
> > >  tools/perf/util/off_cpu.h               |   1 +
> > >  tools/perf/util/record.h                |   1 +
> > >  tools/perf/util/session.c               |  23 +++-
> > >  15 files changed, 309 insertions(+), 5 deletions(-)
> > >  create mode 100644 tools/perf/tests/workloads/offcpu.c
> > >
> > > --
> > > 2.45.2
> > >
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH v3 0/5] Dump off-cpu samples directly
  2024-07-31 17:46     ` Namhyung Kim
@ 2024-07-31 18:23       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 16+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-07-31 18:23 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Howard Chu, Peter Zijlstra, Juri Lelli, irogers, adrian.hunter,
	jolsa, kan.liang, linux-perf-users, linux-kernel
On Wed, Jul 31, 2024 at 10:46:18AM -0700, Namhyung Kim wrote:
> On Mon, Jul 29, 2024 at 11:24:47PM +0800, Howard Chu wrote:
> > On Mon, Jul 29, 2024 at 9:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > > On Fri, Jul 26, 2024 at 06:28:21PM +0800, Howard Chu wrote:
> > > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > > results in off-cpu samples being after the regular samples. This patch
> > > > series makes possible dumping off-cpu samples on-the-fly, directly into
> > > > perf ring buffer. And it dispatches those samples to the correct format
> > > > for perf.data consumers.
> > > Thanks for your work!
> > > But I'm not sure we need a separate event for offcpu-time-direct.  If we
> > > fix the format for the direct event, we can adjust the format of offcpu-
> > > time when it dumps at the end.
> > Thank you and Ian for this advice, I'll do that.
> > > Anyway, as far as I can see you don't need to fill the sample info in
> > > the offcpu-time-direct manually in your BPF program.  Because the
> > > bpf_perf_event_output() will call perf_event_output() which fills all
> > > the sample information according to the sample_type flags.
> > > Well.. it'll set IP to the schedule function, but it should be ok.
> > > (updating IP using CALLCHAIN like in off_cpu_write() is a kinda hack and
> > > not absoluately necessary, probably I can get rid of it..  Let's go with
> > > simple for now.)
> > > So I think what you need is to ensure it has the uncessary flags.  And
> > > the only info it needs to fill is the time between the previous schedule
> > > and this can be added to the raw data.
I wonder if there wouldn't be other kernel information about things that
may have affected the time it took for the task to be off-cpu, maybe
system load average, C/P states, but then it would be overengineering, I
think, just thought about what else we could add that could help
understanding the off-cpu time that could be obtained from the vantage
point of BPF attached to sched-out and sched-in, things we could collect
at sched-out in addition to the timestamp and ditto at sched-in, but I'm
no scheduler expert, so take this just as some brainstorming. Maybe we
could have some sort of sample_type specific to this off-cpu event that
would allow us to ask for extra info in an extensible way. We can start
with just PERF_OFFCPU_SAMPLE_TIMESTAMP...
> > Sure thing, thank you. Other than the off-cpu duration, do you think
> > we should collect the callchain as well?
 
> I think the kernel will do that for you once you set the
> SAMPLE_CALLCHAIN flag in the event attr.
Yes, we should not reinvent the wheel for all things that can be asked
from the kernel perf subsystem, only using the raw-data payload on the
bpf-output event for things we can't get from the perf subsystem, and
that is the timestamp for a previous event stored in a BPF map, looking
if the delta to the current time (on a sched-in event) is over the
threshold and then recording this time on this specific "made-up on the
fly using BPF" event that appears on the ring buffer just like any other
"native" events such as tracepoints, hardware events, cache events, etc.
- Arnaldo
^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-07-31 18:23 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-26 10:28 [PATCH v3 0/5] Dump off-cpu samples directly Howard Chu
2024-07-26 10:28 ` [PATCH v3 1/5] perf record off-cpu: Add direct off-cpu event Howard Chu
2024-07-26 23:48   ` Ian Rogers
2024-07-29 13:36     ` Howard Chu
2024-07-26 10:28 ` [PATCH v3 2/5] perf record off-cpu: Dumping samples in BPF Howard Chu
2024-07-26 10:28 ` [PATCH v3 3/5] perf record off-cpu: processing of embedded sample Howard Chu
2024-07-26 10:28 ` [PATCH v3 4/5] perf record off-cpu: save embedded sample type Howard Chu
2024-07-27  0:49   ` Ian Rogers
2024-07-26 10:28 ` [PATCH v3 5/5] perf record off-cpu: Add direct off-cpu test Howard Chu
2024-07-27  0:54   ` Ian Rogers
2024-07-29 13:29     ` Howard Chu
2024-07-27  1:06 ` [PATCH v3 0/5] Dump off-cpu samples directly Ian Rogers
2024-07-29  1:21 ` Namhyung Kim
2024-07-29 15:24   ` Howard Chu
2024-07-31 17:46     ` Namhyung Kim
2024-07-31 18:23       ` Arnaldo Carvalho de Melo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).