* [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly
@ 2024-08-07 15:38 Howard Chu
2024-08-07 15:38 ` [PATCH v4 1/9] perf evsel: Set BPF output to system-wide Howard Chu
` (9 more replies)
0 siblings, 10 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Changes in v4:
- Minimize the size of data output by perf_event_output()
- Keep only one off-cpu event
- Change off-cpu threshold's unit to microseconds
- Set a default off-cpu threshold
- Print the correct error message for the field 'embed' in perf data header
Changes in v3:
- Add off-cpu-thresh argument
- Process direct off-cpu samples in post
Changes in v2:
- Remove unnecessary comments.
- Rename function off_cpu_change_type to off_cpu_prepare_parse
v1:
As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
Currently, off-cpu samples are dumped when perf record is exiting, which
results in off-cpu samples appearing after the regular samples. This
patch series makes it possible to dump off-cpu samples on the fly,
directly into the perf ring buffer, and converts those samples to the
format expected by perf.data consumers.
Before:
```
migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
585690935cca [unknown] (/usr/bin/sshd)
```
After:
```
perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
7fff24e862d8 [unknown] ([unknown])
blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
7fff24e862d8 [unknown] ([unknown])
swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
```
Howard Chu (9):
perf evsel: Set BPF output to system-wide
perf record --off-cpu: Add --off-cpu-thresh
perf record --off-cpu: Parse offcpu-time event
perf record off-cpu: Dump direct off-cpu samples in BPF
perf record --off-cpu: Dump total off-cpu time at the end.
perf evsel: Delete unnecessary = 0
perf record --off-cpu: Parse BPF output embedded data
perf header: Add field 'embed'
perf test: Add direct off-cpu dumping test
tools/perf/builtin-record.c | 26 ++++
tools/perf/builtin-script.c | 5 +-
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/record_offcpu.sh | 27 ++++
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 1 +
tools/perf/tests/workloads/offcpu.c | 16 +++
tools/perf/util/bpf_off_cpu.c | 176 +++++++++++++++---------
tools/perf/util/bpf_skel/off_cpu.bpf.c | 135 ++++++++++++++++++
tools/perf/util/evsel.c | 49 ++++---
tools/perf/util/evsel.h | 13 ++
tools/perf/util/header.c | 12 ++
tools/perf/util/off_cpu.h | 10 +-
tools/perf/util/record.h | 1 +
tools/perf/util/session.c | 16 ++-
15 files changed, 401 insertions(+), 88 deletions(-)
create mode 100644 tools/perf/tests/workloads/offcpu.c
--
2.45.2
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v4 1/9] perf evsel: Set BPF output to system-wide
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 23:21 ` Namhyung Kim
2024-08-07 15:38 ` [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
` (8 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Set pid = -1 for the bpf-output event. This makes
perf record -p <PID> --off-cpu work; otherwise the bpf-output event
cannot be collected.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/evsel.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index bc603193c477..b961467133cf 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2282,6 +2282,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
test_attr__ready();
+ /* BPF output event can only be system-wide */
+ if (evsel__is_bpf_output(evsel))
+ pid = -1;
+
/* Debug message used by test scripts */
pr_debug2_peo("sys_perf_event_open: pid %d cpu %d group_fd %d flags %#lx",
pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
--
2.45.2
* [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2024-08-07 15:38 ` [PATCH v4 1/9] perf evsel: Set BPF output to system-wide Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 23:22 ` Namhyung Kim
2024-08-07 15:38 ` [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event Howard Chu
` (7 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Add the --off-cpu-thresh argument to specify the off-cpu time threshold.
If the off-cpu time exceeds this threshold, dump the off-cpu data
directly.
Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/builtin-record.c | 26 ++++++++++++++++++++++++++
tools/perf/util/bpf_off_cpu.c | 2 ++
tools/perf/util/bpf_skel/off_cpu.bpf.c | 2 ++
tools/perf/util/record.h | 1 +
4 files changed, 31 insertions(+)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 72345d1e54b0..60c6fe7b4804 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3133,6 +3133,28 @@ static int record__parse_mmap_pages(const struct option *opt,
return ret;
}
+static int record__parse_off_cpu_thresh(const struct option *opt,
+ const char *str,
+ int unset __maybe_unused)
+{
+ struct record_opts *opts = opt->value;
+ char *endptr;
+ u64 off_cpu_thresh;
+
+ if (!str)
+ return -EINVAL;
+
+ off_cpu_thresh = strtoull(str, &endptr, 10);
+
+ /* the input isn't the string "0", yet strtoull() returned 0: parsing failed */
+ if (*endptr || (off_cpu_thresh == 0 && strcmp(str, "0")))
+ return -EINVAL;
+ else
+ opts->off_cpu_thresh = off_cpu_thresh;
+
+ return 0;
+}
+
void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
{
}
@@ -3326,6 +3348,7 @@ static struct record record = {
.ctl_fd = -1,
.ctl_fd_ack = -1,
.synth = PERF_SYNTH_ALL,
+ .off_cpu_thresh = OFF_CPU_THRESH_DEFAULT,
},
.tool = {
.sample = process_sample_event,
@@ -3560,6 +3583,9 @@ static struct option __record_options[] = {
OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
"BPF filter action"),
+ OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "us",
+ "Dump off-cpu samples if off-cpu time reaches this threshold. The unit is microseconds. (default: 500000)",
+ record__parse_off_cpu_thresh),
OPT_END()
};
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 6af36142dc5a..1e0e454bfb5e 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -272,6 +272,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
}
}
+ skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000ull;
+
err = off_cpu_bpf__attach(skel);
if (err) {
pr_err("Failed to attach off-cpu BPF skeleton\n");
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index d877a0a9731f..cca1b6990a57 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -96,6 +96,8 @@ const volatile bool uses_cgroup_v1 = false;
int perf_subsys_id = -1;
+__u64 sample_id, sample_type, offcpu_thresh;
+
/*
* Old kernel used to call it task_struct->state and now it's '__state'.
* Use BPF CO-RE "ignored suffix rule" to deal with it like below:
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..3c11416e6627 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -79,6 +79,7 @@ struct record_opts {
int synth;
int threads_spec;
const char *threads_user_spec;
+ u64 off_cpu_thresh;
};
extern const char * const *record_usage;
--
2.45.2
* [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2024-08-07 15:38 ` [PATCH v4 1/9] perf evsel: Set BPF output to system-wide Howard Chu
2024-08-07 15:38 ` [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 23:25 ` Namhyung Kim
2024-08-07 15:38 ` [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
` (6 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Parse the offcpu-time event using parse_event(). In off_cpu_start(),
write the evlist fds obtained from evlist__open() into the
perf_event_array BPF map.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/bpf_off_cpu.c | 55 ++++++++++++++++++++---------------
tools/perf/util/evsel.c | 2 +-
2 files changed, 32 insertions(+), 25 deletions(-)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 1e0e454bfb5e..fae0bb8aaa13 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -13,6 +13,7 @@
#include "util/cgroup.h"
#include "util/strlist.h"
#include <bpf/bpf.h>
+#include <internal/xyarray.h>
#include "bpf_skel/off_cpu.skel.h"
@@ -38,39 +39,24 @@ union off_cpu_data {
static int off_cpu_config(struct evlist *evlist)
{
- struct evsel *evsel;
- struct perf_event_attr attr = {
- .type = PERF_TYPE_SOFTWARE,
- .config = PERF_COUNT_SW_BPF_OUTPUT,
- .size = sizeof(attr), /* to capture ABI version */
- };
- char *evname = strdup(OFFCPU_EVENT);
+ char off_cpu_event[64];
- if (evname == NULL)
- return -ENOMEM;
-
- evsel = evsel__new(&attr);
- if (!evsel) {
- free(evname);
- return -ENOMEM;
+ /* after parsing off-cpu event, we'll specify its sample_type in evsel__config() */
+ scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT);
+ if (parse_event(evlist, off_cpu_event)) {
+ pr_err("Failed to open off-cpu event\n");
+ return -1;
}
- evsel->core.attr.freq = 1;
- evsel->core.attr.sample_period = 1;
- /* off-cpu analysis depends on stack trace */
- evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
-
- evlist__add(evlist, evsel);
-
- free(evsel->name);
- evsel->name = evname;
-
return 0;
}
static void off_cpu_start(void *arg)
{
struct evlist *evlist = arg;
+ struct evsel *evsel;
+ struct perf_cpu pcpu;
+ int i, err;
/* update task filter for the given workload */
if (!skel->bss->has_cpu && !skel->bss->has_task &&
@@ -86,6 +72,27 @@ static void off_cpu_start(void *arg)
bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
}
+ /* sample id and fds in BPF's perf_event_array can only be set after record__open() */
+ evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+ if (evsel == NULL) {
+ pr_err("%s evsel not found\n", OFFCPU_EVENT);
+ return;
+ }
+
+ if (evsel->core.id)
+ skel->bss->sample_id = evsel->core.id[0];
+
+ perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
+ err = bpf_map__update_elem(skel->maps.offcpu_output,
+ &pcpu.cpu, sizeof(__u32),
+ xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
+ sizeof(__u32), BPF_ANY);
+ if (err) {
+ pr_err("Failed to update perf event map for direct off-cpu dumping\n");
+ return;
+ }
+ }
+
skel->bss->enabled = 1;
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b961467133cf..ccd3bda02b5d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1379,7 +1379,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
evsel__reset_sample_bit(evsel, BRANCH_STACK);
if (evsel__is_offcpu_event(evsel))
- evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
+ evsel->core.attr.sample_type = OFFCPU_SAMPLE_TYPES;
arch__post_evsel_config(evsel, attr);
}
--
2.45.2
* [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (2 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 23:49 ` Namhyung Kim
2024-08-07 15:38 ` [PATCH v4 5/9] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
` (5 subsequent siblings)
9 siblings, 1 reply; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Output PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD, PERF_SAMPLE_CALLCHAIN, and
PERF_SAMPLE_CGROUP in BPF. Ideally, BPF should only have to output
PERF_SAMPLE_PERIOD (the off-cpu time) and PERF_SAMPLE_CALLCHAIN (the
sched_in process's callchain); for the rest, one would just set
PERF_SAMPLE_TID and PERF_SAMPLE_CGROUP and let perf_event fill them in
for us.
But in reality, that's not the case: setting PERF_SAMPLE_TID mostly
yields a TID of 0. The offcpu-time event occasionally carries the
correct TID, but only rarely.
swapper 0 [000] offcpu-time: /
:1321819 1321819 [002] offcpu-time: /user.slice/user-1000.slice/session-2.scope
swapper 0 [001] offcpu-time: /
swapper 0 [003] offcpu-time: /
And setting PERF_SAMPLE_CGROUP doesn't work properly either.
tmux: server 3701 [003] offcpu-time: /
blueman-tray 1064 [001] offcpu-time: /
bash 1350867 [001] offcpu-time: /
bash 1350844 [000] offcpu-time: /
So retrieve PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD, PERF_SAMPLE_CALLCHAIN,
and PERF_SAMPLE_CGROUP in BPF and output all four fields from there.
Add a perf_event_array map for dumping direct off-cpu samples, while
keeping the at-the-end approach.
The extensive bounds checking before each access is there to satisfy
the BPF verifier.
If the off-cpu time (represented as delta) exceeds the off-cpu
threshold, output the sample directly.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/bpf_skel/off_cpu.bpf.c | 133 +++++++++++++++++++++++++
1 file changed, 133 insertions(+)
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index cca1b6990a57..95cc130bb67e 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -18,6 +18,9 @@
#define MAX_STACKS 32
#define MAX_ENTRIES 102400
+#define MAX_CPUS 4096
+#define MAX_OFFCPU_LEN 128
+
struct tstamp_data {
__u32 stack_id;
__u32 state;
@@ -32,6 +35,7 @@ struct offcpu_key {
__u64 cgroup_id;
};
+/* for dumping at the end */
struct {
__uint(type, BPF_MAP_TYPE_STACK_TRACE);
__uint(key_size, sizeof(__u32));
@@ -39,6 +43,45 @@ struct {
__uint(max_entries, MAX_ENTRIES);
} stacks SEC(".maps");
+struct offcpu_data {
+ u64 array[MAX_OFFCPU_LEN];
+};
+
+struct stack_data {
+ u64 array[MAX_STACKS];
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(__u32));
+ __uint(max_entries, MAX_CPUS);
+} offcpu_output SEC(".maps");
+
+/* temporary offcpu sample */
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(struct offcpu_data));
+ __uint(max_entries, 1);
+} offcpu_payload SEC(".maps");
+
+/* temporary stack data */
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(struct stack_data));
+ __uint(max_entries, 1);
+} stack_tmp SEC(".maps");
+
+/* cached stack per task storage */
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct stack_data);
+} stack_cache SEC(".maps");
+
struct {
__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
@@ -184,12 +227,75 @@ static inline int can_record(struct task_struct *t, int state)
return 1;
}
+static inline bool check_bounds(int index)
+{
+ if (index >= 0 && index < MAX_OFFCPU_LEN)
+ return true;
+
+ return false;
+}
+
+static inline int copy_stack(struct stack_data *from,
+ struct offcpu_data *to, int n)
+{
+ int max_stacks = MAX_STACKS, len = 0;
+
+ if (!from)
+ return len;
+
+ for (int i = 0; i < max_stacks && from->array[i]; ++i) {
+ if (check_bounds(n + 2 + i)) {
+ to->array[n + 2 + i] = from->array[i];
+ ++len;
+ }
+ }
+ return len;
+}
+
+static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
+ struct stack_data *stack_p, __u64 delta, __u64 timestamp)
+{
+ int size, n = 0, ip_pos = -1, len = 0;
+
+ if (sample_type & PERF_SAMPLE_TID && check_bounds(n))
+ data->array[n++] = (u64)key->tgid << 32 | key->pid;
+ if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
+ data->array[n++] = delta;
+ if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
+ /* data->array[n] is callchain->nr (updated later) */
+ data->array[n + 1] = PERF_CONTEXT_USER;
+ data->array[n + 2] = 0;
+
+ len = copy_stack(stack_p, data, n);
+
+ /* update length of callchain */
+ data->array[n] = len + 1;
+
+ /* update sample ip with the first callchain entry */
+ if (ip_pos >= 0)
+ data->array[ip_pos] = data->array[n + 2];
+
+ /* calculate sample callchain data->array length */
+ n += len + 2;
+ }
+ if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
+ data->array[n++] = key->cgroup_id;
+
+ size = n * sizeof(u64);
+ if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
+ bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, size);
+
+ return 0;
+}
+
static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
struct task_struct *next, int state)
{
__u64 ts;
__u32 stack_id;
struct tstamp_data *pelem;
+ struct stack_data *stack_tmp_p, *stack_p;
+ int zero = 0, len = 0;
ts = bpf_ktime_get_ns();
@@ -199,6 +305,25 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
stack_id = bpf_get_stackid(ctx, &stacks,
BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
+ /* cache stacks */
+ stack_tmp_p = bpf_map_lookup_elem(&stack_tmp, &zero);
+ if (stack_tmp_p)
+ len = bpf_get_stack(ctx, stack_tmp_p->array, MAX_STACKS * sizeof(u64),
+ BPF_F_USER_STACK) / sizeof(u64);
+
+ /*
+ * if stacks are successfully collected, cache them to task_storage, they are then
+ * dumped if the off-cpu time hits the threshold.
+ */
+ if (len > 0) {
+ stack_p = bpf_task_storage_get(&stack_cache, prev, NULL,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (stack_p) {
+ for (int i = 0; i < len && i < MAX_STACKS; ++i)
+ stack_p->array[i] = stack_tmp_p->array[i];
+ }
+ }
+
pelem = bpf_task_storage_get(&tstamp, prev, NULL,
BPF_LOCAL_STORAGE_GET_F_CREATE);
if (!pelem)
@@ -228,6 +353,14 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
else
bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+ if (delta >= offcpu_thresh) {
+ struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
+
+ stack_p = bpf_task_storage_get(&stack_cache, next, NULL, 0);
+ if (data && stack_p)
+ off_cpu_dump(ctx, data, &key, stack_p, delta, pelem->timestamp);
+ }
+
/* prevent to reuse the timestamp later */
pelem->timestamp = 0;
}
--
2.45.2
* [PATCH v4 5/9] perf record --off-cpu: Dump total off-cpu time at the end.
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (3 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 6/9] perf evsel: Delete unnecessary = 0 Howard Chu
` (4 subsequent siblings)
9 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
By setting placeholder sample fields and then writing the real data
into raw_data, we mimic the direct-sample method when writing data at
the end. Note that some fields serve only as placeholders and are later
overwritten by the data in raw_data. Additionally, since the IP is
updated in evsel__parse_sample(), there is no need to handle it in
off_cpu_write().
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/bpf_off_cpu.c | 119 +++++++++++++++++++++-------------
tools/perf/util/off_cpu.h | 10 ++-
2 files changed, 84 insertions(+), 45 deletions(-)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index fae0bb8aaa13..24104d703242 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -37,6 +37,8 @@ union off_cpu_data {
u64 array[1024 / sizeof(u64)];
};
+u64 off_cpu_raw_data[1024 / sizeof(u64)];
+
static int off_cpu_config(struct evlist *evlist)
{
char off_cpu_event[64];
@@ -139,12 +141,21 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
int ncpus = 1, ntasks = 1, ncgrps = 1;
struct strlist *pid_slist = NULL;
struct str_node *pos;
+ struct evsel *evsel;
if (off_cpu_config(evlist) < 0) {
pr_err("Failed to config off-cpu BPF event\n");
return -1;
}
+ evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+ if (evsel == NULL) {
+ pr_err("%s evsel not found\n", OFFCPU_EVENT);
+ return -1;
+ }
+
+ evsel->sample_type_embed = OFFCPU_EMBEDDED_SAMPLE_TYPES;
+
skel = off_cpu_bpf__open();
if (!skel) {
pr_err("Failed to open off-cpu BPF skeleton\n");
@@ -257,7 +268,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
}
if (evlist__first(evlist)->cgrp) {
- struct evsel *evsel;
u8 val = 1;
skel->bss->has_cgroup = 1;
@@ -279,6 +289,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
}
}
+ skel->bss->sample_type = OFFCPU_EMBEDDED_SAMPLE_TYPES;
skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000ull;
err = off_cpu_bpf__attach(skel);
@@ -304,7 +315,8 @@ int off_cpu_write(struct perf_session *session)
{
int bytes = 0, size;
int fd, stack;
- u64 sample_type, val, sid = 0;
+ u32 raw_size;
+ u64 sample_type_off_cpu, sample_type_bpf_output, val, sid = 0, tstamp = OFF_CPU_TIMESTAMP;
struct evsel *evsel;
struct perf_data_file *file = &session->data->file;
struct off_cpu_key prev, key;
@@ -314,7 +326,6 @@ int off_cpu_write(struct perf_session *session)
.misc = PERF_RECORD_MISC_USER,
},
};
- u64 tstamp = OFF_CPU_TIMESTAMP;
skel->bss->enabled = 0;
@@ -324,15 +335,10 @@ int off_cpu_write(struct perf_session *session)
return 0;
}
- sample_type = evsel->core.attr.sample_type;
+ sample_type_off_cpu = OFFCPU_EMBEDDED_SAMPLE_TYPES;
+ sample_type_bpf_output = evsel->core.attr.sample_type;
- if (sample_type & ~OFFCPU_SAMPLE_TYPES) {
- pr_err("not supported sample type: %llx\n",
- (unsigned long long)sample_type);
- return -1;
- }
-
- if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
+ if (sample_type_bpf_output & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
if (evsel->core.id)
sid = evsel->core.id[0];
}
@@ -343,49 +349,74 @@ int off_cpu_write(struct perf_session *session)
while (!bpf_map_get_next_key(fd, &prev, &key)) {
int n = 1; /* start from perf_event_header */
- int ip_pos = -1;
+ int i = 0; /* raw data index */
bpf_map_lookup_elem(fd, &key, &val);
- if (sample_type & PERF_SAMPLE_IDENTIFIER)
- data.array[n++] = sid;
- if (sample_type & PERF_SAMPLE_IP) {
- ip_pos = n;
- data.array[n++] = 0; /* will be updated */
- }
- if (sample_type & PERF_SAMPLE_TID)
- data.array[n++] = (u64)key.pid << 32 | key.tgid;
- if (sample_type & PERF_SAMPLE_TIME)
- data.array[n++] = tstamp;
- if (sample_type & PERF_SAMPLE_ID)
+ /* bpf output data */
+ if (sample_type_bpf_output & PERF_SAMPLE_IDENTIFIER)
data.array[n++] = sid;
- if (sample_type & PERF_SAMPLE_CPU)
+ /* fill in an empty value, it will be overwritten by embedded data */
+ if (sample_type_bpf_output & PERF_SAMPLE_IP)
data.array[n++] = 0;
- if (sample_type & PERF_SAMPLE_PERIOD)
- data.array[n++] = val;
- if (sample_type & PERF_SAMPLE_CALLCHAIN) {
- int len = 0;
-
- /* data.array[n] is callchain->nr (updated later) */
- data.array[n + 1] = PERF_CONTEXT_USER;
- data.array[n + 2] = 0;
+ if (sample_type_bpf_output & PERF_SAMPLE_TID)
+ data.array[n++] = 0;
+ if (sample_type_bpf_output & PERF_SAMPLE_TIME)
+ data.array[n++] = tstamp; /* we won't overwrite time */
+ if (sample_type_bpf_output & PERF_SAMPLE_CPU)
+ data.array[n++] = 0;
+ if (sample_type_bpf_output & PERF_SAMPLE_PERIOD)
+ data.array[n++] = 0;
+ if (sample_type_bpf_output & PERF_SAMPLE_RAW) {
+ /*
+ * the format of raw data is as follows:
+ *
+ * [ size ][ data ]
+ * [ data ]
+ * [ data ]
+ * [ data ]
+ * [ data ][ empty]
+ *
+ * because embedded data is in BPF output, perf script -F without bpf-output
+ * field will not work properly
+ *
+ */
+ if (sample_type_off_cpu & PERF_SAMPLE_TID)
+ off_cpu_raw_data[i++] = (u64)key.pid << 32 | key.tgid;
+ if (sample_type_off_cpu & PERF_SAMPLE_PERIOD)
+ off_cpu_raw_data[i++] = val;
+ if (sample_type_off_cpu & PERF_SAMPLE_CALLCHAIN) {
+ int len = 0;
+
+ /* off_cpu_raw_data[n] is callchain->nr (updated later) */
+ off_cpu_raw_data[i + 1] = PERF_CONTEXT_USER;
+ off_cpu_raw_data[i + 2] = 0;
+
+ bpf_map_lookup_elem(stack, &key.stack_id, &off_cpu_raw_data[i + 2]);
+ while (off_cpu_raw_data[i + 2 + len])
+ len++;
+
+ /* update length of callchain */
+ off_cpu_raw_data[i] = len + 1;
+
+ /* calculate sample callchain off_cpu_raw_data length */
+ i += len + 2;
+ }
+ if (sample_type_off_cpu & PERF_SAMPLE_CGROUP)
+ off_cpu_raw_data[i++] = key.cgroup_id;
- bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
- while (data.array[n + 2 + len])
- len++;
+ raw_size = i * sizeof(u64) + sizeof(u32); /* 4 empty bytes for alignment */
- /* update length of callchain */
- data.array[n] = len + 1;
+ /* raw_size */
+ memcpy((void *)data.array + n * sizeof(u64), &raw_size, sizeof(raw_size));
- /* update sample ip with the first callchain entry */
- if (ip_pos >= 0)
- data.array[ip_pos] = data.array[n + 2];
+ /* raw_data */
+ memcpy((void *)data.array + n * sizeof(u64) + sizeof(u32), off_cpu_raw_data, i * sizeof(u64));
- /* calculate sample callchain data array length */
- n += len + 2;
+ n += i + 1;
}
- if (sample_type & PERF_SAMPLE_CGROUP)
- data.array[n++] = key.cgroup_id;
+ if (sample_type_bpf_output & PERF_SAMPLE_CGROUP)
+ data.array[n++] = 0;
size = n * sizeof(u64);
data.hdr.size = size;
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2dd67c60f211..eaf7be92472d 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -10,12 +10,20 @@ struct record_opts;
#define OFFCPU_EVENT "offcpu-time"
+#define OFF_CPU_THRESH_DEFAULT 500000ull
+
#define OFFCPU_SAMPLE_TYPES (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
- PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
+ PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
PERF_SAMPLE_CGROUP)
+/*
+ * for embedded data to overwrite the original sample, duplicated sample types
+ * must be set in the original OFFCPU_SAMPLE_TYPES, except for callchain.
+ */
+#define OFFCPU_EMBEDDED_SAMPLE_TYPES (PERF_SAMPLE_TID | PERF_SAMPLE_PERIOD | \
+ PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_CGROUP)
#ifdef HAVE_BPF_SKEL
int off_cpu_prepare(struct evlist *evlist, struct target *target,
--
2.45.2
* [PATCH v4 6/9] perf evsel: Delete unnecessary = 0
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (4 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 5/9] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 7/9] perf record --off-cpu: Parse BPF output embedded data Howard Chu
` (3 subsequent siblings)
9 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Delete unnecessary zero-initializations: the fields are already zeroed
at the top of evsel__parse_sample(). If they are not removed, it
becomes troublesome to overwrite the sample with data from raw_data.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/evsel.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index ccd3bda02b5d..c1aac09ad0a5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2611,8 +2611,6 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
return perf_evsel__parse_id_sample(evsel, event, data);
}
- array = event->sample.array;
-
if (perf_event__check_size(event, evsel->sample_size))
return -EFAULT;
@@ -2879,7 +2877,6 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
array++;
}
- data->intr_regs.abi = PERF_SAMPLE_REGS_ABI_NONE;
if (type & PERF_SAMPLE_REGS_INTR) {
OVERFLOW_CHECK_u64(array);
data->intr_regs.abi = *array;
@@ -2896,25 +2893,21 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
}
}
- data->phys_addr = 0;
if (type & PERF_SAMPLE_PHYS_ADDR) {
data->phys_addr = *array;
array++;
}
- data->cgroup = 0;
if (type & PERF_SAMPLE_CGROUP) {
data->cgroup = *array;
array++;
}
- data->data_page_size = 0;
if (type & PERF_SAMPLE_DATA_PAGE_SIZE) {
data->data_page_size = *array;
array++;
}
- data->code_page_size = 0;
if (type & PERF_SAMPLE_CODE_PAGE_SIZE) {
data->code_page_size = *array;
array++;
--
2.45.2
* [PATCH v4 7/9] perf record --off-cpu: Parse BPF output embedded data
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (5 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 6/9] perf evsel: Delete unnecessary = 0 Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 8/9] perf header: Add field 'embed' Howard Chu
` (2 subsequent siblings)
9 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Add a sample_type_embed member to the struct evsel, along with a couple
of helper functions.
In session.c, we parse BPF output embedded samples in a two-step
process.
Initial Parsing: Treat the sample as a regular BPF-output event.
Secondary Parsing: Extract data from raw_data and parse it according to
the sample_type_embed specification. Since the second step relies on the
raw_data obtained in the first step, we must avoid zero-initializing the
sample data after the first step.
Suggested-by: Ian Rogers <irogers@google.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/builtin-script.c | 5 +++--
tools/perf/util/evsel.c | 36 ++++++++++++++++++++++++++++--------
tools/perf/util/evsel.h | 13 +++++++++++++
tools/perf/util/session.c | 16 +++++++++++++++-
4 files changed, 59 insertions(+), 11 deletions(-)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index c16224b1fef3..4145e61b8d38 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -653,7 +653,8 @@ static int perf_session__check_output_opt(struct perf_session *session)
evlist__for_each_entry(session->evlist, evsel) {
not_pipe = true;
- if (evsel__has_callchain(evsel)) {
+ if (evsel__has_callchain(evsel) ||
+ evsel__embed_has_callchain(evsel)) {
use_callchain = true;
break;
}
@@ -2295,7 +2296,7 @@ static void process_event(struct perf_script *script,
else if (PRINT_FIELD(BRSTACKOFF))
perf_sample__fprintf_brstackoff(sample, thread, attr, fp);
- if (evsel__is_bpf_output(evsel) && PRINT_FIELD(BPF_OUTPUT))
+ if (evsel__is_bpf_output(evsel) && !evsel__has_embed(evsel) && PRINT_FIELD(BPF_OUTPUT))
perf_sample__fprintf_bpf_output(sample, fp);
perf_sample__fprintf_insn(sample, attr, thread, machine, fp, al);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c1aac09ad0a5..a5a5b3ff16e1 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -298,6 +298,7 @@ void evsel__init(struct evsel *evsel,
evsel->pmu_name = NULL;
evsel->group_pmu_name = NULL;
evsel->skippable = false;
+ evsel->sample_type_embed = 0;
}
struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
@@ -440,6 +441,7 @@ struct evsel *evsel__clone(struct evsel *orig)
evsel->weak_group = orig->weak_group;
evsel->use_config_name = orig->use_config_name;
evsel->pmu = orig->pmu;
+ evsel->sample_type_embed = orig->sample_type_embed;
if (evsel__copy_config_terms(evsel, orig) < 0)
goto out_err;
@@ -2589,6 +2591,7 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
u16 max_size = event->header.size;
const void *endp = (void *)event + max_size;
u64 sz;
+ bool ip_in_callchain = false;
/*
* used for cross-endian analysis. See git commit 65014ab3
@@ -2596,14 +2599,27 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
*/
union u64_swap u;
- memset(data, 0, sizeof(*data));
- data->cpu = data->pid = data->tid = -1;
- data->stream_id = data->id = data->time = -1ULL;
- data->period = evsel->core.attr.sample_period;
- data->cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
- data->misc = event->header.misc;
- data->data_src = PERF_MEM_DATA_SRC_NONE;
- data->vcpu = -1;
+ /*
+ * For sample data embedded in BPF output, don't clear the sample we read in the first pass,
+ * and read the embedded data from raw_data in the second pass.
+ */
+ if (evsel__has_embed(evsel) && evsel__is_bpf_output(evsel) && data->raw_data) {
+ type = evsel->sample_type_embed;
+ array = data->raw_data;
+ /* if callchain is embedded in BPF output, so is ip */
+ if (type & PERF_SAMPLE_CALLCHAIN)
+ ip_in_callchain = true;
+ } else { /* for normal samples, clear to zero before reading */
+ array = event->sample.array;
+ memset(data, 0, sizeof(*data));
+ data->cpu = data->pid = data->tid = -1;
+ data->stream_id = data->id = data->time = -1ULL;
+ data->period = evsel->core.attr.sample_period;
+ data->cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ data->misc = event->header.misc;
+ data->data_src = PERF_MEM_DATA_SRC_NONE;
+ data->vcpu = -1;
+ }
if (event->header.type != PERF_RECORD_SAMPLE) {
if (!evsel->core.attr.sample_id_all)
@@ -2732,6 +2748,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
data->callchain = (struct ip_callchain *)array++;
if (data->callchain->nr > max_callchain_nr)
return -EFAULT;
+
+ if (ip_in_callchain && data->callchain->nr > 1)
+ data->ip = data->callchain->ips[1];
+
sz = data->callchain->nr * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 80b5f6dd868e..0d25e82c6154 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -184,6 +184,9 @@ struct evsel {
};
/* Is the tool's fd for /proc/pid/stat or /proc/stat. */
bool pid_stat;
+
+ /* for samples embedded in BPF output */
+ __u64 sample_type_embed;
};
struct perf_missing_features {
@@ -469,6 +472,11 @@ static inline bool evsel__is_bpf_output(struct evsel *evsel)
return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT);
}
+static inline bool evsel__has_embed(struct evsel *evsel)
+{
+ return evsel->sample_type_embed != 0;
+}
+
static inline bool evsel__is_clock(const struct evsel *evsel)
{
return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK) ||
@@ -525,6 +533,11 @@ static inline bool evsel__has_callchain(const struct evsel *evsel)
evsel->synth_sample_type & PERF_SAMPLE_CALLCHAIN;
}
+static inline bool evsel__embed_has_callchain(const struct evsel *evsel)
+{
+ return evsel->sample_type_embed & PERF_SAMPLE_CALLCHAIN;
+}
+
static inline bool evsel__has_br_stack(const struct evsel *evsel)
{
/*
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 5596bed1b8c8..18aa5e0263b7 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1529,6 +1529,16 @@ static int evlist__deliver_sample(struct evlist *evlist, struct perf_tool *tool,
u64 sample_type = evsel->core.attr.sample_type;
u64 read_format = evsel->core.attr.read_format;
+ /* parse sample the second time to get embedded data from raw_data */
+ if (evsel__is_bpf_output(evsel) && evsel__has_embed(evsel) && sample->raw_data) {
+ int err = evsel__parse_sample(evsel, event, sample);
+
+ if (err) {
+ pr_err("failed to parse BPF output embedded data, err = %d\n", err);
+ return err;
+ }
+ }
+
/* Standard sample delivery. */
if (!(sample_type & PERF_SAMPLE_READ))
return tool->sample(tool, event, sample, evsel, machine);
@@ -1639,8 +1649,12 @@ static int perf_session__deliver_event(struct perf_session *session,
const char *file_path)
{
struct perf_sample sample;
- int ret = evlist__parse_sample(session->evlist, event, &sample);
+ int ret;
+
+ /* avoid reading embedded data from raw_data */
+ sample.raw_data = NULL;
+ ret = evlist__parse_sample(session->evlist, event, &sample);
if (ret) {
pr_err("Can't parse sample, err = %d\n", ret);
return ret;
--
2.45.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 8/9] perf header: Add field 'embed'
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (6 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 7/9] perf record --off-cpu: Parse BPF output embedded data Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-08-07 23:52 ` Namhyung Kim
2024-08-07 15:38 ` [PATCH v4 9/9] perf test: Add direct off-cpu dumping test Howard Chu
2024-09-25 3:04 ` [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Ian Rogers
9 siblings, 1 reply; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
We have to save the embedded data's sample type for it to be consumed
correctly by perf script or perf report.
This approach will most definitely break some perf.data converters.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/header.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 55e9553861d0..d60e77d5c25c 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -80,6 +80,7 @@ const char perf_version_string[] = PERF_VERSION;
struct perf_file_attr {
struct perf_event_attr attr;
+ __u64 embed;
struct perf_file_section ids;
};
@@ -3713,6 +3714,7 @@ static int perf_session__do_write_header(struct perf_session *session,
}
f_attr = (struct perf_file_attr){
.attr = evsel->core.attr,
+ .embed = evsel->sample_type_embed,
.ids = {
.offset = evsel->id_offset,
.size = evsel->core.ids * sizeof(u64),
@@ -4147,6 +4149,14 @@ static int read_attr(int fd, struct perf_header *ph,
ret = readn(fd, ptr, left);
}
+
+ ret = readn(fd, &f_attr->embed, sizeof(f_attr->embed));
+ if (ret <= 0) {
+ pr_debug("failed to read %d bytes of embedded sample type\n",
+ (int)sizeof(f_attr->embed));
+ return -1;
+ }
+
/* read perf_file_section, ids are read in caller */
ret = readn(fd, &f_attr->ids, sizeof(f_attr->ids));
@@ -4272,6 +4282,8 @@ int perf_session__read_header(struct perf_session *session, int repipe_fd)
tmp = lseek(fd, 0, SEEK_CUR);
evsel = evsel__new(&f_attr.attr);
+ evsel->sample_type_embed = f_attr.embed;
+
if (evsel == NULL)
goto out_delete_evlist;
--
2.45.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 9/9] perf test: Add direct off-cpu dumping test
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (7 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 8/9] perf header: Add field 'embed' Howard Chu
@ 2024-08-07 15:38 ` Howard Chu
2024-09-25 3:04 ` [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Ian Rogers
9 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-07 15:38 UTC (permalink / raw)
To: namhyung
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Add a simple workload (offcpu.c) to create the scenario for direct
off-cpu dumping.
Please run this test with 'perf test offcpu'.
Suggested-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/record_offcpu.sh | 27 +++++++++++++++++++++++++
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 1 +
tools/perf/tests/workloads/offcpu.c | 16 +++++++++++++++
5 files changed, 46 insertions(+)
create mode 100644 tools/perf/tests/workloads/offcpu.c
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 470a9709427d..aa33beaf58c8 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -153,6 +153,7 @@ static struct test_workload *workloads[] = {
&workload__brstack,
&workload__datasym,
&workload__landlock,
+ &workload__offcpu,
};
static int num_subtests(const struct test_suite *t)
diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
index 67c925f3a15a..6c26f541a09a 100755
--- a/tools/perf/tests/shell/record_offcpu.sh
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -6,6 +6,8 @@ set -e
err=0
perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
+TEST_PROGRAM="perf test -w offcpu"
+dummy_timestamp=18446744069414584
cleanup() {
rm -f ${perfdata}
@@ -88,6 +90,27 @@ test_offcpu_child() {
echo "Child task off-cpu test [Success]"
}
+test_offcpu_direct() {
+ echo "Direct off-cpu test"
+
+ # dump off-cpu samples for task blocked for more than 1.999999s
+ # -D for initial delay, to enable evlist
+ if ! perf record -e dummy -D 500 --off-cpu --off-cpu-thresh 1999999 -o ${perfdata} ${TEST_PROGRAM} 2> /dev/null
+ then
+ echo "Direct off-cpu test [Failed record]"
+ err=1
+ return
+ fi
+ # Direct sample's timestamp should be lower than the dummy_timestamp of the at-the-end sample.
+ if ! perf script -i ${perfdata} -F time,period | sed "s/[\.:]//g" | \
+ awk "{ if (\$1 < ${dummy_timestamp} && \$2 > 1999999999) exit 0; else exit 1; }"
+ then
+ echo "Direct off-cpu test [Failed missing direct sample]"
+ err=1
+ return
+ fi
+ echo "Direct off-cpu test [Success]"
+}
test_offcpu_priv
@@ -99,5 +122,9 @@ if [ $err = 0 ]; then
test_offcpu_child
fi
+if [ $err = 0 ]; then
+ test_offcpu_direct
+fi
+
cleanup
exit $err
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 6ea2be86b7bf..c7a5e27c4567 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -206,6 +206,7 @@ DECLARE_WORKLOAD(sqrtloop);
DECLARE_WORKLOAD(brstack);
DECLARE_WORKLOAD(datasym);
DECLARE_WORKLOAD(landlock);
+DECLARE_WORKLOAD(offcpu);
extern const char *dso_to_test;
extern const char *test_objdump_path;
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index 5af17206f04d..0e78fd01eaf1 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -7,6 +7,7 @@ perf-test-y += sqrtloop.o
perf-test-y += brstack.o
perf-test-y += datasym.o
perf-test-y += landlock.o
+perf-test-y += offcpu.o
CFLAGS_sqrtloop.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
CFLAGS_leafloop.o = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/offcpu.c b/tools/perf/tests/workloads/offcpu.c
new file mode 100644
index 000000000000..57cee201a4c3
--- /dev/null
+++ b/tools/perf/tests/workloads/offcpu.c
@@ -0,0 +1,16 @@
+#include <linux/compiler.h>
+#include <unistd.h>
+#include "../tests.h"
+
+static int offcpu(int argc __maybe_unused, const char **argv __maybe_unused)
+{
+ /* get past the initial delay */
+ sleep(1);
+
+ /* what we want to collect as a direct sample */
+ sleep(2);
+
+ return 0;
+}
+
+DEFINE_WORKLOAD(offcpu);
--
2.45.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/9] perf evsel: Set BPF output to system-wide
2024-08-07 15:38 ` [PATCH v4 1/9] perf evsel: Set BPF output to system-wide Howard Chu
@ 2024-08-07 23:21 ` Namhyung Kim
2024-08-08 3:58 ` Howard Chu
0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2024-08-07 23:21 UTC (permalink / raw)
To: Howard Chu
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Wed, Aug 07, 2024 at 11:38:35PM +0800, Howard Chu wrote:
> pid = -1 for bpf-output event.
>
> This makes perf record -p <PID> --off-cpu work. Otherwise bpf-output
> cannot be collected.
I don't understand why it's necessary. Why isn't it collected?
Is it the kernel rejecting the BPF output event at open time?
Thanks,
Namhyung
>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
> tools/perf/util/evsel.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index bc603193c477..b961467133cf 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2282,6 +2282,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>
> test_attr__ready();
>
> + /* BPF output event can only be system-wide */
> + if (evsel__is_bpf_output(evsel))
> + pid = -1;
> +
> /* Debug message used by test scripts */
> pr_debug2_peo("sys_perf_event_open: pid %d cpu %d group_fd %d flags %#lx",
> pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh
2024-08-07 15:38 ` [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
@ 2024-08-07 23:22 ` Namhyung Kim
2024-08-08 4:01 ` Howard Chu
0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2024-08-07 23:22 UTC (permalink / raw)
To: Howard Chu
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Wed, Aug 07, 2024 at 11:38:36PM +0800, Howard Chu wrote:
> Add the --off-cpu-thresh argument to specify the off-cpu time threshold.
> If the off-cpu time exceeds this threshold, dump the off-cpu data
> directly.
>
> Suggested-by: Ian Rogers <irogers@google.com>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
> tools/perf/builtin-record.c | 26 ++++++++++++++++++++++++++
> tools/perf/util/bpf_off_cpu.c | 2 ++
> tools/perf/util/bpf_skel/off_cpu.bpf.c | 2 ++
> tools/perf/util/record.h | 1 +
> 4 files changed, 31 insertions(+)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 72345d1e54b0..60c6fe7b4804 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -3133,6 +3133,28 @@ static int record__parse_mmap_pages(const struct option *opt,
> return ret;
> }
>
> +static int record__parse_off_cpu_thresh(const struct option *opt,
> + const char *str,
> + int unset __maybe_unused)
> +{
> + struct record_opts *opts = opt->value;
> + char *endptr;
> + u64 off_cpu_thresh;
> +
> + if (!str)
> + return -EINVAL;
> +
> + off_cpu_thresh = strtoul(str, &endptr, 10);
> +
> + /* threshold isn't string "0", yet strtoull() returns 0, parsing failed. */
> + if (*endptr || (off_cpu_thresh == 0 && strcmp(str, "0")))
> + return -EINVAL;
> + else
> + opts->off_cpu_thresh = off_cpu_thresh;
> +
> + return 0;
> +}
> +
> void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
> {
> }
> @@ -3326,6 +3348,7 @@ static struct record record = {
> .ctl_fd = -1,
> .ctl_fd_ack = -1,
> .synth = PERF_SYNTH_ALL,
> + .off_cpu_thresh = OFF_CPU_THRESH_DEFAULT,
Where is it defined?
Thanks,
Namhyung
> },
> .tool = {
> .sample = process_sample_event,
> @@ -3560,6 +3583,9 @@ static struct option __record_options[] = {
> OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
> "BPF filter action"),
> + OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "us",
> + "Dump off-cpu samples if off-cpu time reaches this threshold. The unit is microseconds. (default: 500000)",
> + record__parse_off_cpu_thresh),
> OPT_END()
> };
>
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index 6af36142dc5a..1e0e454bfb5e 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -272,6 +272,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
> }
> }
>
> + skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000ull;
> +
> err = off_cpu_bpf__attach(skel);
> if (err) {
> pr_err("Failed to attach off-cpu BPF skeleton\n");
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index d877a0a9731f..cca1b6990a57 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -96,6 +96,8 @@ const volatile bool uses_cgroup_v1 = false;
>
> int perf_subsys_id = -1;
>
> +__u64 sample_id, sample_type, offcpu_thresh;
> +
> /*
> * Old kernel used to call it task_struct->state and now it's '__state'.
> * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> index a6566134e09e..3c11416e6627 100644
> --- a/tools/perf/util/record.h
> +++ b/tools/perf/util/record.h
> @@ -79,6 +79,7 @@ struct record_opts {
> int synth;
> int threads_spec;
> const char *threads_user_spec;
> + u64 off_cpu_thresh;
> };
>
> extern const char * const *record_usage;
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event
2024-08-07 15:38 ` [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event Howard Chu
@ 2024-08-07 23:25 ` Namhyung Kim
2024-08-08 8:52 ` Howard Chu
0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2024-08-07 23:25 UTC (permalink / raw)
To: Howard Chu
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Wed, Aug 07, 2024 at 11:38:37PM +0800, Howard Chu wrote:
> Parse offcpu-time event using parse_event, in off_cpu_start(), write
> evlist fds got from evlist__open() to perf_event_array BPF map.
>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
> tools/perf/util/bpf_off_cpu.c | 55 ++++++++++++++++++++---------------
> tools/perf/util/evsel.c | 2 +-
> 2 files changed, 32 insertions(+), 25 deletions(-)
>
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index 1e0e454bfb5e..fae0bb8aaa13 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -13,6 +13,7 @@
> #include "util/cgroup.h"
> #include "util/strlist.h"
> #include <bpf/bpf.h>
> +#include <internal/xyarray.h>
>
> #include "bpf_skel/off_cpu.skel.h"
>
> @@ -38,39 +39,24 @@ union off_cpu_data {
>
> static int off_cpu_config(struct evlist *evlist)
> {
> - struct evsel *evsel;
> - struct perf_event_attr attr = {
> - .type = PERF_TYPE_SOFTWARE,
> - .config = PERF_COUNT_SW_BPF_OUTPUT,
> - .size = sizeof(attr), /* to capture ABI version */
> - };
> - char *evname = strdup(OFFCPU_EVENT);
> + char off_cpu_event[64];
>
> - if (evname == NULL)
> - return -ENOMEM;
> -
> - evsel = evsel__new(&attr);
> - if (!evsel) {
> - free(evname);
> - return -ENOMEM;
> + /* after parsing off-cpu event, we'll specify its sample_type in evsel__config() */
> + scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT);
> + if (parse_event(evlist, off_cpu_event)) {
> + pr_err("Failed to open off-cpu event\n");
> + return -1;
> }
>
> - evsel->core.attr.freq = 1;
> - evsel->core.attr.sample_period = 1;
> - /* off-cpu analysis depends on stack trace */
> - evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> -
> - evlist__add(evlist, evsel);
> -
> - free(evsel->name);
> - evsel->name = evname;
> -
> return 0;
> }
>
> static void off_cpu_start(void *arg)
> {
> struct evlist *evlist = arg;
> + struct evsel *evsel;
> + struct perf_cpu pcpu;
> + int i, err;
>
> /* update task filter for the given workload */
> if (!skel->bss->has_cpu && !skel->bss->has_task &&
> @@ -86,6 +72,27 @@ static void off_cpu_start(void *arg)
> bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> }
>
> + /* sample id and fds in BPF's perf_event_array can only be set after record__open() */
> + evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
> + if (evsel == NULL) {
> + pr_err("%s evsel not found\n", OFFCPU_EVENT);
> + return;
> + }
> +
> + if (evsel->core.id)
> + skel->bss->sample_id = evsel->core.id[0];
> +
> + perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
> + err = bpf_map__update_elem(skel->maps.offcpu_output,
> + &pcpu.cpu, sizeof(__u32),
> + xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
> + sizeof(__u32), BPF_ANY);
> + if (err) {
> + pr_err("Failed to update perf event map for direct off-cpu dumping\n");
> + return;
> + }
> + }
> +
> skel->bss->enabled = 1;
> }
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index b961467133cf..ccd3bda02b5d 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1379,7 +1379,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> evsel__reset_sample_bit(evsel, BRANCH_STACK);
>
> if (evsel__is_offcpu_event(evsel))
> - evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
> + evsel->core.attr.sample_type = OFFCPU_SAMPLE_TYPES;
I don't think we need this. It should check what you requested.
IOW you don't need to put cgroup info when user didn't ask.
Thanks,
Namhyung
>
> arch__post_evsel_config(evsel, attr);
> }
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF
2024-08-07 15:38 ` [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
@ 2024-08-07 23:49 ` Namhyung Kim
2024-08-08 11:57 ` Howard Chu
0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2024-08-07 23:49 UTC (permalink / raw)
To: Howard Chu
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Wed, Aug 07, 2024 at 11:38:38PM +0800, Howard Chu wrote:
> Output PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD, PERF_SAMPLE_CALLCHAIN, and
> PERF_SAMPLE_CGROUP in BPF. Ideally, we should only output
> PERF_SAMPLE_PERIOD (off-cpu time) and PERF_SAMPLE_CALLCHAIN (sched_in
> process's callchain). One only needs to set PERF_SAMPLE_TID and
> PERF_SAMPLE_CGROUP, and perf_event will do everything for us.
>
> But in reality, that's not the case. Setting PERF_SAMPLE_TID will mostly
> give us TID of 0. We might get the correct TID for offcpu-time event
> from time to time, but it is really rare.
> swapper 0 [000] offcpu-time: /
> :1321819 1321819 [002] offcpu-time: /user.slice/user-1000.slice/session-2.scope
> swapper 0 [001] offcpu-time: /
> swapper 0 [003] offcpu-time: /
>
> And setting PERF_SAMPLE_CGROUP doesn't work properly either.
> tmux: server 3701 [003] offcpu-time: /
> blueman-tray 1064 [001] offcpu-time: /
> bash 1350867 [001] offcpu-time: /
> bash 1350844 [000] offcpu-time: /
Isn't it just because your system was idle?
>
> We need to retrieve PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD,
> PERF_SAMPLE_CALLCHAIN, and PERF_SAMPLE_CGROUP using BPF and output these
> four fields.
>
> Add perf_event_array map for dumping direct off-cpu samples, but keep
> the at-the-end approach.
>
> Tons of checking before access, to pass the BPF verifier.
>
> If off-cpu time (represented as delta) exceeds the off-cpu threshold, do
> output.
>
> Suggested-by: Namhyung Kim <namhyung@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
> tools/perf/util/bpf_skel/off_cpu.bpf.c | 133 +++++++++++++++++++++++++
> 1 file changed, 133 insertions(+)
>
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index cca1b6990a57..95cc130bb67e 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -18,6 +18,9 @@
> #define MAX_STACKS 32
> #define MAX_ENTRIES 102400
>
> +#define MAX_CPUS 4096
> +#define MAX_OFFCPU_LEN 128
> +
> struct tstamp_data {
> __u32 stack_id;
> __u32 state;
> @@ -32,6 +35,7 @@ struct offcpu_key {
> __u64 cgroup_id;
> };
>
> +/* for dumping at the end */
> struct {
> __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> __uint(key_size, sizeof(__u32));
> @@ -39,6 +43,45 @@ struct {
> __uint(max_entries, MAX_ENTRIES);
> } stacks SEC(".maps");
>
> +struct offcpu_data {
> + u64 array[MAX_OFFCPU_LEN];
> +};
> +
> +struct stack_data {
> + u64 array[MAX_STACKS];
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> + __uint(key_size, sizeof(__u32));
> + __uint(value_size, sizeof(__u32));
> + __uint(max_entries, MAX_CPUS);
> +} offcpu_output SEC(".maps");
> +
> +/* temporary offcpu sample */
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> + __uint(key_size, sizeof(__u32));
> + __uint(value_size, sizeof(struct offcpu_data));
> + __uint(max_entries, 1);
> +} offcpu_payload SEC(".maps");
> +
> +/* temporary stack data */
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> + __uint(key_size, sizeof(__u32));
> + __uint(value_size, sizeof(struct stack_data));
> + __uint(max_entries, 1);
> +} stack_tmp SEC(".maps");
> +
> +/* cached stack per task storage */
> +struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __type(key, int);
> + __type(value, struct stack_data);
> +} stack_cache SEC(".maps");
> +
> struct {
> __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> __uint(map_flags, BPF_F_NO_PREALLOC);
> @@ -184,12 +227,75 @@ static inline int can_record(struct task_struct *t, int state)
> return 1;
> }
>
> +static inline bool check_bounds(int index)
> +{
> + if (index >= 0 && index < MAX_OFFCPU_LEN)
> + return true;
> +
> + return false;
> +}
> +
> +static inline int copy_stack(struct stack_data *from,
> + struct offcpu_data *to, int n)
> +{
> + int max_stacks = MAX_STACKS, len = 0;
> +
> + if (!from)
> + return len;
> +
> + for (int i = 0; i < max_stacks && from->array[i]; ++i) {
> + if (check_bounds(n + 2 + i)) {
> + to->array[n + 2 + i] = from->array[i];
> + ++len;
> + }
> + }
> + return len;
> +}
> +
> +static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
> + struct stack_data *stack_p, __u64 delta, __u64 timestamp)
> +{
> + int size, n = 0, ip_pos = -1, len = 0;
> +
> + if (sample_type & PERF_SAMPLE_TID && check_bounds(n))
> + data->array[n++] = (u64)key->tgid << 32 | key->pid;
> + if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
> + data->array[n++] = delta;
> + if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
> + /* data->array[n] is callchain->nr (updated later) */
> + data->array[n + 1] = PERF_CONTEXT_USER;
> + data->array[n + 2] = 0;
> +
> + len = copy_stack(stack_p, data, n);
> +
> + /* update length of callchain */
> + data->array[n] = len + 1;
> +
> + /* update sample ip with the first callchain entry */
> + if (ip_pos >= 0)
> + data->array[ip_pos] = data->array[n + 2];
Where is ip_pos set? I guess we don't need this as we don't update the
SAMPLE_IP here.
> +
> + /* calculate sample callchain data->array length */
> + n += len + 2;
> + }
> + if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
> + data->array[n++] = key->cgroup_id;
> +
> + size = n * sizeof(u64);
> + if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
> + bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, size);
> +
> + return 0;
> +}
> +
> static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> struct task_struct *next, int state)
> {
> __u64 ts;
> __u32 stack_id;
> struct tstamp_data *pelem;
> + struct stack_data *stack_tmp_p, *stack_p;
> + int zero = 0, len = 0;
>
> ts = bpf_ktime_get_ns();
>
> @@ -199,6 +305,25 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> stack_id = bpf_get_stackid(ctx, &stacks,
> BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
>
> + /* cache stacks */
> + stack_tmp_p = bpf_map_lookup_elem(&stack_tmp, &zero);
> + if (stack_tmp_p)
> + len = bpf_get_stack(ctx, stack_tmp_p->array, MAX_STACKS * sizeof(u64),
> + BPF_F_USER_STACK) / sizeof(u64);
Looks redundant, but not sure if we share the stackid..
> +
> + /*
> + * if stacks are successfully collected, cache them to task_storage, they are then
> + * dumped if the off-cpu time hits the threshold.
> + */
> + if (len > 0) {
> + stack_p = bpf_task_storage_get(&stack_cache, prev, NULL,
> + BPF_LOCAL_STORAGE_GET_F_CREATE);
> + if (stack_p) {
> + for (int i = 0; i < len && i < MAX_STACKS; ++i)
> + stack_p->array[i] = stack_tmp_p->array[i];
> + }
> + }
Why not use the task storage to get the stacktrace directly?
Thanks,
Namhyung
> +
> pelem = bpf_task_storage_get(&tstamp, prev, NULL,
> BPF_LOCAL_STORAGE_GET_F_CREATE);
> if (!pelem)
> @@ -228,6 +353,14 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> else
> bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
>
> + if (delta >= offcpu_thresh) {
> + struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
> +
> + stack_p = bpf_task_storage_get(&stack_cache, next, NULL, 0);
> + if (data && stack_p)
> + off_cpu_dump(ctx, data, &key, stack_p, delta, pelem->timestamp);
> + }
> +
> /* prevent to reuse the timestamp later */
> pelem->timestamp = 0;
> }
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 8/9] perf header: Add field 'embed'
2024-08-07 15:38 ` [PATCH v4 8/9] perf header: Add field 'embed' Howard Chu
@ 2024-08-07 23:52 ` Namhyung Kim
2024-08-08 13:57 ` Howard Chu
0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2024-08-07 23:52 UTC (permalink / raw)
To: Howard Chu
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Wed, Aug 07, 2024 at 11:38:42PM +0800, Howard Chu wrote:
> We have to save the embedded data's sample type for it to be consumed
> correctly by perf script or perf report.
>
> This will approach most definitely break some perf.data convertor.
>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
> tools/perf/util/header.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
> index 55e9553861d0..d60e77d5c25c 100644
> --- a/tools/perf/util/header.c
> +++ b/tools/perf/util/header.c
> @@ -80,6 +80,7 @@ const char perf_version_string[] = PERF_VERSION;
>
> struct perf_file_attr {
> struct perf_event_attr attr;
> + __u64 embed;
Can we just set bpf_output.attr correctly and get rid of this?
Thanks,
Namhyung
> struct perf_file_section ids;
> };
>
> @@ -3713,6 +3714,7 @@ static int perf_session__do_write_header(struct perf_session *session,
> }
> f_attr = (struct perf_file_attr){
> .attr = evsel->core.attr,
> + .embed = evsel->sample_type_embed,
> .ids = {
> .offset = evsel->id_offset,
> .size = evsel->core.ids * sizeof(u64),
> @@ -4147,6 +4149,14 @@ static int read_attr(int fd, struct perf_header *ph,
>
> ret = readn(fd, ptr, left);
> }
> +
> + ret = readn(fd, &f_attr->embed, sizeof(f_attr->embed));
> + if (ret <= 0) {
> + pr_debug("failed to read %d bytes of embedded sample type\n",
> + (int)sizeof(f_attr->embed));
> + return -1;
> + }
> +
> /* read perf_file_section, ids are read in caller */
> ret = readn(fd, &f_attr->ids, sizeof(f_attr->ids));
>
> @@ -4272,6 +4282,8 @@ int perf_session__read_header(struct perf_session *session, int repipe_fd)
> tmp = lseek(fd, 0, SEEK_CUR);
> evsel = evsel__new(&f_attr.attr);
>
> + evsel->sample_type_embed = f_attr.embed;
> +
> if (evsel == NULL)
> goto out_delete_evlist;
>
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/9] perf evsel: Set BPF output to system-wide
2024-08-07 23:21 ` Namhyung Kim
@ 2024-08-08 3:58 ` Howard Chu
2024-09-25 2:53 ` Ian Rogers
0 siblings, 1 reply; 24+ messages in thread
From: Howard Chu @ 2024-08-08 3:58 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Hello,
The event does open, but bpf_perf_event_output() in BPF will return
-95 (-EOPNOTSUPP), so there is no output. I think this EOPNOTSUPP comes
not from bpf_trace.c's __bpf_perf_event_output(), but from perf_event's
perf_event_output() called by BPF.
<idle>-0 [001] d..4. 154921.079230: bpf_trace_printk: err -95
This is also a bug in perf trace -p <PID>.
Thanks,
Howard
On Thu, Aug 8, 2024 at 7:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Aug 07, 2024 at 11:38:35PM +0800, Howard Chu wrote:
> > pid = -1 for bpf-output event.
> >
> > This makes perf record -p <PID> --off-cpu work. Otherwise bpf-output
> > cannot be collected.
>
> I don't understand why it's necessary. Why isn't it collected?
> Is it the kernel to reject the BPF output event to open?
>
> Thanks,
> Namhyung
>
> >
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > ---
> > tools/perf/util/evsel.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index bc603193c477..b961467133cf 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -2282,6 +2282,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> >
> > test_attr__ready();
> >
> > + /* BPF output event can only be system-wide */
> > + if (evsel__is_bpf_output(evsel))
> > + pid = -1;
> > +
> > /* Debug message used by test scripts */
> > pr_debug2_peo("sys_perf_event_open: pid %d cpu %d group_fd %d flags %#lx",
> > pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
> > --
> > 2.45.2
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh
2024-08-07 23:22 ` Namhyung Kim
@ 2024-08-08 4:01 ` Howard Chu
0 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-08 4:01 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Thu, Aug 8, 2024 at 7:22 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Aug 07, 2024 at 11:38:36PM +0800, Howard Chu wrote:
> > Add the --off-cpu-thresh argument to specify the off-cpu time threshold.
> > If the off-cpu time exceeds this threshold, dump the off-cpu data
> > directly.
> >
> > Suggested-by: Ian Rogers <irogers@google.com>
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > ---
> > tools/perf/builtin-record.c | 26 ++++++++++++++++++++++++++
> > tools/perf/util/bpf_off_cpu.c | 2 ++
> > tools/perf/util/bpf_skel/off_cpu.bpf.c | 2 ++
> > tools/perf/util/record.h | 1 +
> > 4 files changed, 31 insertions(+)
> >
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index 72345d1e54b0..60c6fe7b4804 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -3133,6 +3133,28 @@ static int record__parse_mmap_pages(const struct option *opt,
> > return ret;
> > }
> >
> > +static int record__parse_off_cpu_thresh(const struct option *opt,
> > + const char *str,
> > + int unset __maybe_unused)
> > +{
> > + struct record_opts *opts = opt->value;
> > + char *endptr;
> > + u64 off_cpu_thresh;
> > +
> > + if (!str)
> > + return -EINVAL;
> > +
> > + off_cpu_thresh = strtoul(str, &endptr, 10);
> > +
> > + /* threshold isn't the string "0", yet strtoul() returned 0: parsing failed */
> > + if (*endptr || (off_cpu_thresh == 0 && strcmp(str, "0")))
> > + return -EINVAL;
> > + else
> > + opts->off_cpu_thresh = off_cpu_thresh;
> > +
> > + return 0;
> > +}
> > +
> > void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
> > {
> > }
> > @@ -3326,6 +3348,7 @@ static struct record record = {
> > .ctl_fd = -1,
> > .ctl_fd_ack = -1,
> > .synth = PERF_SYNTH_ALL,
> > + .off_cpu_thresh = OFF_CPU_THRESH_DEFAULT,
>
> Where is it defined?
Poor formatting, sorry.
It's in [PATCH v4 5/9] perf record --off-cpu: Dump total off-cpu time at the end
+#define OFF_CPU_THRESH_DEFAULT 500000ull
Thanks,
Howard
>
> Thanks,
> Namhyung
>
>
> > },
> > .tool = {
> > .sample = process_sample_event,
> > @@ -3560,6 +3583,9 @@ static struct option __record_options[] = {
> > OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> > OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
> > "BPF filter action"),
> > + OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "us",
> > + "Dump off-cpu samples if off-cpu time reaches this threshold. The unit is microseconds. (default: 500000)",
> > + record__parse_off_cpu_thresh),
> > OPT_END()
> > };
> >
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > index 6af36142dc5a..1e0e454bfb5e 100644
> > --- a/tools/perf/util/bpf_off_cpu.c
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -272,6 +272,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
> > }
> > }
> >
> > + skel->bss->offcpu_thresh = opts->off_cpu_thresh * 1000ull;
> > +
> > err = off_cpu_bpf__attach(skel);
> > if (err) {
> > pr_err("Failed to attach off-cpu BPF skeleton\n");
> > diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > index d877a0a9731f..cca1b6990a57 100644
> > --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > @@ -96,6 +96,8 @@ const volatile bool uses_cgroup_v1 = false;
> >
> > int perf_subsys_id = -1;
> >
> > +__u64 sample_id, sample_type, offcpu_thresh;
> > +
> > /*
> > * Old kernel used to call it task_struct->state and now it's '__state'.
> > * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> > diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
> > index a6566134e09e..3c11416e6627 100644
> > --- a/tools/perf/util/record.h
> > +++ b/tools/perf/util/record.h
> > @@ -79,6 +79,7 @@ struct record_opts {
> > int synth;
> > int threads_spec;
> > const char *threads_user_spec;
> > + u64 off_cpu_thresh;
> > };
> >
> > extern const char * const *record_usage;
> > --
> > 2.45.2
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event
2024-08-07 23:25 ` Namhyung Kim
@ 2024-08-08 8:52 ` Howard Chu
0 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-08 8:52 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Thu, Aug 8, 2024 at 7:25 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Aug 07, 2024 at 11:38:37PM +0800, Howard Chu wrote:
> > Parse offcpu-time event using parse_event, in off_cpu_start(), write
> > evlist fds got from evlist__open() to perf_event_array BPF map.
> >
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > ---
> > tools/perf/util/bpf_off_cpu.c | 55 ++++++++++++++++++++---------------
> > tools/perf/util/evsel.c | 2 +-
> > 2 files changed, 32 insertions(+), 25 deletions(-)
> >
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > index 1e0e454bfb5e..fae0bb8aaa13 100644
> > --- a/tools/perf/util/bpf_off_cpu.c
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -13,6 +13,7 @@
> > #include "util/cgroup.h"
> > #include "util/strlist.h"
> > #include <bpf/bpf.h>
> > +#include <internal/xyarray.h>
> >
> > #include "bpf_skel/off_cpu.skel.h"
> >
> > @@ -38,39 +39,24 @@ union off_cpu_data {
> >
> > static int off_cpu_config(struct evlist *evlist)
> > {
> > - struct evsel *evsel;
> > - struct perf_event_attr attr = {
> > - .type = PERF_TYPE_SOFTWARE,
> > - .config = PERF_COUNT_SW_BPF_OUTPUT,
> > - .size = sizeof(attr), /* to capture ABI version */
> > - };
> > - char *evname = strdup(OFFCPU_EVENT);
> > + char off_cpu_event[64];
> >
> > - if (evname == NULL)
> > - return -ENOMEM;
> > -
> > - evsel = evsel__new(&attr);
> > - if (!evsel) {
> > - free(evname);
> > - return -ENOMEM;
> > + /* after parsing off-cpu event, we'll specify its sample_type in evsel__config() */
> > + scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT);
> > + if (parse_event(evlist, off_cpu_event)) {
> > + pr_err("Failed to open off-cpu event\n");
> > + return -1;
> > }
> >
> > - evsel->core.attr.freq = 1;
> > - evsel->core.attr.sample_period = 1;
> > - /* off-cpu analysis depends on stack trace */
> > - evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> > -
> > - evlist__add(evlist, evsel);
> > -
> > - free(evsel->name);
> > - evsel->name = evname;
> > -
> > return 0;
> > }
> >
> > static void off_cpu_start(void *arg)
> > {
> > struct evlist *evlist = arg;
> > + struct evsel *evsel;
> > + struct perf_cpu pcpu;
> > + int i, err;
> >
> > /* update task filter for the given workload */
> > if (!skel->bss->has_cpu && !skel->bss->has_task &&
> > @@ -86,6 +72,27 @@ static void off_cpu_start(void *arg)
> > bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> > }
> >
> > + /* sample id and fds in BPF's perf_event_array can only be set after record__open() */
> > + evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
> > + if (evsel == NULL) {
> > + pr_err("%s evsel not found\n", OFFCPU_EVENT);
> > + return;
> > + }
> > +
> > + if (evsel->core.id)
> > + skel->bss->sample_id = evsel->core.id[0];
> > +
> > + perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
> > + err = bpf_map__update_elem(skel->maps.offcpu_output,
> > + &pcpu.cpu, sizeof(__u32),
> > + xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
> > + sizeof(__u32), BPF_ANY);
> > + if (err) {
> > + pr_err("Failed to update perf event map for direct off-cpu dumping\n");
> > + return;
> > + }
> > + }
> > +
> > skel->bss->enabled = 1;
> > }
> >
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index b961467133cf..ccd3bda02b5d 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -1379,7 +1379,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> > evsel__reset_sample_bit(evsel, BRANCH_STACK);
> >
> > if (evsel__is_offcpu_event(evsel))
> > - evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
> > + evsel->core.attr.sample_type = OFFCPU_SAMPLE_TYPES;
>
> I don't think we need this. It should check what you requested.
> IOW you don't need to put cgroup info when user didn't ask.
I misunderstood the purpose of this check, so I'll leave it as it was.
Thanks,
Howard
>
> Thanks,
> Namhyung
>
> >
> > arch__post_evsel_config(evsel, attr);
> > }
> > --
> > 2.45.2
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF
2024-08-07 23:49 ` Namhyung Kim
@ 2024-08-08 11:57 ` Howard Chu
0 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-08 11:57 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Hello,
On Thu, Aug 8, 2024 at 7:50 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Aug 07, 2024 at 11:38:38PM +0800, Howard Chu wrote:
> > Output PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD, PERF_SAMPLE_CALLCHAIN, and
> > PERF_SAMPLE_CGROUP in BPF. Ideally, we should only output
> > PERF_SAMPLE_PERIOD (off-cpu time) and PERF_SAMPLE_CALLCHAIN (sched_in
> > process's callchain). One only needs to set PERF_SAMPLE_TID and
> > PERF_SAMPLE_CGROUP, and perf_event will do everything for us.
> >
> > But in reality, that's not the case. Setting PERF_SAMPLE_TID will mostly
> > give us TID of 0. We might get the correct TID for offcpu-time event
> > from time to time, but it is really rare.
> > swapper 0 [000] offcpu-time: /
> > :1321819 1321819 [002] offcpu-time: /user.slice/user-1000.slice/session-2.scope
> > swapper 0 [001] offcpu-time: /
> > swapper 0 [003] offcpu-time: /
> >
> > And setting PERF_SAMPLE_CGROUP doesn't work properly either.
> > tmux: server 3701 [003] offcpu-time: /
> > blueman-tray 1064 [001] offcpu-time: /
> > bash 1350867 [001] offcpu-time: /
> > bash 1350844 [000] offcpu-time: /
>
> Isn't it just because your system was idle?
I'd like to say no, because close to 100% of the offcpu-time samples'
TID is 0. Using unmodified 'perf record --off-cpu' yielded different
results, with most of the samples having a non-zero TID.
I also tested collecting the cgroup id with and without BPF output.
This is the process that I'm testing for:
perf $ cat /proc/3555534/cgroup
0::/user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97cb2a74.scope
Without BPF, we'd get (I added a printf to print cgroup id):
perf $ ./perf record -F 1 --all-cgroups --off-cpu -p 3555534
^C[ perf record: Woken up 1 times to write data ]
evsel cycles:P cgroup id is 7654
evsel cycles:P cgroup id is 7654
evsel cycles:P cgroup id is 7654
evsel cycles:P cgroup id is 7654
evsel offcpu-time cgroup id is 1
evsel offcpu-time cgroup id is 1
evsel offcpu-time cgroup id is 1
evsel offcpu-time cgroup id is 0
[ perf record: Captured and wrote 0.024 MB perf.data (8 samples) ]
perf $ ./perf script -F cgroup,sym,event
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
offcpu-time: /
offcpu-time: /
offcpu-time: /
offcpu-time: unknown
The reason one of the offcpu samples' cgroup id is 0 is that it's an
at-the-end sample, where 0 is used as a placeholder. The direct offcpu
sample has a cgroup id of 1 (collected by perf_event itself), which is
wrong.
If we instead use BPF to collect the offcpu samples, we'll get:
perf $ ./perf record -F 1 --all-cgroups --off-cpu -p 3555534
^C[ perf record: Woken up 1 times to write data ]
evsel cycles:P cgroup id is 7654
evsel cycles:P cgroup id is 7654
evsel cycles:P cgroup id is 7654
evsel cycles:P cgroup id is 7654
evsel offcpu-time cgroup id is 1
evsel offcpu-time cgroup id is 7654 (embedded data overwrite)
evsel offcpu-time cgroup id is 1
evsel offcpu-time cgroup id is 7654 (embedded data overwrite)
evsel offcpu-time cgroup id is 0
evsel offcpu-time cgroup id is 7654 (embedded data overwrite)
This time offcpu-time's cgroup_id is correct.
perf $ ./perf script -F cgroup,sym,event
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
cycles:P: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
offcpu-time: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
offcpu-time: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
offcpu-time: /user.slice/user-0.slice/user@0.service/tmux-spawn-cf25ea6d-587f-408d-92f5-2c8e97>
>
> >
> > We need to retrieve PERF_SAMPLE_TID, PERF_SAMPLE_PERIOD,
> > PERF_SAMPLE_CALLCHAIN, and PERF_SAMPLE_CGROUP using BPF and output these
> > four fields.
> >
> > Add perf_event_array map for dumping direct off-cpu samples, but keep
> > the at-the-end approach.
> >
> > Tons of checking before access, to pass the BPF verifier.
> >
> > If off-cpu time (represented as delta) exceeds the off-cpu threshold, do
> > output.
> >
> > Suggested-by: Namhyung Kim <namhyung@kernel.org>
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > ---
> > tools/perf/util/bpf_skel/off_cpu.bpf.c | 133 +++++++++++++++++++++++++
> > 1 file changed, 133 insertions(+)
> >
> > diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > index cca1b6990a57..95cc130bb67e 100644
> > --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > @@ -18,6 +18,9 @@
> > #define MAX_STACKS 32
> > #define MAX_ENTRIES 102400
> >
> > +#define MAX_CPUS 4096
> > +#define MAX_OFFCPU_LEN 128
> > +
> > struct tstamp_data {
> > __u32 stack_id;
> > __u32 state;
> > @@ -32,6 +35,7 @@ struct offcpu_key {
> > __u64 cgroup_id;
> > };
> >
> > +/* for dumping at the end */
> > struct {
> > __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> > __uint(key_size, sizeof(__u32));
> > @@ -39,6 +43,45 @@ struct {
> > __uint(max_entries, MAX_ENTRIES);
> > } stacks SEC(".maps");
> >
> > +struct offcpu_data {
> > + u64 array[MAX_OFFCPU_LEN];
> > +};
> > +
> > +struct stack_data {
> > + u64 array[MAX_STACKS];
> > +};
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> > + __uint(key_size, sizeof(__u32));
> > + __uint(value_size, sizeof(__u32));
> > + __uint(max_entries, MAX_CPUS);
> > +} offcpu_output SEC(".maps");
> > +
> > +/* temporary offcpu sample */
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > + __uint(key_size, sizeof(__u32));
> > + __uint(value_size, sizeof(struct offcpu_data));
> > + __uint(max_entries, 1);
> > +} offcpu_payload SEC(".maps");
> > +
> > +/* temporary stack data */
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > + __uint(key_size, sizeof(__u32));
> > + __uint(value_size, sizeof(struct stack_data));
> > + __uint(max_entries, 1);
> > +} stack_tmp SEC(".maps");
> > +
> > +/* cached stack per task storage */
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > + __type(key, int);
> > + __type(value, struct stack_data);
> > +} stack_cache SEC(".maps");
> > +
> > struct {
> > __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > __uint(map_flags, BPF_F_NO_PREALLOC);
> > @@ -184,12 +227,75 @@ static inline int can_record(struct task_struct *t, int state)
> > return 1;
> > }
> >
> > +static inline bool check_bounds(int index)
> > +{
> > + if (index >= 0 && index < MAX_OFFCPU_LEN)
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static inline int copy_stack(struct stack_data *from,
> > + struct offcpu_data *to, int n)
> > +{
> > + int max_stacks = MAX_STACKS, len = 0;
> > +
> > + if (!from)
> > + return len;
> > +
> > + for (int i = 0; i < max_stacks && from->array[i]; ++i) {
> > + if (check_bounds(n + 2 + i)) {
> > + to->array[n + 2 + i] = from->array[i];
> > + ++len;
> > + }
> > + }
> > + return len;
> > +}
> > +
> > +static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
> > + struct stack_data *stack_p, __u64 delta, __u64 timestamp)
> > +{
> > + int size, n = 0, ip_pos = -1, len = 0;
> > +
> > + if (sample_type & PERF_SAMPLE_TID && check_bounds(n))
> > + data->array[n++] = (u64)key->tgid << 32 | key->pid;
> > + if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
> > + data->array[n++] = delta;
> > + if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
> > + /* data->array[n] is callchain->nr (updated later) */
> > + data->array[n + 1] = PERF_CONTEXT_USER;
> > + data->array[n + 2] = 0;
> > +
> > + len = copy_stack(stack_p, data, n);
> > +
> > + /* update length of callchain */
> > + data->array[n] = len + 1;
> > +
> > + /* update sample ip with the first callchain entry */
> > + if (ip_pos >= 0)
> > + data->array[ip_pos] = data->array[n + 2];
>
> Where is ip_pos set? I guess we don't need this as we don't update the
> SAMPLE_IP here.
Yes, ip_pos is unnecessary.
>
> > +
> > + /* calculate sample callchain data->array length */
> > + n += len + 2;
> > + }
> > + if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
> > + data->array[n++] = key->cgroup_id;
> > +
> > + size = n * sizeof(u64);
> > + if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
> > + bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, size);
> > +
> > + return 0;
> > +}
> > +
> > static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> > struct task_struct *next, int state)
> > {
> > __u64 ts;
> > __u32 stack_id;
> > struct tstamp_data *pelem;
> > + struct stack_data *stack_tmp_p, *stack_p;
> > + int zero = 0, len = 0;
> >
> > ts = bpf_ktime_get_ns();
> >
> > @@ -199,6 +305,25 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> > stack_id = bpf_get_stackid(ctx, &stacks,
> > BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
> >
> > + /* cache stacks */
> > + stack_tmp_p = bpf_map_lookup_elem(&stack_tmp, &zero);
> > + if (stack_tmp_p)
> > + len = bpf_get_stack(ctx, stack_tmp_p->array, MAX_STACKS * sizeof(u64),
> > + BPF_F_USER_STACK) / sizeof(u64);
>
> Looks redundant, but not sure if we share the stackid..
I use bpf_get_stack() because there's only one helper for
BPF_MAP_TYPE_STACK_TRACE, and that's bpf_get_stackid(). That means one
can only collect, not retrieve, stack traces (e.g. via
bpf_map_lookup_elem()) with a STACK_TRACE map inside a BPF program. For
the STACK_TRACE map, all the processing happens in user space. Please
see kernel/bpf/verifier.c (line 8982):
	case BPF_MAP_TYPE_RINGBUF:
		if (func_id != BPF_FUNC_ringbuf_output &&
		    func_id != BPF_FUNC_ringbuf_reserve &&
		    func_id != BPF_FUNC_ringbuf_query &&
		    func_id != BPF_FUNC_ringbuf_reserve_dynptr &&
		    func_id != BPF_FUNC_ringbuf_submit_dynptr &&
		    func_id != BPF_FUNC_ringbuf_discard_dynptr)
			goto error;
		break;
	case BPF_MAP_TYPE_STACK_TRACE:
		if (func_id != BPF_FUNC_get_stackid)
			goto error;
		break;
So I chose the easiest way to do it, which is collecting stack traces
__twice__: once for direct samples and once for at-the-end samples. We
could do similar stack_id saving for direct off-cpu samples and then
access that stack trace map in post. Do you think we should do that
instead?
>
> > +
> > + /*
> > + * if stacks are successfully collected, cache them to task_storage, they are then
> > + * dumped if the off-cpu time hits the threshold.
> > + */
> > + if (len > 0) {
> > + stack_p = bpf_task_storage_get(&stack_cache, prev, NULL,
> > + BPF_LOCAL_STORAGE_GET_F_CREATE);
> > + if (stack_p) {
> > + for (int i = 0; i < len && i < MAX_STACKS; ++i)
> > + stack_p->array[i] = stack_tmp_p->array[i];
> > + }
> > + }
>
> Why not use the task storage to get the stacktrace directly?
I'm sorry, could you elaborate? I thought we were saving prev's (the
process that gets sched_out) stack traces for later dumping (when it is
sched_in again).
Thanks,
Howard
>
> Thanks,
> Namhyung
>
> > +
> > pelem = bpf_task_storage_get(&tstamp, prev, NULL,
> > BPF_LOCAL_STORAGE_GET_F_CREATE);
> > if (!pelem)
> > @@ -228,6 +353,14 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> > else
> > bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> >
> > + if (delta >= offcpu_thresh) {
> > + struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
> > +
> > + stack_p = bpf_task_storage_get(&stack_cache, next, NULL, 0);
> > + if (data && stack_p)
> > + off_cpu_dump(ctx, data, &key, stack_p, delta, pelem->timestamp);
> > + }
> > +
> > /* prevent to reuse the timestamp later */
> > pelem->timestamp = 0;
> > }
> > --
> > 2.45.2
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 8/9] perf header: Add field 'embed'
2024-08-07 23:52 ` Namhyung Kim
@ 2024-08-08 13:57 ` Howard Chu
0 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-08-08 13:57 UTC (permalink / raw)
To: Namhyung Kim
Cc: irogers, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Hello,
On Thu, Aug 8, 2024 at 7:52 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Wed, Aug 07, 2024 at 11:38:42PM +0800, Howard Chu wrote:
> > We have to save the embedded data's sample type for it to be consumed
> > correctly by perf script or perf report.
> >
> > This approach will most definitely break some perf.data converters.
> >
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > ---
> > tools/perf/util/header.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
> > index 55e9553861d0..d60e77d5c25c 100644
> > --- a/tools/perf/util/header.c
> > +++ b/tools/perf/util/header.c
> > @@ -80,6 +80,7 @@ const char perf_version_string[] = PERF_VERSION;
> >
> > struct perf_file_attr {
> > struct perf_event_attr attr;
> > + __u64 embed;
>
> Can we just set bpf_output.attr correctly and get rid of this?
"I think this is ugly as well. I believe I can sneak sample_type_embed
into struct perf_event_attr without side effects, but some hacks are
needed. Firstly, the __u64 inherit field can be used to store __u64
sample_type_embed. If BPF output can be collected, we can guarantee
that inherit is 0 (remember when parsing the off-CPU time event, we
used bpf-output/no-inherit=1,name=%s/), so we are certain about its
value. And since it's also a u64, sample_type_embed will fit.
Additionally, it isn't used much in the userspace perf tool (nothing
related to parsing and visualization), so maybe after
perf_event_open(), we can overwrite it."
But that leads to another question of mine: do you think we should
make this 'sample_type_embed' thing only for off-cpu, or is there a
possibility of scaling it? The reason I ask is that there are things
that aren't API-ish in this patch, such as overwriting 'inherit' for
sample_type_embed; it's clearly a hack. I don't have enough foresight
on this matter. Could you please provide some insight? This could
impact the direction of my work. For example, if it's only used by
off-cpu, the implementation could likely become much simpler.
Thanks,
Howard
>
> Thanks,
> Namhyung
>
>
> > struct perf_file_section ids;
> > };
> >
> > @@ -3713,6 +3714,7 @@ static int perf_session__do_write_header(struct perf_session *session,
> > }
> > f_attr = (struct perf_file_attr){
> > .attr = evsel->core.attr,
> > + .embed = evsel->sample_type_embed,
> > .ids = {
> > .offset = evsel->id_offset,
> > .size = evsel->core.ids * sizeof(u64),
> > @@ -4147,6 +4149,14 @@ static int read_attr(int fd, struct perf_header *ph,
> >
> > ret = readn(fd, ptr, left);
> > }
> > +
> > + ret = readn(fd, &f_attr->embed, sizeof(f_attr->embed));
> > + if (ret <= 0) {
> > + pr_debug("failed to read %d bytes of embedded sample type\n",
> > + (int)sizeof(f_attr->embed));
> > + return -1;
> > + }
> > +
> > /* read perf_file_section, ids are read in caller */
> > ret = readn(fd, &f_attr->ids, sizeof(f_attr->ids));
> >
> > @@ -4272,6 +4282,8 @@ int perf_session__read_header(struct perf_session *session, int repipe_fd)
> > tmp = lseek(fd, 0, SEEK_CUR);
> > evsel = evsel__new(&f_attr.attr);
> >
> > + evsel->sample_type_embed = f_attr.embed;
> > +
> > if (evsel == NULL)
> > goto out_delete_evlist;
> >
> > --
> > 2.45.2
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/9] perf evsel: Set BPF output to system-wide
2024-08-08 3:58 ` Howard Chu
@ 2024-09-25 2:53 ` Ian Rogers
2024-09-25 7:07 ` Howard Chu
0 siblings, 1 reply; 24+ messages in thread
From: Ian Rogers @ 2024-09-25 2:53 UTC (permalink / raw)
To: Howard Chu
Cc: Namhyung Kim, acme, adrian.hunter, jolsa, kan.liang,
linux-perf-users, linux-kernel
On Wed, Aug 7, 2024 at 8:58 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> Hello,
>
> The event does open, but bpf_perf_event_output() in BPF will return
> -95 (-EOPNOTSUPP), so there is no output. I think this EOPNOTSUPP comes
> not from bpf_trace.c's __bpf_perf_event_output(), but from perf_event's
> perf_event_output() called by BPF.
>
> <idle>-0 [001] d..4. 154921.079230: bpf_trace_printk: err -95
>
> This is also a bug in perf trace -p <PID>.
>
> Thanks,
> Howard
>
> On Thu, Aug 8, 2024 at 7:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Wed, Aug 07, 2024 at 11:38:35PM +0800, Howard Chu wrote:
> > > pid = -1 for bpf-output event.
> > >
> > > This makes perf record -p <PID> --off-cpu work. Otherwise bpf-output
> > > cannot be collected.
> >
> > I don't understand why it's necessary. Why isn't it collected?
> > Is it the kernel to reject the BPF output event to open?
> >
> > Thanks,
> > Namhyung
> >
> > >
> > > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > > ---
> > > tools/perf/util/evsel.c | 4 ++++
> > > 1 file changed, 4 insertions(+)
> > >
> > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > > index bc603193c477..b961467133cf 100644
> > > --- a/tools/perf/util/evsel.c
> > > +++ b/tools/perf/util/evsel.c
> > > @@ -2282,6 +2282,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> > >
> > > test_attr__ready();
> > >
> > > + /* BPF output event can only be system-wide */
> > > + if (evsel__is_bpf_output(evsel))
> > > + pid = -1;
This matches with libbpf:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/bpf/libbpf.c#n11967
Thanks,
Ian
> > > +
> > > /* Debug message used by test scripts */
> > > pr_debug2_peo("sys_perf_event_open: pid %d cpu %d group_fd %d flags %#lx",
> > > pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
> > > --
> > > 2.45.2
> > >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (8 preceding siblings ...)
2024-08-07 15:38 ` [PATCH v4 9/9] perf test: Add direct off-cpu dumping test Howard Chu
@ 2024-09-25 3:04 ` Ian Rogers
2024-09-25 6:59 ` Howard Chu
9 siblings, 1 reply; 24+ messages in thread
From: Ian Rogers @ 2024-09-25 3:04 UTC (permalink / raw)
To: Howard Chu
Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
On Wed, Aug 7, 2024 at 8:38 AM Howard Chu <howardchu95@gmail.com> wrote:
>
> Changes in v4:
> - Minimize the size of data output by perf_event_output()
> - Keep only one off-cpu event
> - Change off-cpu threshold's unit to microseconds
> - Set a default off-cpu threshold
> - Print the correct error message for the field 'embed' in perf data header
Hi Howard,
I think this all looks good. There is an outstanding comment for patch
2/9 where a definition in a later patch needs moving to patch 2/9.
There is also a comment in 3/9 where you say you will turn a "=" back
to an "&=". These seem like small things to fix, could you rebase and
post the series? Are there other outstanding issues I'm missing? I'd
quite like to see this land.
Thanks,
Ian
> Changes in v3:
> - Add off-cpu-thresh argument
> - Process direct off-cpu samples in post
>
> Changes in v2:
> - Remove unnecessary comments.
> - Rename function off_cpu_change_type to off_cpu_prepare_parse
>
> v1:
>
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
>
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples appearing after the regular samples. This patch
> series makes it possible to dump off-cpu samples on the fly, directly into
> the perf ring buffer, and dispatches those samples in the correct format
> for perf.data consumers.
>
> Before:
> ```
> migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
>
> sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
> 79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> 585690935cca [unknown] (/usr/bin/sshd)
> ```
>
> After:
> ```
> perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
> 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 7fff24e862d8 [unknown] ([unknown])
>
>
> blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
> 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 7fff24e862d8 [unknown] ([unknown])
>
>
> swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> ```
>
>
> Howard Chu (9):
> perf evsel: Set BPF output to system-wide
> perf record --off-cpu: Add --off-cpu-thresh
> perf record --off-cpu: Parse offcpu-time event
> perf record off-cpu: Dump direct off-cpu samples in BPF
> perf record --off-cpu: Dump total off-cpu time at the end.
> perf evsel: Delete unnecessary = 0
> perf record --off-cpu: Parse BPF output embedded data
> perf header: Add field 'embed'
> perf test: Add direct off-cpu dumping test
>
> tools/perf/builtin-record.c | 26 ++++
> tools/perf/builtin-script.c | 5 +-
> tools/perf/tests/builtin-test.c | 1 +
> tools/perf/tests/shell/record_offcpu.sh | 27 ++++
> tools/perf/tests/tests.h | 1 +
> tools/perf/tests/workloads/Build | 1 +
> tools/perf/tests/workloads/offcpu.c | 16 +++
> tools/perf/util/bpf_off_cpu.c | 176 +++++++++++++++---------
> tools/perf/util/bpf_skel/off_cpu.bpf.c | 135 ++++++++++++++++++
> tools/perf/util/evsel.c | 49 ++++---
> tools/perf/util/evsel.h | 13 ++
> tools/perf/util/header.c | 12 ++
> tools/perf/util/off_cpu.h | 10 +-
> tools/perf/util/record.h | 1 +
> tools/perf/util/session.c | 16 ++-
> 15 files changed, 401 insertions(+), 88 deletions(-)
> create mode 100644 tools/perf/tests/workloads/offcpu.c
>
> --
> 2.45.2
>
* Re: [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly
2024-09-25 3:04 ` [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Ian Rogers
@ 2024-09-25 6:59 ` Howard Chu
0 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-09-25 6:59 UTC (permalink / raw)
To: Ian Rogers
Cc: namhyung, acme, adrian.hunter, jolsa, kan.liang, linux-perf-users,
linux-kernel
Hello Ian,
On Tue, Sep 24, 2024 at 8:04 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Aug 7, 2024 at 8:38 AM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Changes in v4:
> > - Minimize the size of data output by perf_event_output()
> > - Keep only one off-cpu event
> > - Change off-cpu threshold's unit to microseconds
> > - Set a default off-cpu threshold
> > - Print the correct error message for the field 'embed' in perf data header
>
> Hi Howard,
>
> I think this all looks good. There is an outstanding comment for patch
> 2/9 where a definition in a later patch needs moving to patch 2/9.
> There is also a comment in 3/9 where you say you will turn a "=" back
> to an "&=". These seem like small things to fix, could you rebase and
> post the series? Are there other outstanding issues I'm missing? I'd
Thank you, I fixed them. Namhyung said the sample_embed thing should
be handled with care, so I'll make sample_embed a special case for
off-cpu and hard-code it, so the patches will look more concise. I'll
do this, rebase, and send the series. Sorry for taking so long. (〒︿〒)
Thanks,
Howard
> quite like to see this land.
>
> Thanks,
> Ian
>
> > Changes in v3:
> > - Add off-cpu-thresh argument
> > - Process direct off-cpu samples in post
> >
> > Changes in v2:
> > - Remove unnecessary comments.
> > - Rename function off_cpu_change_type to off_cpu_prepare_parse
> >
> > v1:
> >
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record is exiting. This
> > results in off-cpu samples appearing after the regular samples. This patch
> > series makes it possible to dump off-cpu samples on the fly, directly into
> > the perf ring buffer, and dispatches those samples in the correct format
> > for perf.data consumers.
> >
> > Before:
> > ```
> > migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> > perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> > perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> > perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> > migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> > migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> >
> > sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
> > 79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> > 585690935cca [unknown] (/usr/bin/sshd)
> > ```
> >
> > After:
> > ```
> > perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> > perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> > swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
> > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > 7fff24e862d8 [unknown] ([unknown])
> >
> >
> > blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
> > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > 7fff24e862d8 [unknown] ([unknown])
> >
> >
> > swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > ```
> >
> >
> > Howard Chu (9):
> > perf evsel: Set BPF output to system-wide
> > perf record --off-cpu: Add --off-cpu-thresh
> > perf record --off-cpu: Parse offcpu-time event
> > perf record off-cpu: Dump direct off-cpu samples in BPF
> > perf record --off-cpu: Dump total off-cpu time at the end.
> > perf evsel: Delete unnecessary = 0
> > perf record --off-cpu: Parse BPF output embedded data
> > perf header: Add field 'embed'
> > perf test: Add direct off-cpu dumping test
> >
> > tools/perf/builtin-record.c | 26 ++++
> > tools/perf/builtin-script.c | 5 +-
> > tools/perf/tests/builtin-test.c | 1 +
> > tools/perf/tests/shell/record_offcpu.sh | 27 ++++
> > tools/perf/tests/tests.h | 1 +
> > tools/perf/tests/workloads/Build | 1 +
> > tools/perf/tests/workloads/offcpu.c | 16 +++
> > tools/perf/util/bpf_off_cpu.c | 176 +++++++++++++++---------
> > tools/perf/util/bpf_skel/off_cpu.bpf.c | 135 ++++++++++++++++++
> > tools/perf/util/evsel.c | 49 ++++---
> > tools/perf/util/evsel.h | 13 ++
> > tools/perf/util/header.c | 12 ++
> > tools/perf/util/off_cpu.h | 10 +-
> > tools/perf/util/record.h | 1 +
> > tools/perf/util/session.c | 16 ++-
> > 15 files changed, 401 insertions(+), 88 deletions(-)
> > create mode 100644 tools/perf/tests/workloads/offcpu.c
> >
> > --
> > 2.45.2
> >
* Re: [PATCH v4 1/9] perf evsel: Set BPF output to system-wide
2024-09-25 2:53 ` Ian Rogers
@ 2024-09-25 7:07 ` Howard Chu
0 siblings, 0 replies; 24+ messages in thread
From: Howard Chu @ 2024-09-25 7:07 UTC (permalink / raw)
To: Ian Rogers
Cc: Namhyung Kim, acme, adrian.hunter, jolsa, kan.liang,
linux-perf-users, linux-kernel
Hello,
On Tue, Sep 24, 2024 at 7:53 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Aug 7, 2024 at 8:58 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Hello,
> >
> > The event does open, but bpf_perf_event_output() in BPF will return
> > -95 (-EOPNOTSUPP), so there is no output. I think this EOPNOTSUPP does
> > not come from bpf_trace.c's __bpf_perf_event_output(), but from the
> > perf_event core's perf_event_output(), called from BPF.
> >
> > <idle>-0 [001] d..4. 154921.079230: bpf_trace_printk: err -95
> >
> > This is also a bug in perf trace -p <PID>.
> >
> > Thanks,
> > Howard
> >
> > On Thu, Aug 8, 2024 at 7:21 AM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > On Wed, Aug 07, 2024 at 11:38:35PM +0800, Howard Chu wrote:
> > > > pid = -1 for bpf-output event.
> > > >
> > > > This makes perf record -p <PID> --off-cpu work. Otherwise bpf-output
> > > > cannot be collected.
> > >
> > > I don't understand why it's necessary. Why isn't it collected?
> > > Is it that the kernel rejects opening the BPF output event?
> > >
> > > Thanks,
> > > Namhyung
> > >
> > > >
> > > > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > > > ---
> > > > tools/perf/util/evsel.c | 4 ++++
> > > > 1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > > > index bc603193c477..b961467133cf 100644
> > > > --- a/tools/perf/util/evsel.c
> > > > +++ b/tools/perf/util/evsel.c
> > > > @@ -2282,6 +2282,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> > > >
> > > > test_attr__ready();
> > > >
> > > > + /* BPF output event can only be system-wide */
> > > > + if (evsel__is_bpf_output(evsel))
> > > > + pid = -1;
>
> This matches with libbpf:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/bpf/libbpf.c#n11967
Thanks for pointing it out, that's a great reference. Namhyung
actually came up with a very good solution, and based on his idea I
posted this patch series:
https://lore.kernel.org/linux-perf-users/20240827092013.1596-1-howardchu95@gmail.com/
Thanks,
Howard
>
> Thanks,
> Ian
>
> > > > +
> > > > /* Debug message used by test scripts */
> > > > pr_debug2_peo("sys_perf_event_open: pid %d cpu %d group_fd %d flags %#lx",
> > > > pid, perf_cpu_map__cpu(cpus, idx).cpu, group_fd, evsel->open_flags);
> > > > --
> > > > 2.45.2
> > > >
end of thread, other threads:[~2024-09-25 7:06 UTC | newest]
Thread overview: 24+ messages
2024-08-07 15:38 [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2024-08-07 15:38 ` [PATCH v4 1/9] perf evsel: Set BPF output to system-wide Howard Chu
2024-08-07 23:21 ` Namhyung Kim
2024-08-08 3:58 ` Howard Chu
2024-09-25 2:53 ` Ian Rogers
2024-09-25 7:07 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 2/9] perf record --off-cpu: Add --off-cpu-thresh Howard Chu
2024-08-07 23:22 ` Namhyung Kim
2024-08-08 4:01 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 3/9] perf record --off-cpu: Parse offcpu-time event Howard Chu
2024-08-07 23:25 ` Namhyung Kim
2024-08-08 8:52 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 4/9] perf record off-cpu: Dump direct off-cpu samples in BPF Howard Chu
2024-08-07 23:49 ` Namhyung Kim
2024-08-08 11:57 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 5/9] perf record --off-cpu: Dump total off-cpu time at the end Howard Chu
2024-08-07 15:38 ` [PATCH v4 6/9] perf evsel: Delete unnecessary = 0 Howard Chu
2024-08-07 15:38 ` [PATCH v4 7/9] perf record --off-cpu: Parse BPF output embedded data Howard Chu
2024-08-07 15:38 ` [PATCH v4 8/9] perf header: Add field 'embed' Howard Chu
2024-08-07 23:52 ` Namhyung Kim
2024-08-08 13:57 ` Howard Chu
2024-08-07 15:38 ` [PATCH v4 9/9] perf test: Add direct off-cpu dumping test Howard Chu
2024-09-25 3:04 ` [PATCH v4 0/9] perf record --off-cpu: Dump off-cpu samples directly Ian Rogers
2024-09-25 6:59 ` Howard Chu