* [PATCH v14 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 02/10] perf record --off-cpu: Parse off-cpu event Howard Chu
` (9 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Expose evsel__is_offcpu_event() so it can be used in
off_cpu_config(), evsel__parse_sample(), and perf script.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-3-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/evsel.c | 2 +-
tools/perf/util/evsel.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c93311e502cf..94eb3dc034f7 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1242,7 +1242,7 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
}
}
-static bool evsel__is_offcpu_event(struct evsel *evsel)
+bool evsel__is_offcpu_event(struct evsel *evsel)
{
return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
}
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 93b6244ec302..f5004f92f0c8 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -553,4 +553,6 @@ u64 evsel__bitfield_swap_branch_flags(u64 value);
void evsel__set_config_if_unset(struct perf_pmu *pmu, struct evsel *evsel,
const char *config_name, u64 val);
+bool evsel__is_offcpu_event(struct evsel *evsel);
+
#endif /* __PERF_EVSEL_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 02/10] perf record --off-cpu: Parse off-cpu event
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2024-12-15 18:12 ` [PATCH v14 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 03/10] perf record --off-cpu: Preparation of off-cpu BPF program Howard Chu
` (8 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Parse the off-cpu event using parse_event(), as bpf-output.
no-inherit should be set to 1, here's the reason:
We update the BPF perf_event map for direct off-cpu sample dumping (in
following patches), it executes as follows:
bpf_map_update_value()
bpf_fd_array_map_update_elem()
perf_event_fd_array_get_ptr()
perf_event_read_local()
In perf_event_read_local(), there is:
int perf_event_read_local(struct perf_event *event, u64 *value,
u64 *enabled, u64 *running)
{
...
/*
* It must not be an event with inherit set, we cannot read
* all child counters from atomic context.
*/
if (event->attr.inherit) {
ret = -EOPNOTSUPP;
goto out;
}
Which means no-inherit has to be true for updating the BPF perf_event
map.
Moreover, for bpf-output events, we primarily want a system-wide event
instead of a per-task event.
The reason is that in BPF's bpf_perf_event_output(), BPF uses the CPU
index to retrieve the perf_event file descriptor it outputs to.
Making a bpf-output event system-wide naturally satisfies this
requirement by mapping CPU appropriately.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-4-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/bpf_off_cpu.c | 33 +++++++++++----------------------
tools/perf/util/evsel.c | 4 +++-
2 files changed, 14 insertions(+), 23 deletions(-)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index a590a8ac1f9d..9275b022b2ea 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -38,32 +38,21 @@ union off_cpu_data {
static int off_cpu_config(struct evlist *evlist)
{
+ char off_cpu_event[64];
struct evsel *evsel;
- struct perf_event_attr attr = {
- .type = PERF_TYPE_SOFTWARE,
- .config = PERF_COUNT_SW_BPF_OUTPUT,
- .size = sizeof(attr), /* to capture ABI version */
- };
- char *evname = strdup(OFFCPU_EVENT);
-
- if (evname == NULL)
- return -ENOMEM;
- evsel = evsel__new(&attr);
- if (!evsel) {
- free(evname);
- return -ENOMEM;
+ scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/name=%s/", OFFCPU_EVENT);
+ if (parse_event(evlist, off_cpu_event)) {
+ pr_err("Failed to open off-cpu event\n");
+ return -1;
}
- evsel->core.attr.freq = 1;
- evsel->core.attr.sample_period = 1;
- /* off-cpu analysis depends on stack trace */
- evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
-
- evlist__add(evlist, evsel);
-
- free(evsel->name);
- evsel->name = evname;
+ evlist__for_each_entry(evlist, evsel) {
+ if (evsel__is_offcpu_event(evsel)) {
+ evsel->core.system_wide = true;
+ break;
+ }
+ }
return 0;
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 94eb3dc034f7..fc33104cd659 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1521,8 +1521,10 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
if (evsel__is_dummy_event(evsel))
evsel__reset_sample_bit(evsel, BRANCH_STACK);
- if (evsel__is_offcpu_event(evsel))
+ if (evsel__is_offcpu_event(evsel)) {
evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
+ attr->inherit = 0;
+ }
arch__post_evsel_config(evsel, attr);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 03/10] perf record --off-cpu: Preparation of off-cpu BPF program
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2024-12-15 18:12 ` [PATCH v14 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use Howard Chu
2024-12-15 18:12 ` [PATCH v14 02/10] perf record --off-cpu: Parse off-cpu event Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 04/10] perf record --off-cpu: Dump off-cpu samples in BPF Howard Chu
` (7 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Set the perf_event map in BPF for dumping off-cpu samples.
Set the offcpu_thresh to specify the threshold.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-5-howardchu95@gmail.com
[ Added some missing iteration variables to off_cpu_config() and fixed up
a manually edited patch hunk line boundary line ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/bpf_off_cpu.c | 23 +++++++++++++++++++++++
tools/perf/util/bpf_skel/off_cpu.bpf.c | 9 +++++++++
2 files changed, 32 insertions(+)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 9275b022b2ea..84c3bf6a9614 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -13,6 +13,7 @@
#include "util/cgroup.h"
#include "util/strlist.h"
#include <bpf/bpf.h>
+#include <internal/xyarray.h>
#include "bpf_skel/off_cpu.skel.h"
@@ -60,6 +61,9 @@ static int off_cpu_config(struct evlist *evlist)
static void off_cpu_start(void *arg)
{
struct evlist *evlist = arg;
+ struct evsel *evsel;
+ struct perf_cpu pcpu;
+ int i;
/* update task filter for the given workload */
if (skel->rodata->has_task && skel->rodata->uses_tgid &&
@@ -73,6 +77,25 @@ static void off_cpu_start(void *arg)
bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
}
+ /* update BPF perf_event map */
+ evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+ if (evsel == NULL) {
+ pr_err("%s evsel not found\n", OFFCPU_EVENT);
+ return;
+ }
+
+ perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
+ int err;
+
+ err = bpf_map__update_elem(skel->maps.offcpu_output, &pcpu.cpu, sizeof(__u32),
+ xyarray__entry(evsel->core.fd, i, 0),
+ sizeof(__u32), BPF_ANY);
+ if (err) {
+ pr_err("Failed to update perf event map for direct off-cpu dumping\n");
+ return;
+ }
+ }
+
skel->bss->enabled = 1;
}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index c152116df72f..1cdd4d63ea92 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -18,6 +18,8 @@
#define MAX_STACKS 32
#define MAX_ENTRIES 102400
+#define MAX_CPUS 4096
+
struct tstamp_data {
__u32 stack_id;
__u32 state;
@@ -39,6 +41,13 @@ struct {
__uint(max_entries, MAX_ENTRIES);
} stacks SEC(".maps");
+struct {
+ __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(__u32));
+ __uint(max_entries, MAX_CPUS);
+} offcpu_output SEC(".maps");
+
struct {
__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 04/10] perf record --off-cpu: Dump off-cpu samples in BPF
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (2 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 03/10] perf record --off-cpu: Preparation of off-cpu BPF program Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2025-02-11 17:28 ` Arnaldo Carvalho de Melo
2024-12-15 18:12 ` [PATCH v14 05/10] perf evsel: Assemble offcpu samples Howard Chu
` (6 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Collect tid, period, callchain, and cgroup id and dump them when off-cpu
time threshold is reached.
We don't collect the off-cpu time twice (the delta), it's either in
direct samples, or accumulated samples that are dumped at the end of
perf.data.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-6-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/bpf_skel/off_cpu.bpf.c | 88 ++++++++++++++++++++++++--
1 file changed, 83 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 1cdd4d63ea92..77fdc9e81db3 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -19,11 +19,17 @@
#define MAX_ENTRIES 102400
#define MAX_CPUS 4096
+#define MAX_OFFCPU_LEN 37
+
+struct stack {
+ u64 array[MAX_STACKS];
+};
struct tstamp_data {
__u32 stack_id;
__u32 state;
__u64 timestamp;
+ struct stack stack;
};
struct offcpu_key {
@@ -41,6 +47,10 @@ struct {
__uint(max_entries, MAX_ENTRIES);
} stacks SEC(".maps");
+struct offcpu_data {
+ u64 array[MAX_OFFCPU_LEN];
+};
+
struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
@@ -48,6 +58,13 @@ struct {
__uint(max_entries, MAX_CPUS);
} offcpu_output SEC(".maps");
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(__u32));
+ __uint(value_size, sizeof(struct offcpu_data));
+ __uint(max_entries, 1);
+} offcpu_payload SEC(".maps");
+
struct {
__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
@@ -106,6 +123,8 @@ const volatile bool uses_cgroup_v1 = false;
int perf_subsys_id = -1;
+__u64 offcpu_thresh_ns = 500000000ull;
+
/*
* Old kernel used to call it task_struct->state and now it's '__state'.
* Use BPF CO-RE "ignored suffix rule" to deal with it like below:
@@ -192,6 +211,47 @@ static inline int can_record(struct task_struct *t, int state)
return 1;
}
+static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
+{
+ int len = 0;
+
+ for (int i = 0; i < MAX_STACKS && from->array[i]; ++i, ++len)
+ to->array[n + 2 + i] = from->array[i];
+
+ return len;
+}
+
+/**
+ * off_cpu_dump - dump off-cpu samples to ring buffer
+ * @data: payload for dumping off-cpu samples
+ * @key: off-cpu data
+ * @stack: stack trace of the task before being scheduled out
+ *
+ * If the threshold of off-cpu time is reached, acquire tid, period, callchain, and cgroup id
+ * information of the task, and dump it as a raw sample to perf ring buffer
+ */
+static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
+ struct stack *stack, __u64 delta)
+{
+ int n = 0, len = 0;
+
+ data->array[n++] = (u64)key->tgid << 32 | key->pid;
+ data->array[n++] = delta;
+
+ /* data->array[n] is callchain->nr (updated later) */
+ data->array[n + 1] = PERF_CONTEXT_USER;
+ data->array[n + 2] = 0;
+ len = copy_stack(stack, data, n);
+
+ /* update length of callchain */
+ data->array[n] = len + 1;
+ n += len + 2;
+
+ data->array[n++] = key->cgroup_id;
+
+ return bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, n * sizeof(u64));
+}
+
static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
struct task_struct *next, int state)
{
@@ -216,6 +276,16 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
pelem->state = state;
pelem->stack_id = stack_id;
+ /*
+ * If stacks are successfully collected by bpf_get_stackid(), collect them once more
+ * in task_storage for direct off-cpu sample dumping
+ */
+ if (stack_id > 0 && bpf_get_stack(ctx, &pelem->stack, MAX_STACKS * sizeof(u64), BPF_F_USER_STACK)) {
+ /*
+ * This empty if block is used to avoid 'result unused warning' from bpf_get_stack().
+ * If the collection fails, continue with the logic for the next task.
+ */
+ }
next:
pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
@@ -230,11 +300,19 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
__u64 delta = ts - pelem->timestamp;
__u64 *total;
- total = bpf_map_lookup_elem(&off_cpu, &key);
- if (total)
- *total += delta;
- else
- bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+ if (delta >= offcpu_thresh_ns) {
+ int zero = 0;
+ struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
+
+ if (data)
+ off_cpu_dump(ctx, data, &key, &pelem->stack, delta);
+ } else {
+ total = bpf_map_lookup_elem(&off_cpu, &key);
+ if (total)
+ *total += delta;
+ else
+ bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+ }
/* prevent to reuse the timestamp later */
pelem->timestamp = 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v14 04/10] perf record --off-cpu: Dump off-cpu samples in BPF
2024-12-15 18:12 ` [PATCH v14 04/10] perf record --off-cpu: Dump off-cpu samples in BPF Howard Chu
@ 2025-02-11 17:28 ` Arnaldo Carvalho de Melo
2025-02-11 17:31 ` Howard Chu
0 siblings, 1 reply; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-02-11 17:28 UTC (permalink / raw)
To: Howard Chu
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
On Sun, Dec 15, 2024 at 10:12:14AM -0800, Howard Chu wrote:
> Collect tid, period, callchain, and cgroup id and dump them when off-cpu
> time threshold is reached.
>
> We don't collect the off-cpu time twice (the delta), it's either in
> direct samples, or accumulated samples that are dumped at the end of
> perf.data.
>
> Suggested-by: Namhyung Kim <namhyung@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Ian Rogers <irogers@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Clark <james.clark@linaro.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Kan Liang <kan.liang@linux.intel.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20241108204137.2444151-6-howardchu95@gmail.com
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
> tools/perf/util/bpf_skel/off_cpu.bpf.c | 88 ++++++++++++++++++++++++--
> 1 file changed, 83 insertions(+), 5 deletions(-)
>
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index 1cdd4d63ea92..77fdc9e81db3 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -19,11 +19,17 @@
> #define MAX_ENTRIES 102400
>
> #define MAX_CPUS 4096
> +#define MAX_OFFCPU_LEN 37
> +
> +struct stack {
> + u64 array[MAX_STACKS];
> +};
So I needed to rename the above as it fails 'make -C tools/perf
build-test', the direct make command line to hit that:
⬢ [acme@toolbox perf-tools-next]$ rm -rf /tmp/build/$(basename $PWD)/ ; mkdir -p /tmp/build/$(basename $PWD)/
⬢ [acme@toolbox perf-tools-next]$ alias m='rm -rf ~/libexec/perf-core/ ; make -k GEN_VMLINUX_H=1 CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin && perf test python && cat /tmp/build/perf-tools-next/feature/test-all.output'
⬢ [acme@toolbox perf-tools-next]$ m
GENSKEL /tmp/build/perf-tools-next/util/bpf_skel/sample_filter.skel.h
util/bpf_skel/off_cpu.bpf.c:24:8: error: redefinition of 'stack'
24 | struct stack {
| ^
/tmp/build/perf-tools-next/util/bpf_skel/.tmp/../vmlinux.h:128096:8: note: previous definition is here
128096 | struct stack {
| ^
util/bpf_skel/off_cpu.bpf.c:218:42: error: no member named 'array' in 'struct stack'
218 | for (int i = 0; i < MAX_STACKS && from->array[i]; ++i, ++len)
| ~~~~ ^
util/bpf_skel/off_cpu.bpf.c:219:32: error: no member named 'array' in 'struct stack'
219 | to->array[n + 2 + i] = from->array[i];
| ~~~~ ^
3 errors generated.
make[2]: *** [Makefile.perf:1276: /tmp/build/perf-tools-next/util/bpf_skel/.tmp/off_cpu.bpf.o] Error 1
So for now I'm adding the patch below to get it going.
IIRC there was some discussion about ditching GEN_VMLINUX_H=1, we can
remove this hack later if/when we progress on that.
- Arnaldo
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 77fdc9e81db395d1..848a123e5610f17b 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -21,7 +21,8 @@
#define MAX_CPUS 4096
#define MAX_OFFCPU_LEN 37
-struct stack {
+// We have a 'struct stack' in vmlinux.h when building with GEN_VMLINUX_H=1
+struct __stack {
u64 array[MAX_STACKS];
};
@@ -29,7 +30,7 @@ struct tstamp_data {
__u32 stack_id;
__u32 state;
__u64 timestamp;
- struct stack stack;
+ struct __stack stack;
};
struct offcpu_key {
@@ -211,7 +212,7 @@ static inline int can_record(struct task_struct *t, int state)
return 1;
}
-static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
+static inline int copy_stack(struct __stack *from, struct offcpu_data *to, int n)
{
int len = 0;
@@ -231,7 +232,7 @@ static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
* information of the task, and dump it as a raw sample to perf ring buffer
*/
static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
- struct stack *stack, __u64 delta)
+ struct __stack *stack, __u64 delta)
{
int n = 0, len = 0;
> struct tstamp_data {
> __u32 stack_id;
> __u32 state;
> __u64 timestamp;
> + struct stack stack;
> };
>
> struct offcpu_key {
> @@ -41,6 +47,10 @@ struct {
> __uint(max_entries, MAX_ENTRIES);
> } stacks SEC(".maps");
>
> +struct offcpu_data {
> + u64 array[MAX_OFFCPU_LEN];
> +};
> +
> struct {
> __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> __uint(key_size, sizeof(__u32));
> @@ -48,6 +58,13 @@ struct {
> __uint(max_entries, MAX_CPUS);
> } offcpu_output SEC(".maps");
>
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> + __uint(key_size, sizeof(__u32));
> + __uint(value_size, sizeof(struct offcpu_data));
> + __uint(max_entries, 1);
> +} offcpu_payload SEC(".maps");
> +
> struct {
> __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> __uint(map_flags, BPF_F_NO_PREALLOC);
> @@ -106,6 +123,8 @@ const volatile bool uses_cgroup_v1 = false;
>
> int perf_subsys_id = -1;
>
> +__u64 offcpu_thresh_ns = 500000000ull;
> +
> /*
> * Old kernel used to call it task_struct->state and now it's '__state'.
> * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> @@ -192,6 +211,47 @@ static inline int can_record(struct task_struct *t, int state)
> return 1;
> }
>
> +static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
> +{
> + int len = 0;
> +
> + for (int i = 0; i < MAX_STACKS && from->array[i]; ++i, ++len)
> + to->array[n + 2 + i] = from->array[i];
> +
> + return len;
> +}
> +
> +/**
> + * off_cpu_dump - dump off-cpu samples to ring buffer
> + * @data: payload for dumping off-cpu samples
> + * @key: off-cpu data
> + * @stack: stack trace of the task before being scheduled out
> + *
> + * If the threshold of off-cpu time is reached, acquire tid, period, callchain, and cgroup id
> + * information of the task, and dump it as a raw sample to perf ring buffer
> + */
> +static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
> + struct stack *stack, __u64 delta)
> +{
> + int n = 0, len = 0;
> +
> + data->array[n++] = (u64)key->tgid << 32 | key->pid;
> + data->array[n++] = delta;
> +
> + /* data->array[n] is callchain->nr (updated later) */
> + data->array[n + 1] = PERF_CONTEXT_USER;
> + data->array[n + 2] = 0;
> + len = copy_stack(stack, data, n);
> +
> + /* update length of callchain */
> + data->array[n] = len + 1;
> + n += len + 2;
> +
> + data->array[n++] = key->cgroup_id;
> +
> + return bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, n * sizeof(u64));
> +}
> +
> static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> struct task_struct *next, int state)
> {
> @@ -216,6 +276,16 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> pelem->state = state;
> pelem->stack_id = stack_id;
>
> + /*
> + * If stacks are successfully collected by bpf_get_stackid(), collect them once more
> + * in task_storage for direct off-cpu sample dumping
> + */
> + if (stack_id > 0 && bpf_get_stack(ctx, &pelem->stack, MAX_STACKS * sizeof(u64), BPF_F_USER_STACK)) {
> + /*
> + * This empty if block is used to avoid 'result unused warning' from bpf_get_stack().
> + * If the collection fails, continue with the logic for the next task.
> + */
> + }
> next:
> pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
>
> @@ -230,11 +300,19 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> __u64 delta = ts - pelem->timestamp;
> __u64 *total;
>
> - total = bpf_map_lookup_elem(&off_cpu, &key);
> - if (total)
> - *total += delta;
> - else
> - bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> + if (delta >= offcpu_thresh_ns) {
> + int zero = 0;
> + struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
> +
> + if (data)
> + off_cpu_dump(ctx, data, &key, &pelem->stack, delta);
> + } else {
> + total = bpf_map_lookup_elem(&off_cpu, &key);
> + if (total)
> + *total += delta;
> + else
> + bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> + }
>
> /* prevent to reuse the timestamp later */
> pelem->timestamp = 0;
> --
> 2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v14 04/10] perf record --off-cpu: Dump off-cpu samples in BPF
2025-02-11 17:28 ` Arnaldo Carvalho de Melo
@ 2025-02-11 17:31 ` Howard Chu
0 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2025-02-11 17:31 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Hello Arnaldo,
On Tue, Feb 11, 2025 at 9:28 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Sun, Dec 15, 2024 at 10:12:14AM -0800, Howard Chu wrote:
> > Collect tid, period, callchain, and cgroup id and dump them when off-cpu
> > time threshold is reached.
> >
> > We don't collect the off-cpu time twice (the delta), it's either in
> > direct samples, or accumulated samples that are dumped at the end of
> > perf.data.
> >
> > Suggested-by: Namhyung Kim <namhyung@kernel.org>
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > Cc: Adrian Hunter <adrian.hunter@intel.com>
> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Cc: Ian Rogers <irogers@google.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: James Clark <james.clark@linaro.org>
> > Cc: Jiri Olsa <jolsa@kernel.org>
> > Cc: Kan Liang <kan.liang@linux.intel.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Link: https://lore.kernel.org/r/20241108204137.2444151-6-howardchu95@gmail.com
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > ---
> > tools/perf/util/bpf_skel/off_cpu.bpf.c | 88 ++++++++++++++++++++++++--
> > 1 file changed, 83 insertions(+), 5 deletions(-)
> >
> > diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > index 1cdd4d63ea92..77fdc9e81db3 100644
> > --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > @@ -19,11 +19,17 @@
> > #define MAX_ENTRIES 102400
> >
> > #define MAX_CPUS 4096
> > +#define MAX_OFFCPU_LEN 37
> > +
> > +struct stack {
> > + u64 array[MAX_STACKS];
> > +};
>
> So I needed to rename the above as it fails 'make -C tools/perf
> build-test', the direct make command line to hit that:
>
> ⬢ [acme@toolbox perf-tools-next]$ rm -rf /tmp/build/$(basename $PWD)/ ; mkdir -p /tmp/build/$(basename $PWD)/
> ⬢ [acme@toolbox perf-tools-next]$ alias m='rm -rf ~/libexec/perf-core/ ; make -k GEN_VMLINUX_H=1 CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin && perf test python && cat /tmp/build/perf-tools-next/feature/test-all.output'
> ⬢ [acme@toolbox perf-tools-next]$ m
> GENSKEL /tmp/build/perf-tools-next/util/bpf_skel/sample_filter.skel.h
> util/bpf_skel/off_cpu.bpf.c:24:8: error: redefinition of 'stack'
> 24 | struct stack {
> | ^
> /tmp/build/perf-tools-next/util/bpf_skel/.tmp/../vmlinux.h:128096:8: note: previous definition is here
> 128096 | struct stack {
> | ^
> util/bpf_skel/off_cpu.bpf.c:218:42: error: no member named 'array' in 'struct stack'
> 218 | for (int i = 0; i < MAX_STACKS && from->array[i]; ++i, ++len)
> | ~~~~ ^
> util/bpf_skel/off_cpu.bpf.c:219:32: error: no member named 'array' in 'struct stack'
> 219 | to->array[n + 2 + i] = from->array[i];
> | ~~~~ ^
> 3 errors generated.
> make[2]: *** [Makefile.perf:1276: /tmp/build/perf-tools-next/util/bpf_skel/.tmp/off_cpu.bpf.o] Error 1
>
> So for now I'm adding the patch below to get it going.
>
> IIRC there was some discussion about ditching GEN_VMLINUX_H=1, we can
> remove this hack later if/when we progress on that.
>
> - Arnaldo
>
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index 77fdc9e81db395d1..848a123e5610f17b 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -21,7 +21,8 @@
> #define MAX_CPUS 4096
> #define MAX_OFFCPU_LEN 37
>
> -struct stack {
> +// We have a 'struct stack' in vmlinux.h when building with GEN_VMLINUX_H=1
> +struct __stack {
> u64 array[MAX_STACKS];
> };
>
> @@ -29,7 +30,7 @@ struct tstamp_data {
> __u32 stack_id;
> __u32 state;
> __u64 timestamp;
> - struct stack stack;
> + struct __stack stack;
> };
>
> struct offcpu_key {
> @@ -211,7 +212,7 @@ static inline int can_record(struct task_struct *t, int state)
> return 1;
> }
>
> -static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
> +static inline int copy_stack(struct __stack *from, struct offcpu_data *to, int n)
> {
> int len = 0;
>
> @@ -231,7 +232,7 @@ static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
> * information of the task, and dump it as a raw sample to perf ring buffer
> */
> static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
> - struct stack *stack, __u64 delta)
> + struct __stack *stack, __u64 delta)
> {
> int n = 0, len = 0;
Ack. Sorry about that.
Thanks,
Howard
>
>
> > struct tstamp_data {
> > __u32 stack_id;
> > __u32 state;
> > __u64 timestamp;
> > + struct stack stack;
> > };
> >
> > struct offcpu_key {
> > @@ -41,6 +47,10 @@ struct {
> > __uint(max_entries, MAX_ENTRIES);
> > } stacks SEC(".maps");
> >
> > +struct offcpu_data {
> > + u64 array[MAX_OFFCPU_LEN];
> > +};
> > +
> > struct {
> > __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> > __uint(key_size, sizeof(__u32));
> > @@ -48,6 +58,13 @@ struct {
> > __uint(max_entries, MAX_CPUS);
> > } offcpu_output SEC(".maps");
> >
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > + __uint(key_size, sizeof(__u32));
> > + __uint(value_size, sizeof(struct offcpu_data));
> > + __uint(max_entries, 1);
> > +} offcpu_payload SEC(".maps");
> > +
> > struct {
> > __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > __uint(map_flags, BPF_F_NO_PREALLOC);
> > @@ -106,6 +123,8 @@ const volatile bool uses_cgroup_v1 = false;
> >
> > int perf_subsys_id = -1;
> >
> > +__u64 offcpu_thresh_ns = 500000000ull;
> > +
> > /*
> > * Old kernel used to call it task_struct->state and now it's '__state'.
> > * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> > @@ -192,6 +211,47 @@ static inline int can_record(struct task_struct *t, int state)
> > return 1;
> > }
> >
> > +static inline int copy_stack(struct stack *from, struct offcpu_data *to, int n)
> > +{
> > + int len = 0;
> > +
> > + for (int i = 0; i < MAX_STACKS && from->array[i]; ++i, ++len)
> > + to->array[n + 2 + i] = from->array[i];
> > +
> > + return len;
> > +}
> > +
> > +/**
> > + * off_cpu_dump - dump off-cpu samples to ring buffer
> > + * @data: payload for dumping off-cpu samples
> > + * @key: off-cpu data
> > + * @stack: stack trace of the task before being scheduled out
> > + *
> > + * If the threshold of off-cpu time is reached, acquire tid, period, callchain, and cgroup id
> > + * information of the task, and dump it as a raw sample to perf ring buffer
> > + */
> > +static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
> > + struct stack *stack, __u64 delta)
> > +{
> > + int n = 0, len = 0;
> > +
> > + data->array[n++] = (u64)key->tgid << 32 | key->pid;
> > + data->array[n++] = delta;
> > +
> > + /* data->array[n] is callchain->nr (updated later) */
> > + data->array[n + 1] = PERF_CONTEXT_USER;
> > + data->array[n + 2] = 0;
> > + len = copy_stack(stack, data, n);
> > +
> > + /* update length of callchain */
> > + data->array[n] = len + 1;
> > + n += len + 2;
> > +
> > + data->array[n++] = key->cgroup_id;
> > +
> > + return bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, n * sizeof(u64));
> > +}
> > +
> > static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> > struct task_struct *next, int state)
> > {
> > @@ -216,6 +276,16 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> > pelem->state = state;
> > pelem->stack_id = stack_id;
> >
> > + /*
> > + * If stacks are successfully collected by bpf_get_stackid(), collect them once more
> > + * in task_storage for direct off-cpu sample dumping
> > + */
> > + if (stack_id > 0 && bpf_get_stack(ctx, &pelem->stack, MAX_STACKS * sizeof(u64), BPF_F_USER_STACK)) {
> > + /*
> > + * This empty if block is used to avoid 'result unused warning' from bpf_get_stack().
> > + * If the collection fails, continue with the logic for the next task.
> > + */
> > + }
> > next:
> > pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
> >
> > @@ -230,11 +300,19 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> > __u64 delta = ts - pelem->timestamp;
> > __u64 *total;
> >
> > - total = bpf_map_lookup_elem(&off_cpu, &key);
> > - if (total)
> > - *total += delta;
> > - else
> > - bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> > + if (delta >= offcpu_thresh_ns) {
> > + int zero = 0;
> > + struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
> > +
> > + if (data)
> > + off_cpu_dump(ctx, data, &key, &pelem->stack, delta);
> > + } else {
> > + total = bpf_map_lookup_elem(&off_cpu, &key);
> > + if (total)
> > + *total += delta;
> > + else
> > + bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> > + }
> >
> > /* prevent to reuse the timestamp later */
> > pelem->timestamp = 0;
> > --
> > 2.43.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v14 05/10] perf evsel: Assemble offcpu samples
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (3 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 04/10] perf record --off-cpu: Dump off-cpu samples in BPF Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2025-02-06 19:27 ` Arnaldo Carvalho de Melo
2024-12-15 18:12 ` [PATCH v14 06/10] perf record --off-cpu: Disable perf_event's callchain collection Howard Chu
` (5 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Use the data in bpf-output samples, to assemble offcpu samples. In
evsel__is_offcpu_event(), Check if sample_type is PERF_SAMPLE_RAW to
support off-cpu sample data created by an older version of perf.
Testing compatibility on offcpu samples collected by perf before this patch series:
See below, the sample_type still uses PERF_SAMPLE_CALLCHAIN
$ perf script --header -i ./perf.data.ptn | grep "event : name = offcpu-time"
# event : name = offcpu-time, , id = { 237917, 237918, 237919, 237920 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1
The output is correct.
$ perf script -i ./perf.data.ptn | grep offcpu-time
gmain 2173 [000] 18446744069.414584: 100102015 offcpu-time:
NetworkManager 901 [000] 18446744069.414584: 5603579 offcpu-time:
Web Content 1183550 [000] 18446744069.414584: 46278 offcpu-time:
gnome-control-c 2200559 [000] 18446744069.414584: 11998247014 offcpu-time:
And after this patch series:
$ perf script --header -i ./perf.data.off-cpu-v9 | grep "event : name = offcpu-time"
# event : name = offcpu-time, , id = { 237959, 237960, 237961, 237962 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1
perf $ ./perf script -i ./perf.data.off-cpu-v9 | grep offcpu-time
gnome-shell 1875 [001] 4789616.361225: 100097057 offcpu-time:
gnome-shell 1875 [001] 4789616.461419: 100107463 offcpu-time:
firefox 2206821 [002] 4789616.475690: 255257245 offcpu-time:
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-7-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/evsel.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index fc33104cd659..277b5babd63e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1244,7 +1244,8 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
bool evsel__is_offcpu_event(struct evsel *evsel)
{
- return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
+ return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT) &&
+ evsel->core.attr.sample_type & PERF_SAMPLE_RAW;
}
/*
@@ -2846,6 +2847,35 @@ static inline bool evsel__has_branch_counters(const struct evsel *evsel)
return false;
}
+static int __set_offcpu_sample(struct perf_sample *data)
+{
+ u64 *array = data->raw_data;
+ u32 max_size = data->raw_size, *p32;
+ const void *endp = (void *)array + max_size;
+
+ if (array == NULL)
+ return -EFAULT;
+
+ OVERFLOW_CHECK_u64(array);
+ p32 = (void *)array++;
+ data->pid = p32[0];
+ data->tid = p32[1];
+
+ OVERFLOW_CHECK_u64(array);
+ data->period = *array++;
+
+ OVERFLOW_CHECK_u64(array);
+ data->callchain = (struct ip_callchain *)array++;
+ OVERFLOW_CHECK(array, data->callchain->nr * sizeof(u64), max_size);
+ data->ip = data->callchain->ips[1];
+ array += data->callchain->nr;
+
+ OVERFLOW_CHECK_u64(array);
+ data->cgroup = *array;
+
+ return 0;
+}
+
int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
struct perf_sample *data)
{
@@ -3197,6 +3227,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
array = (void *)array + sz;
}
+ if (evsel__is_offcpu_event(evsel))
+ return __set_offcpu_sample(data);
+
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v14 05/10] perf evsel: Assemble offcpu samples
2024-12-15 18:12 ` [PATCH v14 05/10] perf evsel: Assemble offcpu samples Howard Chu
@ 2025-02-06 19:27 ` Arnaldo Carvalho de Melo
2025-02-06 22:08 ` Howard Chu
0 siblings, 1 reply; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-02-06 19:27 UTC (permalink / raw)
To: Howard Chu
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
On Sun, Dec 15, 2024 at 10:12:15AM -0800, Howard Chu wrote:
> Use the data in bpf-output samples, to assemble offcpu samples. In
> evsel__is_offcpu_event(), Check if sample_type is PERF_SAMPLE_RAW to
> support off-cpu sample data created by an older version of perf.
> Testing compatibility on offcpu samples collected by perf before this patch series:
> See below, the sample_type still uses PERF_SAMPLE_CALLCHAIN
Getting back from vacation, so bear with me, please :-)
So, while at this point:
⬢ [acme@toolbox perf-tools-next]$ git log --oneline -5
bfd7461c25a5e667 (HEAD) perf record --off-cpu: Dump off-cpu samples in BPF
e75f8ce63bfa6cb9 perf record --off-cpu: Preparation of off-cpu BPF program
0ffab9d26971c91c perf record --off-cpu: Parse off-cpu event
efc3fe2070853b7d perf evsel: Expose evsel__is_offcpu_event() for future use
357b965deba9fb71 (perf-tools-next/perf-tools-next) perf stat: Changes to event name uniquification
⬢ [acme@toolbox perf-tools-next]$
And trying to test this series I do:
root@number:~# perf -v
perf version 6.13.rc2.gbfd7461c25a5
root@number:~#
root@number:~# perf record --off-cpu sleep 5s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
root@number:~# perf evlist
cpu_atom/cycles/P
cpu_core/cycles/P
offcpu-time
dummy:u
root@number:~#
And:
root@number:~# perf evlist -v
cpu_atom/cycles/P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, enable_on_exec: 1, precise_ip: 3, sample_id_all: 1
cpu_core/cycles/P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, enable_on_exec: 1, precise_ip: 3, sample_id_all: 1
offcpu-time: type: 1 (software), size: 136, config: 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, enable_on_exec: 1, sample_id_all: 1
dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
root@number:~#
Above the "offcpu-time" event doesn't have the PERF_SAMPLE_CALLCHAIN in
sample_type, so I should be using some other 'perf record --off-cpu'
specific option?
The HEAD mentions:
"Collect tid, period, callchain, and cgroup id and dump them when off-cpu
time threshold is reached."
But that doesn't match the "offcpu-time" sample_type, right?
> $ perf script --header -i ./perf.data.ptn | grep "event : name = offcpu-time"
> # event : name = offcpu-time, , id = { 237917, 237918, 237919, 237920 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1
>
> The output is correct.
You mean because it has PERF_SAMPLE_CALLCHAIN in the sample_type?
> $ perf script -i ./perf.data.ptn | grep offcpu-time
> gmain 2173 [000] 18446744069.414584: 100102015 offcpu-time:
> NetworkManager 901 [000] 18446744069.414584: 5603579 offcpu-time:
> Web Content 1183550 [000] 18446744069.414584: 46278 offcpu-time:
> gnome-control-c 2200559 [000] 18446744069.414584: 11998247014 offcpu-time:
> And after this patch series:
After the _whole_ series? Or at this point in the series?
I'm trying to test it patch by patch, to see from output from 'perf
report -D', perf script, etc that the patch descriptions match what the
code does and then double checking by trying to reproduce it, step by
step.
I'll try to do it at the end of the series, but thought about trowing in
these questions to help me understand your work.
Thanks!
- Arnaldo
> $ perf script --header -i ./perf.data.off-cpu-v9 | grep "event : name = offcpu-time"
> # event : name = offcpu-time, , id = { 237959, 237960, 237961, 237962 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1
>
> perf $ ./perf script -i ./perf.data.off-cpu-v9 | grep offcpu-time
> gnome-shell 1875 [001] 4789616.361225: 100097057 offcpu-time:
> gnome-shell 1875 [001] 4789616.461419: 100107463 offcpu-time:
> firefox 2206821 [002] 4789616.475690: 255257245 offcpu-time:
>
> Suggested-by: Namhyung Kim <namhyung@kernel.org>
> Reviewed-by: Ian Rogers <irogers@google.com>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Clark <james.clark@linaro.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Kan Liang <kan.liang@linux.intel.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20241108204137.2444151-7-howardchu95@gmail.com
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
> tools/perf/util/evsel.c | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index fc33104cd659..277b5babd63e 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1244,7 +1244,8 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
>
> bool evsel__is_offcpu_event(struct evsel *evsel)
> {
> - return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
> + return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT) &&
> + evsel->core.attr.sample_type & PERF_SAMPLE_RAW;
> }
>
> /*
> @@ -2846,6 +2847,35 @@ static inline bool evsel__has_branch_counters(const struct evsel *evsel)
> return false;
> }
>
> +static int __set_offcpu_sample(struct perf_sample *data)
> +{
> + u64 *array = data->raw_data;
> + u32 max_size = data->raw_size, *p32;
> + const void *endp = (void *)array + max_size;
> +
> + if (array == NULL)
> + return -EFAULT;
> +
> + OVERFLOW_CHECK_u64(array);
> + p32 = (void *)array++;
> + data->pid = p32[0];
> + data->tid = p32[1];
> +
> + OVERFLOW_CHECK_u64(array);
> + data->period = *array++;
> +
> + OVERFLOW_CHECK_u64(array);
> + data->callchain = (struct ip_callchain *)array++;
> + OVERFLOW_CHECK(array, data->callchain->nr * sizeof(u64), max_size);
> + data->ip = data->callchain->ips[1];
> + array += data->callchain->nr;
> +
> + OVERFLOW_CHECK_u64(array);
> + data->cgroup = *array;
> +
> + return 0;
> +}
> +
> int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> struct perf_sample *data)
> {
> @@ -3197,6 +3227,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> array = (void *)array + sz;
> }
>
> + if (evsel__is_offcpu_event(evsel))
> + return __set_offcpu_sample(data);
> +
> return 0;
> }
>
> --
> 2.43.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v14 05/10] perf evsel: Assemble offcpu samples
2025-02-06 19:27 ` Arnaldo Carvalho de Melo
@ 2025-02-06 22:08 ` Howard Chu
0 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2025-02-06 22:08 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Hello Arnaldo,
On Thu, Feb 6, 2025 at 11:27 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Sun, Dec 15, 2024 at 10:12:15AM -0800, Howard Chu wrote:
> > Use the data in bpf-output samples, to assemble offcpu samples. In
> > evsel__is_offcpu_event(), Check if sample_type is PERF_SAMPLE_RAW to
> > support off-cpu sample data created by an older version of perf.
>
> > Testing compatibility on offcpu samples collected by perf before this patch series:
>
> > See below, the sample_type still uses PERF_SAMPLE_CALLCHAIN
>
> Getting back from vacation, so bear with me, please :-)
Good to have you back :), no worries at all.
>
> So, while at this point:
>
> ⬢ [acme@toolbox perf-tools-next]$ git log --oneline -5
> bfd7461c25a5e667 (HEAD) perf record --off-cpu: Dump off-cpu samples in BPF
> e75f8ce63bfa6cb9 perf record --off-cpu: Preparation of off-cpu BPF program
> 0ffab9d26971c91c perf record --off-cpu: Parse off-cpu event
> efc3fe2070853b7d perf evsel: Expose evsel__is_offcpu_event() for future use
> 357b965deba9fb71 (perf-tools-next/perf-tools-next) perf stat: Changes to event name uniquification
> ⬢ [acme@toolbox perf-tools-next]$
>
> And trying to test this series I do:
>
> root@number:~# perf -v
> perf version 6.13.rc2.gbfd7461c25a5
> root@number:~#
>
> root@number:~# perf record --off-cpu sleep 5s
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> root@number:~# perf evlist
> cpu_atom/cycles/P
> cpu_core/cycles/P
> offcpu-time
> dummy:u
> root@number:~#
>
> And:
>
> root@number:~# perf evlist -v
> cpu_atom/cycles/P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, enable_on_exec: 1, precise_ip: 3, sample_id_all: 1
> cpu_core/cycles/P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, enable_on_exec: 1, precise_ip: 3, sample_id_all: 1
> offcpu-time: type: 1 (software), size: 136, config: 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, enable_on_exec: 1, sample_id_all: 1
> dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
> root@number:~#
>
> Above the "offcpu-time" event doesn't have the PERF_SAMPLE_CALLCHAIN in
> sample_type, so I should be using some other 'perf record --off-cpu'
> specific option?
>
> The HEAD mentions:
>
> "Collect tid, period, callchain, and cgroup id and dump them when off-cpu
> time threshold is reached."
These are collected in BPF, not from perf_event. I can change the
commit message for clarity if you’d like.
>
> But that doesn't match the "offcpu-time" sample_type, right?
Nope, I'm sorry. If offcpu-time has PERF_SAMPLE_CALLCHAIN, it will
collect the wrong call chain.
Say I have task1 scheduling in and task2 scheduling out. For off-CPU
analysis, I want to save task2's call chain and check if task1 has
been scheduled out for longer than a threshold. If it has, a sample
for task1 should be emitted.
When a sched_switch happens, the call chain belongs to task2, and perf
should only emit task1's call chain if there is one. Therefore,
PERF_SAMPLE_CALLCHAIN, which collects an instant stack call chain at
sched_switch, cannot be used.
>
> > $ perf script --header -i ./perf.data.ptn | grep "event : name = offcpu-time"
> > # event : name = offcpu-time, , id = { 237917, 237918, 237919, 237920 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1
> >
> > The output is correct.
>
> You mean because it has PERF_SAMPLE_CALLCHAIN in the sample_type?
The motivation for this patch is to add backward compatibility, i.e.,
to ensure that perf script processes the previously collected off-cpu
perf.data correctly before the change to offcpu-time's sample_type.
>
> > $ perf script -i ./perf.data.ptn | grep offcpu-time
> > gmain 2173 [000] 18446744069.414584: 100102015 offcpu-time:
> > NetworkManager 901 [000] 18446744069.414584: 5603579 offcpu-time:
> > Web Content 1183550 [000] 18446744069.414584: 46278 offcpu-time:
> > gnome-control-c 2200559 [000] 18446744069.414584: 11998247014 offcpu-time:
>
> > And after this patch series:
>
> After the _whole_ series? Or at this point in the series?
I should mention that the two perf.data files I used in the commit
message, perf.data.ptn and perf.data.off-cpu-v9, were collected from
perf without this series and perf with the whole series, respectively
(since only then can I use the --off-cpu-thresh option).
Again, I can add an explanation to the commit message if you’d like.
Sorry for the confusion.
>
> I'm trying to test it patch by patch, to see from output from 'perf
> report -D', perf script, etc that the patch descriptions match what the
> code does and then double checking by trying to reproduce it, step by
> step.
Thank you very much, talking about meticulousness :)
>
> I'll try to do it at the end of the series, but thought about trowing in
> these questions to help me understand your work.
Sure! Throw me some more if you want, because it has been some time.
Thanks,
Howard
>
> Thanks!
>
> - Arnaldo
>
> > $ perf script --header -i ./perf.data.off-cpu-v9 | grep "event : name = offcpu-time"
> > # event : name = offcpu-time, , id = { 237959, 237960, 237961, 237962 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1
> >
> > perf $ ./perf script -i ./perf.data.off-cpu-v9 | grep offcpu-time
> > gnome-shell 1875 [001] 4789616.361225: 100097057 offcpu-time:
> > gnome-shell 1875 [001] 4789616.461419: 100107463 offcpu-time:
> > firefox 2206821 [002] 4789616.475690: 255257245 offcpu-time:
> >
> > Suggested-by: Namhyung Kim <namhyung@kernel.org>
> > Reviewed-by: Ian Rogers <irogers@google.com>
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > Cc: Adrian Hunter <adrian.hunter@intel.com>
> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: James Clark <james.clark@linaro.org>
> > Cc: Jiri Olsa <jolsa@kernel.org>
> > Cc: Kan Liang <kan.liang@linux.intel.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Link: https://lore.kernel.org/r/20241108204137.2444151-7-howardchu95@gmail.com
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > ---
> > tools/perf/util/evsel.c | 35 ++++++++++++++++++++++++++++++++++-
> > 1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index fc33104cd659..277b5babd63e 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -1244,7 +1244,8 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
> >
> > bool evsel__is_offcpu_event(struct evsel *evsel)
> > {
> > - return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
> > + return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT) &&
> > + evsel->core.attr.sample_type & PERF_SAMPLE_RAW;
> > }
> >
> > /*
> > @@ -2846,6 +2847,35 @@ static inline bool evsel__has_branch_counters(const struct evsel *evsel)
> > return false;
> > }
> >
> > +static int __set_offcpu_sample(struct perf_sample *data)
> > +{
> > + u64 *array = data->raw_data;
> > + u32 max_size = data->raw_size, *p32;
> > + const void *endp = (void *)array + max_size;
> > +
> > + if (array == NULL)
> > + return -EFAULT;
> > +
> > + OVERFLOW_CHECK_u64(array);
> > + p32 = (void *)array++;
> > + data->pid = p32[0];
> > + data->tid = p32[1];
> > +
> > + OVERFLOW_CHECK_u64(array);
> > + data->period = *array++;
> > +
> > + OVERFLOW_CHECK_u64(array);
> > + data->callchain = (struct ip_callchain *)array++;
> > + OVERFLOW_CHECK(array, data->callchain->nr * sizeof(u64), max_size);
> > + data->ip = data->callchain->ips[1];
> > + array += data->callchain->nr;
> > +
> > + OVERFLOW_CHECK_u64(array);
> > + data->cgroup = *array;
> > +
> > + return 0;
> > +}
> > +
> > int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> > struct perf_sample *data)
> > {
> > @@ -3197,6 +3227,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> > array = (void *)array + sz;
> > }
> >
> > + if (evsel__is_offcpu_event(evsel))
> > + return __set_offcpu_sample(data);
> > +
> > return 0;
> > }
> >
> > --
> > 2.43.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v14 06/10] perf record --off-cpu: Disable perf_event's callchain collection
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (4 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 05/10] perf evsel: Assemble offcpu samples Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 07/10] perf script: Display off-cpu samples correctly Howard Chu
` (4 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
There is a check in evsel.c that does this:
if (evsel__is_offcpu_event(evsel))
evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
This along with:
#define OFFCPU_SAMPLE_TYPES (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
PERF_SAMPLE_CGROUP)
will tell perf_event to collect callchain.
We don't need the callchain from perf_event when collecting off-cpu
samples, because it's prev's callchain, not next's callchain.
(perf_event) (task_storage) (needed)
prev next
| |
---sched_switch---->
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-8-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/off_cpu.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2dd67c60f211..2a4b7f9b2c4c 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -13,7 +13,7 @@ struct record_opts;
#define OFFCPU_SAMPLE_TYPES (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
- PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
+ PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
PERF_SAMPLE_CGROUP)
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 07/10] perf script: Display off-cpu samples correctly
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (5 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 06/10] perf record --off-cpu: Disable perf_event's callchain collection Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map Howard Chu
` (3 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
No PERF_SAMPLE_CALLCHAIN in sample_type, but I need perf script to
display a callchain, have to specify manually.
Also, prefer displaying a callchain:
gvfs-afc-volume 2267 [001] 3829232.955656: 1001115340 offcpu-time:
77f05292603f __pselect+0xbf (/usr/lib/x86_64-linux-gnu/libc.so.6)
77f052a1801c [unknown] (/usr/lib/x86_64-linux-gnu/libusbmuxd-2.0.so.6.0.0)
77f052a18d45 [unknown] (/usr/lib/x86_64-linux-gnu/libusbmuxd-2.0.so.6.0.0)
77f05289ca94 start_thread+0x384 (/usr/lib/x86_64-linux-gnu/libc.so.6)
77f052929c3c clone3+0x2c (/usr/lib/x86_64-linux-gnu/libc.so.6)
to a raw binary BPF output:
BPF output: 0000: dd 08 00 00 db 08 00 00 <DD>...<DB>...
0008: cc ce ab 3b 00 00 00 00 <CC>Ϋ;....
0010: 06 00 00 00 00 00 00 00 ........
0018: 00 fe ff ff ff ff ff ff .<FE><FF><FF><FF><FF><FF><FF>
0020: 3f 60 92 52 f0 77 00 00 ?`.R<F0>w..
0028: 1c 80 a1 52 f0 77 00 00 ..<A1>R<F0>w..
0030: 45 8d a1 52 f0 77 00 00 E.<A1>R<F0>w..
0038: 94 ca 89 52 f0 77 00 00 .<CA>.R<F0>w..
0040: 3c 9c 92 52 f0 77 00 00 <..R<F0>w..
0048: 00 00 00 00 00 00 00 00 ........
0050: 00 00 00 00 ....
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-9-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/builtin-script.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index aad607b8918f..87b2cedc1cbc 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -662,7 +662,7 @@ static int perf_session__check_output_opt(struct perf_session *session)
evlist__for_each_entry(session->evlist, evsel) {
not_pipe = true;
- if (evsel__has_callchain(evsel)) {
+ if (evsel__has_callchain(evsel) || evsel__is_offcpu_event(evsel)) {
use_callchain = true;
break;
}
@@ -2358,7 +2358,7 @@ static void process_event(struct perf_script *script,
else if (PRINT_FIELD(BRSTACKOFF))
perf_sample__fprintf_brstackoff(sample, thread, attr, fp);
- if (evsel__is_bpf_output(evsel) && PRINT_FIELD(BPF_OUTPUT))
+ if (evsel__is_bpf_output(evsel) && !evsel__is_offcpu_event(evsel) && PRINT_FIELD(BPF_OUTPUT))
perf_sample__fprintf_bpf_output(sample, fp);
perf_sample__fprintf_insn(sample, evsel, attr, thread, machine, fp, al);
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (6 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 07/10] perf script: Display off-cpu samples correctly Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 09/10] perf record --off-cpu: Add --off-cpu-thresh option Howard Chu
` (2 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Dump the remaining samples, as if it is dumping a direct sample.
Put the stack trace, tid, off-cpu time and cgroup id into the raw_data
section, just like a direct off-cpu sample coming from BPF's
bpf_perf_event_output().
This ensures that evsel__parse_sample() correctly parses both direct
samples and accumulated samples.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-10-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/util/bpf_off_cpu.c | 59 +++++++++++++++++++++--------------
1 file changed, 35 insertions(+), 24 deletions(-)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 84c3bf6a9614..2e7e4ae43ffc 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -37,6 +37,8 @@ union off_cpu_data {
u64 array[1024 / sizeof(u64)];
};
+u64 off_cpu_raw[MAX_STACKS + 5];
+
static int off_cpu_config(struct evlist *evlist)
{
char off_cpu_event[64];
@@ -307,6 +309,7 @@ int off_cpu_write(struct perf_session *session)
{
int bytes = 0, size;
int fd, stack;
+ u32 raw_size;
u64 sample_type, val, sid = 0;
struct evsel *evsel;
struct perf_data_file *file = &session->data->file;
@@ -346,46 +349,54 @@ int off_cpu_write(struct perf_session *session)
while (!bpf_map_get_next_key(fd, &prev, &key)) {
int n = 1; /* start from perf_event_header */
- int ip_pos = -1;
bpf_map_lookup_elem(fd, &key, &val);
+ /* zero-fill some of the fields, will be overwritten by raw_data when parsing */
if (sample_type & PERF_SAMPLE_IDENTIFIER)
data.array[n++] = sid;
- if (sample_type & PERF_SAMPLE_IP) {
- ip_pos = n;
+ if (sample_type & PERF_SAMPLE_IP)
data.array[n++] = 0; /* will be updated */
- }
if (sample_type & PERF_SAMPLE_TID)
- data.array[n++] = (u64)key.pid << 32 | key.tgid;
+ data.array[n++] = 0;
if (sample_type & PERF_SAMPLE_TIME)
data.array[n++] = tstamp;
- if (sample_type & PERF_SAMPLE_ID)
- data.array[n++] = sid;
if (sample_type & PERF_SAMPLE_CPU)
data.array[n++] = 0;
if (sample_type & PERF_SAMPLE_PERIOD)
- data.array[n++] = val;
- if (sample_type & PERF_SAMPLE_CALLCHAIN) {
- int len = 0;
-
- /* data.array[n] is callchain->nr (updated later) */
- data.array[n + 1] = PERF_CONTEXT_USER;
- data.array[n + 2] = 0;
-
- bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
- while (data.array[n + 2 + len])
+ data.array[n++] = 0;
+ if (sample_type & PERF_SAMPLE_RAW) {
+ /*
+ * [ size ][ data ]
+ * [ data ]
+ * [ data ]
+ * [ data ]
+ * [ data ][ empty]
+ */
+ int len = 0, i = 0;
+ void *raw_data = (void *)data.array + n * sizeof(u64);
+
+ off_cpu_raw[i++] = (u64)key.pid << 32 | key.tgid;
+ off_cpu_raw[i++] = val;
+
+ /* off_cpu_raw[i] is callchain->nr (updated later) */
+ off_cpu_raw[i + 1] = PERF_CONTEXT_USER;
+ off_cpu_raw[i + 2] = 0;
+
+ bpf_map_lookup_elem(stack, &key.stack_id, &off_cpu_raw[i + 2]);
+ while (off_cpu_raw[i + 2 + len])
len++;
- /* update length of callchain */
- data.array[n] = len + 1;
+ off_cpu_raw[i] = len + 1;
+ i += len + 2;
+
+ off_cpu_raw[i++] = key.cgroup_id;
- /* update sample ip with the first callchain entry */
- if (ip_pos >= 0)
- data.array[ip_pos] = data.array[n + 2];
+ raw_size = i * sizeof(u64) + sizeof(u32); /* 4 bytes for alignment */
+ memcpy(raw_data, &raw_size, sizeof(raw_size));
+ memcpy(raw_data + sizeof(u32), off_cpu_raw, i * sizeof(u64));
- /* calculate sample callchain data array length */
- n += len + 2;
+ n += i + 1;
}
if (sample_type & PERF_SAMPLE_CGROUP)
data.array[n++] = key.cgroup_id;
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 09/10] perf record --off-cpu: Add --off-cpu-thresh option
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (7 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-15 18:12 ` [PATCH v14 10/10] perf test: Add direct off-cpu test Howard Chu
2024-12-16 20:11 ` [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Namhyung Kim
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Arnaldo Carvalho de Melo, Ingo Molnar, James Clark,
Peter Zijlstra
Specify the threshold for dumping offcpu samples with --off-cpu-thresh,
the unit is milliseconds. Default value is 500ms.
Example:
perf record --off-cpu --off-cpu-thresh 824
The example above collects off-cpu samples where the off-cpu time is
longer than 824ms
Suggested-by: Ian Rogers <irogers@google.com>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-2-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/Documentation/perf-record.txt | 9 ++++++++
tools/perf/builtin-record.c | 26 ++++++++++++++++++++++++
tools/perf/util/bpf_off_cpu.c | 3 +++
tools/perf/util/bpf_skel/off_cpu.bpf.c | 2 +-
tools/perf/util/off_cpu.h | 1 +
tools/perf/util/record.h | 1 +
6 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 242223240a08..f3ac4c739d5f 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -829,6 +829,15 @@ filtered through the mask provided by -C option.
only, as of now. So the applications built without the frame
pointer might see bogus addresses.
+ off-cpu profiling consists two types of samples: direct samples, which
+ share the same behavior as regular samples, and the accumulated
+ samples, stored in BPF stack trace map, presented after all the regular
+ samples.
+
+--off-cpu-thresh::
+ Once a task's off-cpu time reaches this threshold (in milliseconds), it
+ generates a direct off-cpu sample. The default is 500ms.
+
--setup-filter=<action>::
Prepare BPF filter to be used by regular users. The action should be
either "pin" or "unpin". The filter can be used after it's pinned.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0b637cea4850..dca343b78825 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3147,6 +3147,28 @@ static int record__parse_mmap_pages(const struct option *opt,
return ret;
}
+static int record__parse_off_cpu_thresh(const struct option *opt,
+ const char *str,
+ int unset __maybe_unused)
+{
+ struct record_opts *opts = opt->value;
+ char *endptr;
+ u64 off_cpu_thresh_ms;
+
+ if (!str)
+ return -EINVAL;
+
+ off_cpu_thresh_ms = strtoull(str, &endptr, 10);
+
+ /* the threshold isn't string "0", yet strtoull() returns 0, parsing failed */
+ if (*endptr || (off_cpu_thresh_ms == 0 && strcmp(str, "0")))
+ return -EINVAL;
+ else
+ opts->off_cpu_thresh_ns = off_cpu_thresh_ms * NSEC_PER_MSEC;
+
+ return 0;
+}
+
void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
{
}
@@ -3340,6 +3362,7 @@ static struct record record = {
.ctl_fd = -1,
.ctl_fd_ack = -1,
.synth = PERF_SYNTH_ALL,
+ .off_cpu_thresh_ns = OFFCPU_THRESH,
},
};
@@ -3562,6 +3585,9 @@ static struct option __record_options[] = {
OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
"BPF filter action"),
+ OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "ms",
+ "Dump off-cpu samples if off-cpu time exceeds this threshold (in milliseconds). (Default: 500ms)",
+ record__parse_off_cpu_thresh),
OPT_END()
};
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 2e7e4ae43ffc..188cbb051470 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -14,6 +14,7 @@
#include "util/strlist.h"
#include <bpf/bpf.h>
#include <internal/xyarray.h>
+#include <linux/time64.h>
#include "bpf_skel/off_cpu.skel.h"
@@ -286,6 +287,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
}
}
+ skel->bss->offcpu_thresh_ns = opts->off_cpu_thresh_ns;
+
err = off_cpu_bpf__attach(skel);
if (err) {
pr_err("Failed to attach off-cpu BPF skeleton\n");
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 77fdc9e81db3..aae63d999abb 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -123,7 +123,7 @@ const volatile bool uses_cgroup_v1 = false;
int perf_subsys_id = -1;
-__u64 offcpu_thresh_ns = 500000000ull;
+__u64 offcpu_thresh_ns;
/*
* Old kernel used to call it task_struct->state and now it's '__state'.
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2a4b7f9b2c4c..64bf763ddf50 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -16,6 +16,7 @@ struct record_opts;
PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
PERF_SAMPLE_CGROUP)
+#define OFFCPU_THRESH 500000000ULL
#ifdef HAVE_BPF_SKEL
int off_cpu_prepare(struct evlist *evlist, struct target *target,
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..c82db4833b0a 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -79,6 +79,7 @@ struct record_opts {
int synth;
int threads_spec;
const char *threads_user_spec;
+ u64 off_cpu_thresh_ns;
};
extern const char * const *record_usage;
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v14 10/10] perf test: Add direct off-cpu test
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (8 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 09/10] perf record --off-cpu: Add --off-cpu-thresh option Howard Chu
@ 2024-12-15 18:12 ` Howard Chu
2024-12-16 20:11 ` [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Namhyung Kim
10 siblings, 0 replies; 23+ messages in thread
From: Howard Chu @ 2024-12-15 18:12 UTC (permalink / raw)
To: acme
Cc: namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
Howard Chu, Ingo Molnar, James Clark, Peter Zijlstra,
Arnaldo Carvalho de Melo
Why is there a --off-cpu-thresh 2000?
We collect an off-cpu period __ONLY ONCE__, either in direct sample form,
or in accumulated form (in BPF stack trace map).
If I don't add --off-cpu-thresh 2000, the sample in the original test
goes into the ring buffer instead of the BPF stack trace map.
Additionally, when using -e dummy, the ring buffer is not open, causing
us to lose a sample.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-11-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/record_offcpu.sh | 35 ++++++++++++++++++++++++-
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 1 +
tools/perf/tests/workloads/offcpu.c | 16 +++++++++++
5 files changed, 53 insertions(+), 1 deletion(-)
create mode 100644 tools/perf/tests/workloads/offcpu.c
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index a5b9ccd0033a..261325a1088b 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -144,6 +144,7 @@ static struct test_workload *workloads[] = {
&workload__brstack,
&workload__datasym,
&workload__landlock,
+ &workload__offcpu,
};
#define workloads__for_each(workload) \
diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
index 678947fe69ee..73ce16d61b5d 100755
--- a/tools/perf/tests/shell/record_offcpu.sh
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -6,6 +6,10 @@ set -e
err=0
perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
+TEST_PROGRAM="perf test -w offcpu"
+
+ts=$(printf "%u" $((~0 << 32))) # OFF_CPU_TIMESTAMP
+dummy_timestamp=${ts%???} # remove the last 3 digits to match perf script
cleanup() {
rm -f ${perfdata}
@@ -39,7 +43,11 @@ test_offcpu_priv() {
test_offcpu_basic() {
echo "Basic off-cpu test"
- if ! perf record --off-cpu -e dummy -o ${perfdata} sleep 1 2> /dev/null
+ # We collect an off-cpu period __ONLY ONCE__, either in direct sample form, or in accumulated form
+ # (in BPF stack trace map). Without the --off-cpu-thresh 2000 below, the sample will go into the
+ # ring buffer instead of the BPF stack trace map. Additionally, when using -e dummy, the ring
+ # buffer is not enabled, resulting in a lost sample.
+ if ! perf record --off-cpu --off-cpu-thresh 2000 -e dummy -o ${perfdata} sleep 1 2> /dev/null
then
echo "Basic off-cpu test [Failed record]"
err=1
@@ -88,6 +96,27 @@ test_offcpu_child() {
echo "Child task off-cpu test [Success]"
}
+test_offcpu_direct() {
+ echo "Direct off-cpu test"
+
+ # dump off-cpu samples for task blocked for more than 1.999s
+ # -D for initial delay, to enable evlist
+ if ! perf record -e dummy -D 500 --off-cpu --off-cpu-thresh 1999 -o ${perfdata} ${TEST_PROGRAM} 2> /dev/null
+ then
+ echo "Direct off-cpu test [Failed record]"
+ err=1
+ return
+ fi
+ # Direct sample's timestamp should be lower than the dummy_timestamp of the at-the-end sample.
+ if ! perf script -i ${perfdata} -F time,period | sed "s/[\.:]//g" | \
+ awk "{ if (\$1 < ${dummy_timestamp} && \$2 > 1999999999) exit 0; else exit 1; }"
+ then
+ echo "Direct off-cpu test [Failed missing direct sample]"
+ err=1
+ return
+ fi
+ echo "Direct off-cpu test [Success]"
+}
test_offcpu_priv
@@ -99,5 +128,9 @@ if [ $err = 0 ]; then
test_offcpu_child
fi
+if [ $err = 0 ]; then
+ test_offcpu_direct
+fi
+
cleanup
exit $err
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index cb58b43aa063..2e655a617b30 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -217,6 +217,7 @@ DECLARE_WORKLOAD(sqrtloop);
DECLARE_WORKLOAD(brstack);
DECLARE_WORKLOAD(datasym);
DECLARE_WORKLOAD(landlock);
+DECLARE_WORKLOAD(offcpu);
extern const char *dso_to_test;
extern const char *test_objdump_path;
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index 5af17206f04d..0e78fd01eaf1 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -7,6 +7,7 @@ perf-test-y += sqrtloop.o
perf-test-y += brstack.o
perf-test-y += datasym.o
perf-test-y += landlock.o
+perf-test-y += offcpu.o
CFLAGS_sqrtloop.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
CFLAGS_leafloop.o = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/offcpu.c b/tools/perf/tests/workloads/offcpu.c
new file mode 100644
index 000000000000..57cee201a4c3
--- /dev/null
+++ b/tools/perf/tests/workloads/offcpu.c
@@ -0,0 +1,16 @@
+#include <linux/compiler.h>
+#include <unistd.h>
+#include "../tests.h"
+
+static int offcpu(int argc __maybe_unused, const char **argv __maybe_unused)
+{
+ /* get past the initial delay */
+ sleep(1);
+
+ /* what we want to collect as a direct sample */
+ sleep(2);
+
+ return 0;
+}
+
+DEFINE_WORKLOAD(offcpu);
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly
2024-12-15 18:12 [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
` (9 preceding siblings ...)
2024-12-15 18:12 ` [PATCH v14 10/10] perf test: Add direct off-cpu test Howard Chu
@ 2024-12-16 20:11 ` Namhyung Kim
2024-12-16 20:49 ` Howard Chu
2025-02-11 18:02 ` Arnaldo Carvalho de Melo
10 siblings, 2 replies; 23+ messages in thread
From: Namhyung Kim @ 2024-12-16 20:11 UTC (permalink / raw)
To: Howard Chu
Cc: acme, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel
On Sun, Dec 15, 2024 at 10:12:10AM -0800, Howard Chu wrote:
> Changes in v14:
> - Change the internal off_cpu_thresh_us to off_cpu_thresh_ns, i.e. use
> nsec instead of usec
>
> Changes in v13:
> - Move the definition of 'off_cpu_thresh_ns' to the same commit as
> dumping off-cpu samples in BPF, and give off_cpu_thresh_ns a default
> value before the --off-cpu-thresh option is parsed.
>
> Changes in v12:
> - Restore patches' bisectability, because the ordering of patches has
> changed.
> - Change 'us = ms * 1000' to 'us = ms * USEC_PER_MSEC'
>
> Changes in v11:
> - Modify the options used in the off-cpu tests, as I changed the unit
> of the off-cpu threshold to milliseconds.
>
> Changes in v10:
> - Move the commit "perf record --off-cpu: Add --off-cpu-thresh option"
> to where the direct sample feature is completed.
> - Make --off-cpu-thresh use milliseconds as the unit.
>
> Changes in v9:
> - Add documentation for the new option '--off-cpu-thresh', and include
> an example of its usage in the commit message
> - Set inherit in evsel__config() to prevent future modifications
> - Support off-cpu sample data collected by perf before this patch series
>
> Changes in v8:
> - Make this series bisectable
> - Rename off_cpu_thresh to off_cpu_thresh_us and offcpu_thresh (in BPF)
> to offcpu_thresh_ns for clarity
> - Add commit messages to 'perf evsel: Expose evsel__is_offcpu_event()
> for future use' commit
> - Correct spelling mistakes in the commit message (s/is should be/should be/)
> - Add kernel-doc comments to off_cpu_dump(), and comments to the empty
> if block
> - Add some comments to off-cpu test
> - Delete an unused variable 'timestamp' in off_cpu_dump()
>
> Changes in v7:
> - Make off-cpu event system-wide
> - Use strtoull instead of strtoul
> - Delete unused variable such as sample_id, and sample_type
> - Use i as index to update BPF perf_event map
> - MAX_OFFCPU_LEN 128 is too big, make it smaller.
> - Delete some bound check as it's always guaranteed
> - Do not set ip_pos in BPF
> - Add a new field for storing stack traces in the tstamp map
> - Dump the off-cpu sample directly or save it in the off_cpu map, not both
> - Delete the sample_type_off_cpu check
> - Use __set_off_cpu_sample() to parse samples instead of a two-pass parsing
>
> Changes in v6:
> - Make patches bisectable
>
> Changes in v5:
> - Delete unnecessary copy in BPF program
> - Remove sample_embed from perf header, hard code off-cpu stuff instead
> - Move evsel__is_offcpu_event() to evsel.h
> - Minor changes to the test
> - Edit some comments
>
> Changes in v4:
> - Minimize the size of data output by perf_event_output()
> - Keep only one off-cpu event
> - Change off-cpu threshold's unit to microseconds
> - Set a default off-cpu threshold
> - Print the correct error message for the field 'embed' in perf data header
>
> Changes in v3:
> - Add off-cpu-thresh argument
> - Process direct off-cpu samples in post
>
> Changes in v2:
> - Remove unnecessary comments.
> - Rename function off_cpu_change_type to off_cpu_prepare_parse
>
> v1:
>
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
>
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples being after the regular samples. This patch
> series makes possible dumping off-cpu samples on-the-fly, directly into
> perf ring buffer. And it dispatches those samples to the correct format
> for perf.data consumers.
>
> Before:
> ```
> migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
>
> sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
> 79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> 585690935cca [unknown] (/usr/bin/sshd)
> ```
>
> After:
> ```
> perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
> 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 7fff24e862d8 [unknown] ([unknown])
>
>
> blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
> 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> 7fff24e862d8 [unknown] ([unknown])
>
>
> swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> ```
>
> Howard Chu (10):
> perf evsel: Expose evsel__is_offcpu_event() for future use
> perf record --off-cpu: Parse off-cpu event
> perf record --off-cpu: Preparation of off-cpu BPF program
> perf record --off-cpu: Dump off-cpu samples in BPF
> perf evsel: Assemble offcpu samples
> perf record --off-cpu: Disable perf_event's callchain collection
> perf script: Display off-cpu samples correctly
> perf record --off-cpu: Dump the remaining samples in BPF's stack trace
> map
> perf record --off-cpu: Add --off-cpu-thresh option
> perf test: Add direct off-cpu test
Thanks for your work and the persistence.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Thanks,
Namhyung
>
> tools/perf/Documentation/perf-record.txt | 9 ++
> tools/perf/builtin-record.c | 26 +++++
> tools/perf/builtin-script.c | 4 +-
> tools/perf/tests/builtin-test.c | 1 +
> tools/perf/tests/shell/record_offcpu.sh | 35 ++++++-
> tools/perf/tests/tests.h | 1 +
> tools/perf/tests/workloads/Build | 1 +
> tools/perf/tests/workloads/offcpu.c | 16 +++
> tools/perf/util/bpf_off_cpu.c | 118 ++++++++++++++---------
> tools/perf/util/bpf_skel/off_cpu.bpf.c | 97 ++++++++++++++++++-
> tools/perf/util/evsel.c | 41 +++++++-
> tools/perf/util/evsel.h | 2 +
> tools/perf/util/off_cpu.h | 3 +-
> tools/perf/util/record.h | 1 +
> 14 files changed, 297 insertions(+), 58 deletions(-)
> create mode 100644 tools/perf/tests/workloads/offcpu.c
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly
2024-12-16 20:11 ` [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Namhyung Kim
@ 2024-12-16 20:49 ` Howard Chu
[not found] ` <CA+JHD92xXH=6K0-imdObf+dw3=B0uiGdFH16DO3ArdHv3r1Zzg@mail.gmail.com>
2025-02-11 18:02 ` Arnaldo Carvalho de Melo
1 sibling, 1 reply; 23+ messages in thread
From: Howard Chu @ 2024-12-16 20:49 UTC (permalink / raw)
To: Namhyung Kim
Cc: acme, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel
Hello Namhyung,
On Mon, Dec 16, 2024 at 12:11 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Sun, Dec 15, 2024 at 10:12:10AM -0800, Howard Chu wrote:
> > Changes in v14:
> > - Change the internal off_cpu_thresh_us to off_cpu_thresh_ns, i.e. use
> > nsec instead of usec
> >
> > Changes in v13:
> > - Move the definition of 'off_cpu_thresh_ns' to the same commit as
> > dumping off-cpu samples in BPF, and give off_cpu_thresh_ns a default
> > value before the --off-cpu-thresh option is parsed.
> >
> > Changes in v12:
> > - Restore patches' bisectability, because the ordering of patches has
> > changed.
> > - Change 'us = ms * 1000' to 'us = ms * USEC_PER_MSEC'
> >
> > Changes in v11:
> > - Modify the options used in the off-cpu tests, as I changed the unit
> > of the off-cpu threshold to milliseconds.
> >
> > Changes in v10:
> > - Move the commit "perf record --off-cpu: Add --off-cpu-thresh option"
> > to where the direct sample feature is completed.
> > - Make --off-cpu-thresh use milliseconds as the unit.
> >
> > Changes in v9:
> > - Add documentation for the new option '--off-cpu-thresh', and include
> > an example of its usage in the commit message
> > - Set inherit in evsel__config() to prevent future modifications
> > - Support off-cpu sample data collected by perf before this patch series
> >
> > Changes in v8:
> > - Make this series bisectable
> > - Rename off_cpu_thresh to off_cpu_thresh_us and offcpu_thresh (in BPF)
> > to offcpu_thresh_ns for clarity
> > - Add commit messages to 'perf evsel: Expose evsel__is_offcpu_event()
> > for future use' commit
> > - Correct spelling mistakes in the commit message (s/is should be/should be/)
> > - Add kernel-doc comments to off_cpu_dump(), and comments to the empty
> > if block
> > - Add some comments to off-cpu test
> > - Delete an unused variable 'timestamp' in off_cpu_dump()
> >
> > Changes in v7:
> > - Make off-cpu event system-wide
> > - Use strtoull instead of strtoul
> > - Delete unused variable such as sample_id, and sample_type
> > - Use i as index to update BPF perf_event map
> > - MAX_OFFCPU_LEN 128 is too big, make it smaller.
> > - Delete some bound check as it's always guaranteed
> > - Do not set ip_pos in BPF
> > - Add a new field for storing stack traces in the tstamp map
> > - Dump the off-cpu sample directly or save it in the off_cpu map, not both
> > - Delete the sample_type_off_cpu check
> > - Use __set_off_cpu_sample() to parse samples instead of a two-pass parsing
> >
> > Changes in v6:
> > - Make patches bisectable
> >
> > Changes in v5:
> > - Delete unnecessary copy in BPF program
> > - Remove sample_embed from perf header, hard code off-cpu stuff instead
> > - Move evsel__is_offcpu_event() to evsel.h
> > - Minor changes to the test
> > - Edit some comments
> >
> > Changes in v4:
> > - Minimize the size of data output by perf_event_output()
> > - Keep only one off-cpu event
> > - Change off-cpu threshold's unit to microseconds
> > - Set a default off-cpu threshold
> > - Print the correct error message for the field 'embed' in perf data header
> >
> > Changes in v3:
> > - Add off-cpu-thresh argument
> > - Process direct off-cpu samples in post
> >
> > Changes in v2:
> > - Remove unnecessary comments.
> > - Rename function off_cpu_change_type to off_cpu_prepare_parse
> >
> > v1:
> >
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record is exiting. This
> > results in off-cpu samples being after the regular samples. This patch
> > series makes possible dumping off-cpu samples on-the-fly, directly into
> > perf ring buffer. And it dispatches those samples to the correct format
> > for perf.data consumers.
> >
> > Before:
> > ```
> > migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> > perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> > perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> > perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> > migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> > migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> >
> > sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
> > 79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> > 585690935cca [unknown] (/usr/bin/sshd)
> > ```
> >
> > After:
> > ```
> > perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> > perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> > swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
> > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > 7fff24e862d8 [unknown] ([unknown])
> >
> >
> > blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
> > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > 7fff24e862d8 [unknown] ([unknown])
> >
> >
> > swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > ```
> >
> > Howard Chu (10):
> > perf evsel: Expose evsel__is_offcpu_event() for future use
> > perf record --off-cpu: Parse off-cpu event
> > perf record --off-cpu: Preparation of off-cpu BPF program
> > perf record --off-cpu: Dump off-cpu samples in BPF
> > perf evsel: Assemble offcpu samples
> > perf record --off-cpu: Disable perf_event's callchain collection
> > perf script: Display off-cpu samples correctly
> > perf record --off-cpu: Dump the remaining samples in BPF's stack trace
> > map
> > perf record --off-cpu: Add --off-cpu-thresh option
> > perf test: Add direct off-cpu test
>
> Thanks for your work and the persistence.
I think you misspelled 'procrastination' :). All jokes aside, thank
you for the reviews—it’s much harder on your side.
Thanks,
Howard
>
> Reviewed-by: Namhyung Kim <namhyung@kernel.org>
>
> Thanks,
> Namhyung
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly
2024-12-16 20:11 ` [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly Namhyung Kim
2024-12-16 20:49 ` Howard Chu
@ 2025-02-11 18:02 ` Arnaldo Carvalho de Melo
2025-02-11 18:26 ` Arnaldo Carvalho de Melo
2025-02-11 18:30 ` Howard Chu
1 sibling, 2 replies; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-02-11 18:02 UTC (permalink / raw)
To: Howard Chu
Cc: Namhyung Kim, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel
On Mon, Dec 16, 2024 at 12:11:07PM -0800, Namhyung Kim wrote:
> On Sun, Dec 15, 2024 at 10:12:10AM -0800, Howard Chu wrote:
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record is exiting. This
> > results in off-cpu samples being after the regular samples. This patch
> > series makes possible dumping off-cpu samples on-the-fly, directly into
> > perf ring buffer. And it dispatches those samples to the correct format
> > for perf.data consumers.
> >
> > Before:
> > ```
> > migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> > perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> > perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> > perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> > migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> > migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> >
> > sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
> > 79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> > 585690935cca [unknown] (/usr/bin/sshd)
> > ```
> >
> > After:
> > ```
> > perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> > perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> > swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
> > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > 7fff24e862d8 [unknown] ([unknown])
> >
> >
> > blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
> > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > 7fff24e862d8 [unknown] ([unknown])
> >
> >
> > swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > ```
So I'm trying to reproduce your results, with:
root@number:~# perf record --off-cpu sleep 0.6
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
root@number:~# perf trace -e *sleep sleep 0.6
0.000 (600.137 ms): sleep/466769 clock_nanosleep(rqtp: { .tv_sec: 0, .tv_nsec: 600000000 }, rmtp: 0x7ffe024dea30) = 0
root@number:~#
So a simple test that crosses that 500ms threshold and thus should
produce an offcpu-time sample, right?
root@number:~# perf record --off-cpu sleep 0.6
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
root@number:~# perf evlist
cpu_atom/cycles/P
cpu_core/cycles/P
offcpu-time
dummy:u
root@number:~#
root@number:~# perf script | grep offcpu
root@number:~#
What I have:
⬢ [acme@toolbox perf-tools-next]$ git log --oneline -11
cd59081880e89df8 (HEAD -> perf-tools-next) perf test: Add direct off-cpu test
56cbd794c0c46ba9 perf record --off-cpu: Add --off-cpu-thresh option
28d9b19c5455556f perf record --off-cpu: Dump the remaining samples in BPF's stack trace map
2bc05b02743b50a7 perf script: Display off-cpu samples correctly
bfa457a621596947 perf record --off-cpu: Disable perf_event's callchain collection
eca732cc42d20266 perf evsel: Assemble offcpu samples
74ce50e40c569e90 perf record --off-cpu: Dump off-cpu samples in BPF
e75f8ce63bfa6cb9 perf record --off-cpu: Preparation of off-cpu BPF program
0ffab9d26971c91c perf record --off-cpu: Parse off-cpu event
efc3fe2070853b7d perf evsel: Expose evsel__is_offcpu_event() for future use
357b965deba9fb71 (perf-tools-next/perf-tools-next) perf stat: Changes to event name uniquification
⬢ [acme@toolbox perf-tools-next]$
The only change was the one I mentioned about 'struct __stack'
replacing 'struct stack' to avoid the clash with the vmlinux.h generated
using GEN_VMLINUX_H=1.
Perhaps that is because my test is not being system wide, if I use '-a'
we get:
root@number:~# perf record --off-cpu -a sleep 0.6
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.868 MB perf.data (443 samples) ]
root@number:~# perf script | grep offcpu
firefox 4010 [004] 16746.571035: 518930922 offcpu-time:
sleep 466955 [019] 16746.629875: 600279888 offcpu-time:
perf 466953 [027] 16746.630541: 601731496 offcpu-time:
IPC I/O Parent 4039 [000] 18446744069.414584: 20781 offcpu-time:
TaskCon~ller #3 4454 [000] 18446744069.414584: 117724 offcpu-time:
Isolated Servic 4253 [000] 18446744069.414584: 387311029 offcpu-time:
IPC I/O Parent 4039 [000] 18446744069.414584: 520075456 offcpu-time:
Timer 4040 [000] 18446744069.414584: 581156755 offcpu-time:
Timer 4380 [000] 18446744069.414584: 22520305 offcpu-time:
sqldb:c~lite #6 4327 [000] 18446744069.414584: 2073583 offcpu-time:
Isolated Web Co 156033 [000] 18446744069.414584: 23277 offcpu-time:
sqldb:c~lite #6 4327 [000] 18446744069.414584: 24974 offcpu-time:
dbus-broker 2526 [000] 18446744069.414584: 1175447 offcpu-time:
IPDL Background 4043 [000] 18446744069.414584: 12055 offcpu-time:
glean.dispatche 4037 [000] 18446744069.414584: 62321 offcpu-time:
podman 12139 [000] 18446744069.414584: 502656763 offcpu-time:
gnome-shell 2709 [000] 18446744069.414584: 579190477 offcpu-time:
podman 3936 [000] 18446744069.414584: 60777801 offcpu-time:
KMS thread 2739 [000] 18446744069.414584: 14578641 offcpu-time:
IPC I/O Child 7113 [000] 18446744069.414584: 386343655 offcpu-time:
WaylandProxy 4027 [000] 18446744069.414584: 46989879 offcpu-time:
Timer 156048 [000] 18446744069.414584: 404662799 offcpu-time:
URL Classifier 4234 [000] 18446744069.414584: 55337 offcpu-time:
firefox 4010 [000] 18446744069.414584: 57229 offcpu-time:
Isolated Web Co 4261 [000] 18446744069.414584: 26671 offcpu-time:
podman 12130 [000] 18446744069.414584: 450853233 offcpu-time:
firefox 4010 [000] 18446744069.414584: 56800 offcpu-time:
podman 3951 [000] 18446744069.414584: 502100761 offcpu-time:
podman 12131 [000] 18446744069.414584: 501954466 offcpu-time:
Timer 4306 [000] 18446744069.414584: 360125199 offcpu-time:
RemoteLzyStream 4230 [000] 18446744069.414584: 172647 offcpu-time:
TaskCon~ller #1 4452 [000] 18446744069.414584: 38083 offcpu-time:
Isolated Web Co 156033 [000] 18446744069.414584: 496574141 offcpu-time:
Socket Thread 4042 [000] 18446744069.414584: 295030041 offcpu-time:
IPDL Background 4043 [000] 18446744069.414584: 13990 offcpu-time:
Inotify~tThread 7430 [000] 18446744069.414584: 300841349 offcpu-time:
Timer 7186 [000] 18446744069.414584: 117536 offcpu-time:
Backgro~ol #163 466653 [000] 18446744069.414584: 29312 offcpu-time:
Isolated Web Co 7107 [000] 18446744069.414584: 418466228 offcpu-time:
Isolated Web Co 156033 [000] 18446744069.414584: 22049 offcpu-time:
Socket Thread 4295 [000] 18446744069.414584: 309685524 offcpu-time:
podman 12130 [000] 18446744069.414584: 60960566 offcpu-time:
Socket Thread 4042 [000] 18446744069.414584: 9726 offcpu-time:
glean.dispatche 4037 [000] 18446744069.414584: 30560 offcpu-time:
TaskCon~ller #0 4451 [000] 18446744069.414584: 3661284 offcpu-time:
URL Classifier 4234 [000] 18446744069.414584: 18112 offcpu-time:
Isolated Web Co 4261 [000] 18446744069.414584: 380745777 offcpu-time:
IPC I/O Child 7039 [000] 18446744069.414584: 54365677 offcpu-time:
IPC I/O Child 4292 [000] 18446744069.414584: 386687065 offcpu-time:
glean.dispatche 4037 [000] 18446744069.414584: 48091 offcpu-time:
firefox 4010 [000] 18446744069.414584: 570509826 offcpu-time:
podman 3936 [000] 18446744069.414584: 451176929 offcpu-time:
Isolated Web Co 7034 [000] 18446744069.414584: 57005742 offcpu-time:
gnome-terminal- 3583 [000] 18446744069.414584: 97027967 offcpu-time:
sqldb:c~lite #6 4327 [000] 18446744069.414584: 27118 offcpu-time:
TaskCon~ller #1 4452 [000] 18446744069.414584: 20008 offcpu-time:
IPC I/O Parent 4039 [000] 18446744069.414584: 58205 offcpu-time:
Timer 157669 [000] 18446744069.414584: 180649 offcpu-time:
IPC I/O Child 4265 [000] 18446744069.414584: 387442919 offcpu-time:
firefox 4010 [000] 18446744069.414584: 215006 offcpu-time:
podman 3949 [000] 18446744069.414584: 502715978 offcpu-time:
firefox 4010 [000] 18446744069.414584: 94719 offcpu-time:
gdbus 4032 [000] 18446744069.414584: 1330625 offcpu-time:
Worker Launcher 4348 [000] 18446744069.414584: 83393104 offcpu-time:
Socket Thread 4042 [000] 18446744069.414584: 48713 offcpu-time:
IPDL Background 4043 [000] 18446744069.414584: 386560332 offcpu-time:
Timer 7123 [000] 18446744069.414584: 422634111 offcpu-time:
Isolated Web Co 4358 [000] 18446744069.414584: 18063623 offcpu-time:
glean.dispatche 4037 [000] 18446744069.414584: 302249112 offcpu-time:
IPC I/O Parent 4039 [000] 18446744069.414584: 16204 offcpu-time:
Inotify~tThread 7430 [000] 18446744069.414584: 300780100 offcpu-time:
RemoteLzyStream 4230 [000] 18446744069.414584: 14778 offcpu-time:
DOM Worker 10133 [000] 18446744069.414584: 331946 offcpu-time:
URL Classifier 4234 [000] 18446744069.414584: 5003953 offcpu-time:
gdbus 2729 [000] 18446744069.414584: 331119 offcpu-time:
TaskCon~ller #2 4453 [000] 18446744069.414584: 125448 offcpu-time:
firefox 4010 [000] 18446744069.414584: 40894 offcpu-time:
IPC I/O Child 156038 [000] 18446744069.414584: 4079439 offcpu-time:
DOM Worker 10228 [000] 18446744069.414584: 3397046 offcpu-time:
root@number:~#
With 'offcpu-time' samples intermixed with the other cycles events:
<SNIP>
Isolated Web Co 4261 [010] 16746.570827: 287731 cpu_core/cycles/P: ffffffff8b145b24 __get_user_nocheck_8+0x4 ([kernel.kallsyms])
Isolated Web Co 7107 [006] 16746.570834: 235654 cpu_core/cycles/P: 7f08bf336592 pthread_mutex_trylock@@GLIBC_2.34+0xe2 (/usr/lib64/libc.so.6)
firefox 4010 [004] 16746.571035: 518930922 offcpu-time:
7f2dab89eda9 dlsym@@GLIBC_2.34+0x59 (/usr/lib64/libc.so.6)
7f2dab8a17f9 pthread_cond_broadcast@@GLIBC_2.3.2+0x399 (/usr/lib64/libc.so.6)
556309ec5bca mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&)+0xa (/usr/lib64/firefox/firefox)
swapper 0 [019] 16746.571151: 315218 cpu_atom/cycles/P: ffffffff8adf4ad0 menu_select+0x1b0 ([kernel.kallsyms])
<SNIP>
Isolated Web Co 7107 [010] 16746.607631: 288723 cpu_core/cycles/P: 34d114ccce4a [unknown] (/tmp/perf-7107.map)
swapper 0 [008] 16746.622828: 136175 cpu_core/cycles/P: ffffffff8b213d7a intel_idle+0x6a ([kernel.kallsyms])
Timer 4306 [008] 16746.622977: 129681 cpu_core/cycles/P: ffffffff8b134e46 rb_erase+0x46 ([kernel.kallsyms])
swapper 0 [000] 16746.623018: 298081 cpu_core/cycles/P: ffffffff8a13d3ea update_rq_clock+0x7a ([kernel.kallsyms])
swapper 0 [019] 16746.629839: 256701 cpu_atom/cycles/P: ffffffff8a148455 __dequeue_entity+0x5 ([kernel.kallsyms])
sleep 466955 [019] 16746.629875: 600279888 offcpu-time:
7f6feb8bf497 clock_nanosleep@GLIBC_2.2.5+0x27 (/usr/lib64/libc.so.6)
5608d1601a9f [unknown] (/usr/bin/sleep)
7f6feb805088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
7f6feb80514b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
5608d1601c05 [unknown] (/usr/bin/sleep)
sleep 466955 [006] 16746.630223: 225700 cpu_core/cycles/P: ffffffff8a40ec5a folio_remove_rmap_ptes+0x1a ([kernel.kallsyms])
:466955 466955 [006] 16746.630442: 225700 cpu_core/cycles/P: ffffffff8b1387eb timerqueue_add+0x3b ([kernel.kallsyms])
perf 466953 [027] 16746.630541: 601731496 offcpu-time:
7f818451c1fd __poll+0x4d (/usr/lib64/libc.so.6)
49606e fdarray__poll+0x2e (/home/acme/bin/perf)
429eed __cmd_record.constprop.0+0x18ad (/home/acme/bin/perf)
42c77e cmd_record+0xd9e (/home/acme/bin/perf)
495700 run_builtin+0x70 (/home/acme/bin/perf)
495a1b handle_internal_command+0xab (/home/acme/bin/perf)
413473 main+0x2f3 (/home/acme/bin/perf)
7f8184438088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
7f818443814b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
413ac5 _start+0x25 (/home/acme/bin/perf)
perf 466953 [000] 16746.630831: 298081 cpu_core/cycles/P: ffffffff8b40012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
swapper 0 [019] 16746.630985: 250096 cpu_atom/cycles/P: ffffffff8b213d7a intel_idle+0x6a ([kernel.kallsyms])
perf 466953 [010] 16746.632045: 286467 cpu_core/cycles/P: ffffffff8a1c3f21 __rcu_read_lock+0x11 ([kernel.kallsyms])
IPC I/O Parent 4039 [000] 18446744069.414584: 20781 offcpu-time:
7f2dab89f000 __GI___pthread_enable_asynccancel+0x20 (/usr/lib64/libc.so.6)
556309ebede4 moz_xmalloc+0x104 (/usr/lib64/firefox/firefox)
3bbdf3d834830 [unknown] ([unknown])
And if I install the 'coreutils' debuginfo I get the sleep backtraces
resolved, nice, but again only when I'm running in system wide (-a)
mode.
Isolated Web Co 7167 [008] 16915.115432: 188682 cpu_core/cycles/P: 7fe85f558260 nsBaseHashtable<nsStringHashKey, mozilla::UniquePtr<nsBaseHashtable<nsIntegralHashKey<unsigned lo>
swapper 0 [013] 16915.126739: 120783 cpu_core/cycles/P: ffffffff8a15221d enqueue_task_fair+0x22d ([kernel.kallsyms])
sleep 467147 [013] 16915.126780: 600336127 offcpu-time:
7ff010611497 clock_nanosleep@GLIBC_2.2.5+0x27 (/usr/lib64/libc.so.6)
563f045a3a9f main+0x4af (/usr/bin/sleep)
7ff010557088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
7ff01055714b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
563f045a3c05 _start+0x25 (/usr/bin/sleep)
sleep 467147 [013] 16915.126861: 120783 cpu_core/cycles/P: ffffffff8a3ef391 unmap_page_range+0x661 ([kernel.kallsyms])
sleep 467147 [013] 16915.126973: 139106 cpu_core/cycles/P: ffffffff8a3a5b4d folios_put_refs+0x8d ([kernel.kallsyms])
swapper 0 [013] 16915.127102: 160383 cpu_core/cycles/P: ffffffff8a14b5c4 sched_balance_update_blocked_averages+0xb4 ([kernel.kallsyms])
When running just one workload:
root@number:~# for a in `seq 10` ; do perf record --off-cpu sleep 1s ; perf script | grep offcpu-time ; done
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (13 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (16 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (14 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (12 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
root@number:~#
Have you tested it like that? Is this a known issue/current limitation?
I'm reading the patches to see I understand this better.
- Arnaldo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly
2025-02-11 18:02 ` Arnaldo Carvalho de Melo
@ 2025-02-11 18:26 ` Arnaldo Carvalho de Melo
2025-02-11 18:30 ` Howard Chu
1 sibling, 0 replies; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-02-11 18:26 UTC (permalink / raw)
To: Howard Chu
Cc: Namhyung Kim, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel
On Tue, Feb 11, 2025 at 07:02:04PM +0100, Arnaldo Carvalho de Melo wrote:
> When running just one workload:
>
> root@number:~# for a in `seq 10` ; do perf record --off-cpu sleep 1s ; perf script | grep offcpu-time ; done
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (13 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (16 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (14 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (12 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> root@number:~#
>
> Have you tested it like that? Is this a known issue/current limitation?
>
> I'm reading the patches to see I understand this better.
Ok, if I go back and use the example Namhyung added to the commit
message introducing off-cpu profiling:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=edc41a1099c2d08ccfd4ed7d59688501e3749015
We get offcpu-time samples:
root@number:~# perf record --off-cpu perf bench sched messaging -l 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.244 [sec]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.182 MB perf.data (43600 samples) ]
root@number:~# perf script | grep offcpu-time | wc -l
867
root@number:~#
But they are all at the end:
sched-messaging 469373 17707.960698: 1055439 cpu_atom/cycles/P: ffffffff8a49a453 mod_objcg_state+0xb3 ([kernel.kallsyms])
sched-messaging 469373 17707.960944: 1057379 cpu_atom/cycles/P: ffffffff8b222cf9 native_queued_spin_lock_slowpath+0x209 ([kernel.kallsyms])
sched-messaging 469274 17707.961299: 6311728 cpu_core/cycles/P: 7fa751a510f1 __cxa_finalize+0x151 (/usr/lib64/libc.so.6)
:469374 469374 [000] 18446744069.414584: 43388310 offcpu-time:
7fa751b1c1c4 [unknown] ([unknown])
4fb0e8 [unknown] ([unknown])
4faa29 [unknown] ([unknown])
495700 [unknown] ([unknown])
495a1b [unknown] ([unknown])
413473 [unknown] ([unknown])
7fa751a38088 [unknown] ([unknown])
7fa751a3814b [unknown] ([unknown])
413ac5 [unknown] ([unknown])
:469518 469518 [000] 18446744069.414584: 22436206 offcpu-time:
7fa751b1c1c4 [unknown] ([unknown])
4fb0e8 [unknown] ([unknown])
4faa29 [unknown] ([unknown])
495700 [unknown] ([unknown])
495a1b [unknown] ([unknown])
413473 [unknown] ([unknown])
7fa751a38088 [unknown] ([unknown])
7fa751a3814b [unknown] ([unknown])
413ac5 [unknown] ([unknown])
:469663 469663 [000] 18446744069.414584: 180198700 offcpu-time:
7fa751b1d304 [unknown] ([unknown])
4faa29 [unknown] ([unknown])
495700 [unknown] ([unknown])
495a1b [unknown] ([unknown])
413473 [unknown] ([unknown])
7fa751a38088 [unknown] ([unknown])
7fa751a3814b [unknown] ([unknown])
413ac5 [unknown] ([unknown])
possibly the threshold?
Tried:
root@number:~# perf record --off-cpu --off-cpu-thresh 10 perf bench sched messaging -l 1000
Didn't help.
When doing a system wide session with that 10 threshold I get the
offcpu-time samples intermixed, i.e. not just at the end:
root@number:~# perf record -a --off-cpu --off-cpu-thresh 10 perf bench sched messaging -l 1000
sched-messaging 472324 [007] 18587.393976: 1371787 cpu_core/cycles/P: ffffffff8a7864d2 avc_has_perm+0x92 ([kernel.kallsyms])
swapper 0 [023] 18587.394111: 286156 cpu_atom/cycles/P: ffffffff8a16aacb do_idle+0x14b ([kernel.kallsyms])
sched-messaging 472532 [016] 18587.394210: 294672 cpu_atom/cycles/P: ffffffff8b4000ab entry_SYSCALL_64_safe_stack+0x0 ([kernel.kallsyms])
sched-messaging 472324 [007] 18587.394226: 1371903 cpu_core/cycles/P: ffffffff8a3f6308 copy_page_range+0x12a8 ([kernel.kallsyms])
podman 3935 [012] 18587.394310: 10106482 offcpu-time:
5630f6bf2457 [unknown] (/usr/bin/podman)
5630f6bc49e5 [unknown] (/usr/bin/podman)
5630f6bbb1f5 [unknown] (/usr/bin/podman)
5630f6bbb136 [unknown] (/usr/bin/podman)
5630f6beeb65 [unknown] (/usr/bin/podman)
7f4474032088 start_thread+0x3b8 (/usr/lib64/libc.so.6)
7f44740b5f8c clone3+0x2c (/usr/lib64/libc.so.6)
sched-messaging 472534 [016] 18587.394434: 279374 cpu_atom/cycles/P: ffffffff8a3826c9 get_callchain_entry+0x9 ([kernel.kallsyms])
sched-messaging 472324 [007] 18587.394476: 1371988 cpu_core/cycles/P: ffffffff8a43d264 __kmalloc_node_noprof+0xd4 ([kernel.kallsyms])
swapper 0 [016] 18587.394717: 276534 cpu_atom/cycles/P: ffffffff8a1d2ef4 rcu_preempt_deferred_qs+0x4 ([kernel.kallsyms])
<SNIP>
sched-messaging 472674 [013] 18587.414755: 6576247 cpu_core/cycles/P: ffffffff8b222d7d native_queued_spin_lock_slowpath+0x28d ([kernel.kallsyms])
sched-messaging 472625 [022] 18587.414760: 10699110 offcpu-time:
7f6b1a71d304 __GI___libc_write+0x14 (/usr/lib64/libc.so.6)
4fb0e8 bench_sched_messaging+0x218 (/home/acme/bin/perf)
4faa29 cmd_bench+0x2f9 (/home/acme/bin/perf)
495700 run_builtin+0x70 (/home/acme/bin/perf)
495a1b handle_internal_command+0xab (/home/acme/bin/perf)
413473 main+0x2f3 (/home/acme/bin/perf)
7f6b1a638088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
7f6b1a63814b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
413ac5 _start+0x25 (/home/acme/bin/perf)
sched-messaging 472623 [023] 18587.414763: 10911814 offcpu-time:
7f6b1a71d304 __GI___libc_write+0x14 (/usr/lib64/libc.so.6)
4fb0e8 bench_sched_messaging+0x218 (/home/acme/bin/perf)
4faa29 cmd_bench+0x2f9 (/home/acme/bin/perf)
495700 run_builtin+0x70 (/home/acme/bin/perf)
495a1b handle_internal_command+0xab (/home/acme/bin/perf)
413473 main+0x2f3 (/home/acme/bin/perf)
7f6b1a638088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
7f6b1a63814b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
413ac5 _start+0x25 (/home/acme/bin/perf)
<SNIP>
So I'm guessing that running just the 'sleep' workload we are missing
the sample at some point while with the 'perf bench' we have way more
samples, etc.
Also the intermixed entries have backtraces resolved to symbols, while
the ones at the end don't have, something else to investigate.
- Arnaldo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly
2025-02-11 18:02 ` Arnaldo Carvalho de Melo
2025-02-11 18:26 ` Arnaldo Carvalho de Melo
@ 2025-02-11 18:30 ` Howard Chu
2025-02-11 18:49 ` Arnaldo Carvalho de Melo
1 sibling, 1 reply; 23+ messages in thread
From: Howard Chu @ 2025-02-11 18:30 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Namhyung Kim, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel
Hello Arnaldo,
On Tue, Feb 11, 2025 at 10:02 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Mon, Dec 16, 2024 at 12:11:07PM -0800, Namhyung Kim wrote:
> > On Sun, Dec 15, 2024 at 10:12:10AM -0800, Howard Chu wrote:
> > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > >
> > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > results in off-cpu samples being after the regular samples. This patch
> > > series makes possible dumping off-cpu samples on-the-fly, directly into
> > > perf ring buffer. And it dispatches those samples to the correct format
> > > for perf.data consumers.
> > >
> > > Before:
> > > ```
> > > migration/0 21 [000] 27981.041319: 2944637851 cycles:P: ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
> > > perf 770116 [001] 27981.041375: 1 cycles:P: ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
> > > perf 770116 [001] 27981.041377: 1 cycles:P: ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
> > > perf 770116 [001] 27981.041379: 51611 cycles:P: ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
> > > migration/1 26 [001] 27981.041400: 4227682775 cycles:P: ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
> > > migration/2 32 [002] 27981.041477: 4159401534 cycles:P: ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])
> > >
> > > sshd 708098 [000] 18446744069.414584: 286392 offcpu-time:
> > > 79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
> > > 585690935cca [unknown] (/usr/bin/sshd)
> > > ```
> > >
> > > After:
> > > ```
> > > perf 774767 [003] 28178.033444: 497 cycles:P: ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
> > > perf 774767 [003] 28178.033445: 399440 cycles:P: ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
> > > swapper 0 [001] 28178.036639: 376650973 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > > swapper 0 [003] 28178.182921: 348779378 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > > blueman-tray 1355 [000] 28178.627906: 100184571 offcpu-time:
> > > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > > 7fff24e862d8 [unknown] ([unknown])
> > >
> > >
> > > blueman-tray 1355 [000] 28178.728137: 100187539 offcpu-time:
> > > 7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
> > > 7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
> > > 7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
> > > 7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
> > > 7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
> > > 7fff24e862d8 [unknown] ([unknown])
> > >
> > >
> > > swapper 0 [000] 28178.463253: 195945410 cycles:P: ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
> > > dbus-broker 412 [002] 28178.464855: 376737008 cycles:P: ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
> > > ```
>
> So I'm trying to reproduce your results, with:
>
> root@number:~# perf record --off-cpu sleep 0.6
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> root@number:~# perf trace -e *sleep sleep 0.6
> 0.000 (600.137 ms): sleep/466769 clock_nanosleep(rqtp: { .tv_sec: 0, .tv_nsec: 600000000 }, rmtp: 0x7ffe024dea30) = 0
> root@number:~#
>
> So a simple test that crosses that 500ms threshold and thus should
> produce an offcpu-time sample, right?
>
> root@number:~# perf record --off-cpu sleep 0.6
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> root@number:~# perf evlist
> cpu_atom/cycles/P
> cpu_core/cycles/P
> offcpu-time
> dummy:u
> root@number:~#
>
> root@number:~# perf script | grep offcpu
> root@number:~#
>
> What I have:
>
> ⬢ [acme@toolbox perf-tools-next]$ git log --oneline -11
> cd59081880e89df8 (HEAD -> perf-tools-next) perf test: Add direct off-cpu test
> 56cbd794c0c46ba9 perf record --off-cpu: Add --off-cpu-thresh option
> 28d9b19c5455556f perf record --off-cpu: Dump the remaining samples in BPF's stack trace map
> 2bc05b02743b50a7 perf script: Display off-cpu samples correctly
> bfa457a621596947 perf record --off-cpu: Disable perf_event's callchain collection
> eca732cc42d20266 perf evsel: Assemble offcpu samples
> 74ce50e40c569e90 perf record --off-cpu: Dump off-cpu samples in BPF
> e75f8ce63bfa6cb9 perf record --off-cpu: Preparation of off-cpu BPF program
> 0ffab9d26971c91c perf record --off-cpu: Parse off-cpu event
> efc3fe2070853b7d perf evsel: Expose evsel__is_offcpu_event() for future use
> 357b965deba9fb71 (perf-tools-next/perf-tools-next) perf stat: Changes to event name uniquification
> ⬢ [acme@toolbox perf-tools-next]$
>
> The only change was the one I mentioned about 'struct __stack'
> replacing 'struct stack' to avoid the clash with the vmlinux.h generated
> using GEN_VMLINUX_H=1.
>
> Perhaps that is because my test is not being system wide, if I use '-a'
> we get:
>
> root@number:~# perf record --off-cpu -a sleep 0.6
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.868 MB perf.data (443 samples) ]
> root@number:~# perf script | grep offcpu
> firefox 4010 [004] 16746.571035: 518930922 offcpu-time:
> sleep 466955 [019] 16746.629875: 600279888 offcpu-time:
> perf 466953 [027] 16746.630541: 601731496 offcpu-time:
> IPC I/O Parent 4039 [000] 18446744069.414584: 20781 offcpu-time:
> TaskCon~ller #3 4454 [000] 18446744069.414584: 117724 offcpu-time:
> Isolated Servic 4253 [000] 18446744069.414584: 387311029 offcpu-time:
> IPC I/O Parent 4039 [000] 18446744069.414584: 520075456 offcpu-time:
> Timer 4040 [000] 18446744069.414584: 581156755 offcpu-time:
> Timer 4380 [000] 18446744069.414584: 22520305 offcpu-time:
> sqldb:c~lite #6 4327 [000] 18446744069.414584: 2073583 offcpu-time:
> Isolated Web Co 156033 [000] 18446744069.414584: 23277 offcpu-time:
> sqldb:c~lite #6 4327 [000] 18446744069.414584: 24974 offcpu-time:
> dbus-broker 2526 [000] 18446744069.414584: 1175447 offcpu-time:
> IPDL Background 4043 [000] 18446744069.414584: 12055 offcpu-time:
> glean.dispatche 4037 [000] 18446744069.414584: 62321 offcpu-time:
> podman 12139 [000] 18446744069.414584: 502656763 offcpu-time:
> gnome-shell 2709 [000] 18446744069.414584: 579190477 offcpu-time:
> podman 3936 [000] 18446744069.414584: 60777801 offcpu-time:
> KMS thread 2739 [000] 18446744069.414584: 14578641 offcpu-time:
> IPC I/O Child 7113 [000] 18446744069.414584: 386343655 offcpu-time:
> WaylandProxy 4027 [000] 18446744069.414584: 46989879 offcpu-time:
> Timer 156048 [000] 18446744069.414584: 404662799 offcpu-time:
> URL Classifier 4234 [000] 18446744069.414584: 55337 offcpu-time:
> firefox 4010 [000] 18446744069.414584: 57229 offcpu-time:
> Isolated Web Co 4261 [000] 18446744069.414584: 26671 offcpu-time:
> podman 12130 [000] 18446744069.414584: 450853233 offcpu-time:
> firefox 4010 [000] 18446744069.414584: 56800 offcpu-time:
> podman 3951 [000] 18446744069.414584: 502100761 offcpu-time:
> podman 12131 [000] 18446744069.414584: 501954466 offcpu-time:
> Timer 4306 [000] 18446744069.414584: 360125199 offcpu-time:
> RemoteLzyStream 4230 [000] 18446744069.414584: 172647 offcpu-time:
> TaskCon~ller #1 4452 [000] 18446744069.414584: 38083 offcpu-time:
> Isolated Web Co 156033 [000] 18446744069.414584: 496574141 offcpu-time:
> Socket Thread 4042 [000] 18446744069.414584: 295030041 offcpu-time:
> IPDL Background 4043 [000] 18446744069.414584: 13990 offcpu-time:
> Inotify~tThread 7430 [000] 18446744069.414584: 300841349 offcpu-time:
> Timer 7186 [000] 18446744069.414584: 117536 offcpu-time:
> Backgro~ol #163 466653 [000] 18446744069.414584: 29312 offcpu-time:
> Isolated Web Co 7107 [000] 18446744069.414584: 418466228 offcpu-time:
> Isolated Web Co 156033 [000] 18446744069.414584: 22049 offcpu-time:
> Socket Thread 4295 [000] 18446744069.414584: 309685524 offcpu-time:
> podman 12130 [000] 18446744069.414584: 60960566 offcpu-time:
> Socket Thread 4042 [000] 18446744069.414584: 9726 offcpu-time:
> glean.dispatche 4037 [000] 18446744069.414584: 30560 offcpu-time:
> TaskCon~ller #0 4451 [000] 18446744069.414584: 3661284 offcpu-time:
> URL Classifier 4234 [000] 18446744069.414584: 18112 offcpu-time:
> Isolated Web Co 4261 [000] 18446744069.414584: 380745777 offcpu-time:
> IPC I/O Child 7039 [000] 18446744069.414584: 54365677 offcpu-time:
> IPC I/O Child 4292 [000] 18446744069.414584: 386687065 offcpu-time:
> glean.dispatche 4037 [000] 18446744069.414584: 48091 offcpu-time:
> firefox 4010 [000] 18446744069.414584: 570509826 offcpu-time:
> podman 3936 [000] 18446744069.414584: 451176929 offcpu-time:
> Isolated Web Co 7034 [000] 18446744069.414584: 57005742 offcpu-time:
> gnome-terminal- 3583 [000] 18446744069.414584: 97027967 offcpu-time:
> sqldb:c~lite #6 4327 [000] 18446744069.414584: 27118 offcpu-time:
> TaskCon~ller #1 4452 [000] 18446744069.414584: 20008 offcpu-time:
> IPC I/O Parent 4039 [000] 18446744069.414584: 58205 offcpu-time:
> Timer 157669 [000] 18446744069.414584: 180649 offcpu-time:
> IPC I/O Child 4265 [000] 18446744069.414584: 387442919 offcpu-time:
> firefox 4010 [000] 18446744069.414584: 215006 offcpu-time:
> podman 3949 [000] 18446744069.414584: 502715978 offcpu-time:
> firefox 4010 [000] 18446744069.414584: 94719 offcpu-time:
> gdbus 4032 [000] 18446744069.414584: 1330625 offcpu-time:
> Worker Launcher 4348 [000] 18446744069.414584: 83393104 offcpu-time:
> Socket Thread 4042 [000] 18446744069.414584: 48713 offcpu-time:
> IPDL Background 4043 [000] 18446744069.414584: 386560332 offcpu-time:
> Timer 7123 [000] 18446744069.414584: 422634111 offcpu-time:
> Isolated Web Co 4358 [000] 18446744069.414584: 18063623 offcpu-time:
> glean.dispatche 4037 [000] 18446744069.414584: 302249112 offcpu-time:
> IPC I/O Parent 4039 [000] 18446744069.414584: 16204 offcpu-time:
> Inotify~tThread 7430 [000] 18446744069.414584: 300780100 offcpu-time:
> RemoteLzyStream 4230 [000] 18446744069.414584: 14778 offcpu-time:
> DOM Worker 10133 [000] 18446744069.414584: 331946 offcpu-time:
> URL Classifier 4234 [000] 18446744069.414584: 5003953 offcpu-time:
> gdbus 2729 [000] 18446744069.414584: 331119 offcpu-time:
> TaskCon~ller #2 4453 [000] 18446744069.414584: 125448 offcpu-time:
> firefox 4010 [000] 18446744069.414584: 40894 offcpu-time:
> IPC I/O Child 156038 [000] 18446744069.414584: 4079439 offcpu-time:
> DOM Worker 10228 [000] 18446744069.414584: 3397046 offcpu-time:
> root@number:~#
>
> With 'offcpu-time' samples intermixed with the other cycles events:
>
> <SNIP>
> Isolated Web Co 4261 [010] 16746.570827: 287731 cpu_core/cycles/P: ffffffff8b145b24 __get_user_nocheck_8+0x4 ([kernel.kallsyms])
> Isolated Web Co 7107 [006] 16746.570834: 235654 cpu_core/cycles/P: 7f08bf336592 pthread_mutex_trylock@@GLIBC_2.34+0xe2 (/usr/lib64/libc.so.6)
> firefox 4010 [004] 16746.571035: 518930922 offcpu-time:
> 7f2dab89eda9 dlsym@@GLIBC_2.34+0x59 (/usr/lib64/libc.so.6)
> 7f2dab8a17f9 pthread_cond_broadcast@@GLIBC_2.3.2+0x399 (/usr/lib64/libc.so.6)
> 556309ec5bca mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&)+0xa (/usr/lib64/firefox/firefox)
>
> swapper 0 [019] 16746.571151: 315218 cpu_atom/cycles/P: ffffffff8adf4ad0 menu_select+0x1b0 ([kernel.kallsyms])
> <SNIP>
> Isolated Web Co 7107 [010] 16746.607631: 288723 cpu_core/cycles/P: 34d114ccce4a [unknown] (/tmp/perf-7107.map)
> swapper 0 [008] 16746.622828: 136175 cpu_core/cycles/P: ffffffff8b213d7a intel_idle+0x6a ([kernel.kallsyms])
> Timer 4306 [008] 16746.622977: 129681 cpu_core/cycles/P: ffffffff8b134e46 rb_erase+0x46 ([kernel.kallsyms])
> swapper 0 [000] 16746.623018: 298081 cpu_core/cycles/P: ffffffff8a13d3ea update_rq_clock+0x7a ([kernel.kallsyms])
> swapper 0 [019] 16746.629839: 256701 cpu_atom/cycles/P: ffffffff8a148455 __dequeue_entity+0x5 ([kernel.kallsyms])
> sleep 466955 [019] 16746.629875: 600279888 offcpu-time:
> 7f6feb8bf497 clock_nanosleep@GLIBC_2.2.5+0x27 (/usr/lib64/libc.so.6)
> 5608d1601a9f [unknown] (/usr/bin/sleep)
> 7f6feb805088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
> 7f6feb80514b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
> 5608d1601c05 [unknown] (/usr/bin/sleep)
>
> sleep 466955 [006] 16746.630223: 225700 cpu_core/cycles/P: ffffffff8a40ec5a folio_remove_rmap_ptes+0x1a ([kernel.kallsyms])
> :466955 466955 [006] 16746.630442: 225700 cpu_core/cycles/P: ffffffff8b1387eb timerqueue_add+0x3b ([kernel.kallsyms])
> perf 466953 [027] 16746.630541: 601731496 offcpu-time:
> 7f818451c1fd __poll+0x4d (/usr/lib64/libc.so.6)
> 49606e fdarray__poll+0x2e (/home/acme/bin/perf)
> 429eed __cmd_record.constprop.0+0x18ad (/home/acme/bin/perf)
> 42c77e cmd_record+0xd9e (/home/acme/bin/perf)
> 495700 run_builtin+0x70 (/home/acme/bin/perf)
> 495a1b handle_internal_command+0xab (/home/acme/bin/perf)
> 413473 main+0x2f3 (/home/acme/bin/perf)
> 7f8184438088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
> 7f818443814b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
> 413ac5 _start+0x25 (/home/acme/bin/perf)
>
> perf 466953 [000] 16746.630831: 298081 cpu_core/cycles/P: ffffffff8b40012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
> swapper 0 [019] 16746.630985: 250096 cpu_atom/cycles/P: ffffffff8b213d7a intel_idle+0x6a ([kernel.kallsyms])
> perf 466953 [010] 16746.632045: 286467 cpu_core/cycles/P: ffffffff8a1c3f21 __rcu_read_lock+0x11 ([kernel.kallsyms])
> IPC I/O Parent 4039 [000] 18446744069.414584: 20781 offcpu-time:
> 7f2dab89f000 __GI___pthread_enable_asynccancel+0x20 (/usr/lib64/libc.so.6)
> 556309ebede4 moz_xmalloc+0x104 (/usr/lib64/firefox/firefox)
> 3bbdf3d834830 [unknown] ([unknown])
>
>
> And if I install the 'coreutils' debuginfo I get the sleep backtraces
> resolved, nice, but again only when I'm running in system wide (-a)
> mode.
>
> Isolated Web Co 7167 [008] 16915.115432: 188682 cpu_core/cycles/P: 7fe85f558260 nsBaseHashtable<nsStringHashKey, mozilla::UniquePtr<nsBaseHashtable<nsIntegralHashKey<unsigned lo>
> swapper 0 [013] 16915.126739: 120783 cpu_core/cycles/P: ffffffff8a15221d enqueue_task_fair+0x22d ([kernel.kallsyms])
> sleep 467147 [013] 16915.126780: 600336127 offcpu-time:
> 7ff010611497 clock_nanosleep@GLIBC_2.2.5+0x27 (/usr/lib64/libc.so.6)
> 563f045a3a9f main+0x4af (/usr/bin/sleep)
> 7ff010557088 __libc_start_call_main+0x78 (/usr/lib64/libc.so.6)
> 7ff01055714b __libc_start_main@@GLIBC_2.34+0x8b (/usr/lib64/libc.so.6)
> 563f045a3c05 _start+0x25 (/usr/bin/sleep)
>
> sleep 467147 [013] 16915.126861: 120783 cpu_core/cycles/P: ffffffff8a3ef391 unmap_page_range+0x661 ([kernel.kallsyms])
> sleep 467147 [013] 16915.126973: 139106 cpu_core/cycles/P: ffffffff8a3a5b4d folios_put_refs+0x8d ([kernel.kallsyms])
> swapper 0 [013] 16915.127102: 160383 cpu_core/cycles/P: ffffffff8a14b5c4 sched_balance_update_blocked_averages+0xb4 ([kernel.kallsyms])
>
> When running just one workload:
>
> root@number:~# for a in `seq 10` ; do perf record --off-cpu sleep 1s ; perf script | grep offcpu-time ; done
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (13 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (16 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (14 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (12 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> root@number:~#
>
> Have you tested it like that? Is this a known issue/current limitation?
With your command:
Tuel $ for a in `seq 10` ; do sudo perf record --off-cpu sleep 1s ;
sudo perf script | grep offcpu-time ; done
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
:56645 56645 18446744069.414584: 1000127411 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
:56723 56723 18446744069.414584: 1000058038 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
:56778 56778 18446744069.414584: 1000230651 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
:56810 56810 18446744069.414584: 1000140263 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (17 samples) ]
:56877 56877 18446744069.414584: 1001209864 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
:56919 56919 18446744069.414584: 1000230734 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
:56975 56975 18446744069.414584: 1000080484 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
:57008 57008 18446744069.414584: 1000057996 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
:57040 57040 18446744069.414584: 1000084270 offcpu-time:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
:57072 57072 18446744069.414584: 1000081663 offcpu-time:
It works for me... I'm trying to think of a reason. I have no perf config btw.
Thanks,
Howard
>
> I'm reading the patches to see I understand this better.
>
> - Arnaldo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v14 00/10] perf record --off-cpu: Dump off-cpu samples directly
2025-02-11 18:30 ` Howard Chu
@ 2025-02-11 18:49 ` Arnaldo Carvalho de Melo
0 siblings, 0 replies; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-02-11 18:49 UTC (permalink / raw)
To: Howard Chu
Cc: Namhyung Kim, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, linux-perf-users, linux-kernel
On Tue, Feb 11, 2025 at 10:30:16AM -0800, Howard Chu wrote:
> On Tue, Feb 11, 2025 at 10:02 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > When running just one workload:
> > root@number:~# for a in `seq 10` ; do perf record --off-cpu sleep 1s ; perf script | grep offcpu-time ; done
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (13 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (16 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (14 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (12 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
> > root@number:~#
> > Have you tested it like that? Is this a known issue/current limitation?
> With your command:
> Tuel $ for a in `seq 10` ; do sudo perf record --off-cpu sleep 1s ;
> sudo perf script | grep offcpu-time ; done
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
> :56645 56645 18446744069.414584: 1000127411 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
> :56723 56723 18446744069.414584: 1000058038 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
> :56778 56778 18446744069.414584: 1000230651 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
> :56810 56810 18446744069.414584: 1000140263 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (17 samples) ]
> :56877 56877 18446744069.414584: 1001209864 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
> :56919 56919 18446744069.414584: 1000230734 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
> :56975 56975 18446744069.414584: 1000080484 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
> :57008 57008 18446744069.414584: 1000057996 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (9 samples) ]
> :57040 57040 18446744069.414584: 1000084270 offcpu-time:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (8 samples) ]
> :57072 57072 18446744069.414584: 1000081663 offcpu-time:
>
> It works for me... I'm trying to think of a reason. I have no perf config btw.
root@number:~# uname -a
Linux number 6.12.11-100.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jan 23 22:07:15 UTC 2025 x86_64 GNU/Linux
root@number:~# cat ~/.perfconfig
# this file is auto-generated.
[report]
children = no
root@number:~# rm -f ~/.perfconfig
root@number:~# for a in `seq 10` ; do perf record --off-cpu sleep 1s ; perf script | grep offcpu-time ; done
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (14 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (13 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
root@number:~#
Some more data:
root@number:~# perf record --off-cpu sleep 1s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (7 samples) ]
root@number:~# perf evlist
cpu_atom/cycles/P
cpu_core/cycles/P
offcpu-time
dummy:u
root@number:~# perf evlist -v
cpu_atom/cycles/P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, enable_on_exec: 1, precise_ip: 3, sample_id_all: 1
cpu_core/cycles/P: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, enable_on_exec: 1, precise_ip: 3, sample_id_all: 1
offcpu-time: type: 1 (software), size: 136, config: 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER, read_format: ID|LOST, disabled: 1, enable_on_exec: 1, sample_id_all: 1
dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
root@number:~# perf report -D | grep PERF_RECORD_SAMPLE
4294967295 20129389592917 0x9240 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a376ff1 period: 1 addr: 0
4294967295 20129389596787 0x9270 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a01e8b6 period: 1 addr: 0
4294967295 20129389598447 0x92a0 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a053e64 period: 13 addr: 0
4294967295 20129389599368 0x92d0 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a00f052 period: 398 addr: 0
4294967295 20129389600270 0x9300 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8b401fad period: 15226 addr: 0
4294967295 20129389604095 0x9360 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a405930 period: 607346 addr: 0
4294967295 20129389739536 0x94e8 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4002): 474261/474261: 0x7f9bf499fe7c period: 5436975 addr: 0
root@number:~#
root@number:~# perf report -D | grep PERF_RECORD_SAMPLE -A2
4294967295 20129389592917 0x9240 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a376ff1 period: 1 addr: 0
... thread: perf-exec:474261
...... dso: /proc/kcore
--
4294967295 20129389596787 0x9270 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a01e8b6 period: 1 addr: 0
... thread: perf-exec:474261
...... dso: /proc/kcore
--
4294967295 20129389598447 0x92a0 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a053e64 period: 13 addr: 0
... thread: perf-exec:474261
...... dso: /proc/kcore
--
4294967295 20129389599368 0x92d0 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a00f052 period: 398 addr: 0
... thread: perf-exec:474261
...... dso: /proc/kcore
--
4294967295 20129389600270 0x9300 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8b401fad period: 15226 addr: 0
... thread: perf-exec:474261
...... dso: /proc/kcore
--
4294967295 20129389604095 0x9360 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 474261/474261: 0xffffffff8a405930 period: 607346 addr: 0
... thread: sleep:474261
...... dso: /proc/kcore
--
4294967295 20129389739536 0x94e8 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4002): 474261/474261: 0x7f9bf499fe7c period: 5436975 addr: 0
... thread: sleep:474261
...... dso: /usr/lib64/ld-linux-x86-64.so.2
root@number:~#
- Arnaldo
^ permalink raw reply [flat|nested] 23+ messages in thread