linux-perf-users.vger.kernel.org archive mirror
* [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly
@ 2025-02-13 22:59 Howard Chu
  2025-02-13 23:00 ` [PATCH v15 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use Howard Chu
                   ` (10 more replies)
  0 siblings, 11 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 22:59 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu

Changes in v15:
 - Fix workload recording bug pointed out by Arnaldo.
 - Rename struct stack to struct __stack as suggested by Arnaldo.
 - Delete the extra offcpu workload now that recording workload is fixed,
   use 'sleep 1' for testing instead.
 - Add more tests for the off-cpu-thresh option.

Changes in v14:
 - Change the internal off_cpu_thresh_us to off_cpu_thresh_ns, i.e. use
   nsec instead of usec

Changes in v13:
 - Move the definition of 'off_cpu_thresh_ns' to the same commit as
   dumping off-cpu samples in BPF, and give off_cpu_thresh_ns a default
   value before the --off-cpu-thresh option is parsed.

Changes in v12:
 - Restore patches' bisectability, because the ordering of patches has
   changed.
 - Change 'us = ms * 1000' to 'us = ms * USEC_PER_MSEC'

Changes in v11:
 - Modify the options used in the off-cpu tests, as I changed the unit
   of the off-cpu threshold to milliseconds.

Changes in v10:
 - Move the commit "perf record --off-cpu: Add --off-cpu-thresh option"
   to where the direct sample feature is completed.
 - Make --off-cpu-thresh use milliseconds as the unit.

Changes in v9:
 - Add documentation for the new option '--off-cpu-thresh', and include
   an example of its usage in the commit message
 - Set inherit in evsel__config() to prevent future modifications
 - Support off-cpu sample data collected by perf before this patch series

Changes in v8:
 - Make this series bisectable
 - Rename off_cpu_thresh to off_cpu_thresh_us and offcpu_thresh (in BPF)
   to offcpu_thresh_ns for clarity
 - Add commit messages to 'perf evsel: Expose evsel__is_offcpu_event()
   for future use' commit
 - Correct spelling mistakes in the commit message (s/is should be/should be/)
 - Add kernel-doc comments to off_cpu_dump(), and comments to the empty
   if block
 - Add some comments to off-cpu test
 - Delete an unused variable 'timestamp' in off_cpu_dump()

Changes in v7:
 - Make off-cpu event system-wide
 - Use strtoull instead of strtoul
 - Delete unused variables such as sample_id and sample_type
 - Use i as index to update BPF perf_event map
 - MAX_OFFCPU_LEN 128 is too big, make it smaller.
 - Delete some bounds checks, as the bounds are always guaranteed
 - Do not set ip_pos in BPF
 - Add a new field for storing stack traces in the tstamp map
 - Dump the off-cpu sample directly or save it in the off_cpu map, not both
 - Delete the sample_type_off_cpu check
 - Use __set_off_cpu_sample() to parse samples instead of a two-pass parsing

Changes in v6:
 - Make patches bisectable

Changes in v5:
 - Delete unnecessary copy in BPF program
 - Remove sample_embed from perf header, hard code off-cpu stuff instead
 - Move evsel__is_offcpu_event() to evsel.h
 - Minor changes to the test
 - Edit some comments

Changes in v4:
 - Minimize the size of data output by perf_event_output()
 - Keep only one off-cpu event
 - Change off-cpu threshold's unit to microseconds
 - Set a default off-cpu threshold
 - Print the correct error message for the field 'embed' in perf data header

Changes in v3:
 - Add off-cpu-thresh argument
 - Process direct off-cpu samples in post

Changes in v2:
 - Remove unnecessary comments.
 - Rename function off_cpu_change_type to off_cpu_prepare_parse

v1:

As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323

Currently, off-cpu samples are dumped only when perf record exits, so
they all land after the regular samples in perf.data. This patch series
makes it possible to dump off-cpu samples on the fly, directly into the
perf ring buffer, and converts those samples into the format that
perf.data consumers expect.

Before:
```
     migration/0      21 [000] 27981.041319: 2944637851    cycles:P:  ffffffff90d2e8aa record_times+0xa ([kernel.kallsyms])
            perf  770116 [001] 27981.041375:          1    cycles:P:  ffffffff90ee4960 event_function+0xf0 ([kernel.kallsyms])
            perf  770116 [001] 27981.041377:          1    cycles:P:  ffffffff90c184b1 intel_bts_enable_local+0x31 ([kernel.kallsyms])
            perf  770116 [001] 27981.041379:      51611    cycles:P:  ffffffff91a160b0 native_sched_clock+0x30 ([kernel.kallsyms])
     migration/1      26 [001] 27981.041400: 4227682775    cycles:P:  ffffffff90d06a74 wakeup_preempt+0x44 ([kernel.kallsyms])
     migration/2      32 [002] 27981.041477: 4159401534    cycles:P:  ffffffff90d11993 update_load_avg+0x63 ([kernel.kallsyms])

sshd  708098 [000] 18446744069.414584:     286392 offcpu-time: 
	    79a864f1c8bb ppoll+0x4b (/usr/lib/libc.so.6)
	    585690935cca [unknown] (/usr/bin/sshd)
```

After:
```
            perf  774767 [003] 28178.033444:        497           cycles:P:  ffffffff91a160c3 native_sched_clock+0x43 ([kernel.kallsyms])
            perf  774767 [003] 28178.033445:     399440           cycles:P:  ffffffff91c01f8d nmi_restore+0x25 ([kernel.kallsyms])
         swapper       0 [001] 28178.036639:  376650973           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
         swapper       0 [003] 28178.182921:  348779378           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
    blueman-tray    1355 [000] 28178.627906:  100184571 offcpu-time: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])


    blueman-tray    1355 [000] 28178.728137:  100187539 offcpu-time: 
	    7528eef1c39d __poll+0x4d (/usr/lib/libc.so.6)
	    7528edf7d8fd [unknown] (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528edf1af95 g_main_context_iteration+0x35 (/usr/lib/libglib-2.0.so.0.8000.2)
	    7528eda4ab86 g_application_run+0x1f6 (/usr/lib/libgio-2.0.so.0.8000.2)
	    7528ee6aa596 [unknown] (/usr/lib/libffi.so.8.1.4)
	    7fff24e862d8 [unknown] ([unknown])


         swapper       0 [000] 28178.463253:  195945410           cycles:P:  ffffffff91a1ae99 intel_idle+0x59 ([kernel.kallsyms])
     dbus-broker     412 [002] 28178.464855:  376737008           cycles:P:  ffffffff91c000a0 entry_SYSCALL_64+0x20 ([kernel.kallsyms])
```

Howard Chu (10):
  perf evsel: Expose evsel__is_offcpu_event() for future use
  perf record --off-cpu: Parse off-cpu event
  perf record --off-cpu: Preparation of off-cpu BPF program
  perf record --off-cpu: Dump off-cpu samples in BPF
  perf evsel: Assemble offcpu samples
  perf record --off-cpu: Disable perf_event's callchain collection
  perf script: Display off-cpu samples correctly
  perf record --off-cpu: Dump the remaining samples in BPF's stack trace
    map
  perf record --off-cpu: Add --off-cpu-thresh option
  perf test: Add direct off-cpu test

 tools/perf/Documentation/perf-record.txt |   9 ++
 tools/perf/builtin-record.c              |  33 +++++++
 tools/perf/builtin-script.c              |   4 +-
 tools/perf/tests/shell/record_offcpu.sh  |  71 ++++++++++++++
 tools/perf/util/bpf_off_cpu.c            | 118 ++++++++++++++---------
 tools/perf/util/bpf_skel/off_cpu.bpf.c   |  98 ++++++++++++++++++-
 tools/perf/util/evsel.c                  |  41 +++++++-
 tools/perf/util/evsel.h                  |   2 +
 tools/perf/util/off_cpu.h                |   3 +-
 tools/perf/util/record.h                 |   1 +
 10 files changed, 323 insertions(+), 57 deletions(-)

-- 
2.45.2



* [PATCH v15 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 02/10] perf record --off-cpu: Parse off-cpu event Howard Chu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Expose evsel__is_offcpu_event() so it can be used in
off_cpu_config(), evsel__parse_sample(), and perf script.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-3-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/evsel.c | 2 +-
 tools/perf/util/evsel.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 4a0ef095db92..d5519ab25996 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1261,7 +1261,7 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
 	}
 }
 
-static bool evsel__is_offcpu_event(struct evsel *evsel)
+bool evsel__is_offcpu_event(struct evsel *evsel)
 {
 	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
 }
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index aae431d63d64..e58de60210a4 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -557,4 +557,6 @@ u64 evsel__bitfield_swap_branch_flags(u64 value);
 void evsel__set_config_if_unset(struct perf_pmu *pmu, struct evsel *evsel,
 				const char *config_name, u64 val);
 
+bool evsel__is_offcpu_event(struct evsel *evsel);
+
 #endif /* __PERF_EVSEL_H */
-- 
2.45.2



* [PATCH v15 02/10] perf record --off-cpu: Parse off-cpu event
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
  2025-02-13 23:00 ` [PATCH v15 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 03/10] perf record --off-cpu: Preparation of off-cpu BPF program Howard Chu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Parse the off-cpu event as a bpf-output event using parse_event().

Call evlist__enable_evsel() on the off-cpu event. This fixes the
inability to collect direct off-cpu samples when recording a workload,
as reported by Arnaldo Carvalho de Melo <acme@redhat.com>. When
recording a workload, perf sets enable_on_exec instead of calling
evlist__enable(). The off-cpu event is not attached to an executable,
so execve is never called for it and the fds from perf_event_open()
are never enabled.

no-inherit must be set to 1, for the following reason:

We update the BPF perf_event map for direct off-cpu sample dumping (in
the following patches). The update executes this call chain:

bpf_map_update_value()
 bpf_fd_array_map_update_elem()
  perf_event_fd_array_get_ptr()
   perf_event_read_local()

In perf_event_read_local(), there is:

int perf_event_read_local(struct perf_event *event, u64 *value,
			  u64 *enabled, u64 *running)
{
...
	/*
	 * It must not be an event with inherit set, we cannot read
	 * all child counters from atomic context.
	 */
	if (event->attr.inherit) {
		ret = -EOPNOTSUPP;
		goto out;
	}

This means no-inherit has to be true for the update of the BPF
perf_event map to succeed.
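
The constraint can be modeled in user space. This is a minimal sketch,
not kernel code: struct model_event and model_read_local() are
hypothetical stand-ins that mirror only the single check quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the attribute bit the check reads. */
struct model_event {
	unsigned int inherit : 1;
	uint64_t count;
};

/*
 * Model of the quoted check: an event with inherit set is rejected with
 * -EOPNOTSUPP, otherwise the current count is returned through 'value'.
 */
static int model_read_local(const struct model_event *event, uint64_t *value)
{
	assert(event && value);

	if (event->inherit)
		return -EOPNOTSUPP;

	*value = event->count;
	return 0;
}
```

In this model an event created with inherit = 1 can never be read, which
is why the off-cpu evsel has to clear the bit before its fds are stored
in the map.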

Moreover, for bpf-output events, we primarily want a system-wide event
instead of a per-task event.

The reason is that bpf_perf_event_output() uses the CPU index to look up
the perf_event file descriptor it writes to.

Making the bpf-output event system-wide naturally satisfies this
requirement, because the event is then mapped to the CPUs appropriately.
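
The per-CPU lookup can be sketched as a small user-space model. All
names here (model_event_array, model_array_set(), model_output_on_cpu())
and the error values are assumptions made for illustration; the real
lookup happens inside the kernel.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAX_CPUS 8	/* illustrative size, not the value used by perf */

/* Hypothetical model of a perf event array: one fd slot per CPU, -1 if unset. */
struct model_event_array {
	int fd[MODEL_MAX_CPUS];
};

static void model_array_init(struct model_event_array *map)
{
	for (int cpu = 0; cpu < MODEL_MAX_CPUS; cpu++)
		map->fd[cpu] = -1;
}

/* Store the fd opened for 'cpu' in the slot indexed by that CPU number. */
static int model_array_set(struct model_event_array *map, int cpu, int fd)
{
	if (cpu < 0 || cpu >= MODEL_MAX_CPUS)
		return -E2BIG;
	map->fd[cpu] = fd;
	return 0;
}

/* Output from 'cpu' only works if the slot for that CPU was populated. */
static int model_output_on_cpu(const struct model_event_array *map, int cpu)
{
	assert(map);

	if (cpu < 0 || cpu >= MODEL_MAX_CPUS)
		return -E2BIG;
	if (map->fd[cpu] < 0)
		return -ENOENT;
	return map->fd[cpu];
}
```

A per-task event that populates only some slots leaves the other CPUs
without a target in this model, while a system-wide event fills a slot
for every CPU it is opened on.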

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-4-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-record.c   |  7 +++++++
 tools/perf/util/bpf_off_cpu.c | 33 +++++++++++----------------------
 tools/perf/util/evsel.c       |  4 +++-
 3 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index cda7e6a7b45d..f3e5f856f4a4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -2564,6 +2564,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	if (!target__none(&opts->target) && !opts->target.initial_delay)
 		evlist__enable(rec->evlist);
 
+	/*
+	 * offcpu-time does not call execve, so enable_on_exec wouldn't work
+	 * when recording a workload, do it manually
+	 */
+	if (rec->off_cpu)
+		evlist__enable_evsel(rec->evlist, (char *)OFFCPU_EVENT);
+
 	/*
 	 * Let the child rip
 	 */
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 4269b41d1771..2101aa2b7c42 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -38,32 +38,21 @@ union off_cpu_data {
 
 static int off_cpu_config(struct evlist *evlist)
 {
+	char off_cpu_event[64];
 	struct evsel *evsel;
-	struct perf_event_attr attr = {
-		.type	= PERF_TYPE_SOFTWARE,
-		.config = PERF_COUNT_SW_BPF_OUTPUT,
-		.size	= sizeof(attr), /* to capture ABI version */
-	};
-	char *evname = strdup(OFFCPU_EVENT);
-
-	if (evname == NULL)
-		return -ENOMEM;
 
-	evsel = evsel__new(&attr);
-	if (!evsel) {
-		free(evname);
-		return -ENOMEM;
+	scnprintf(off_cpu_event, sizeof(off_cpu_event), "bpf-output/name=%s/", OFFCPU_EVENT);
+	if (parse_event(evlist, off_cpu_event)) {
+		pr_err("Failed to open off-cpu event\n");
+		return -1;
 	}
 
-	evsel->core.attr.freq = 1;
-	evsel->core.attr.sample_period = 1;
-	/* off-cpu analysis depends on stack trace */
-	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
-
-	evlist__add(evlist, evsel);
-
-	free(evsel->name);
-	evsel->name = evname;
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel__is_offcpu_event(evsel)) {
+			evsel->core.system_wide = true;
+			break;
+		}
+	}
 
 	return 0;
 }
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d5519ab25996..f45d4b44d70d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1540,8 +1540,10 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	if (evsel__is_dummy_event(evsel))
 		evsel__reset_sample_bit(evsel, BRANCH_STACK);
 
-	if (evsel__is_offcpu_event(evsel))
+	if (evsel__is_offcpu_event(evsel)) {
 		evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
+		attr->inherit = 0;
+	}
 
 	arch__post_evsel_config(evsel, attr);
 }
-- 
2.45.2



* [PATCH v15 03/10] perf record --off-cpu: Preparation of off-cpu BPF program
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
  2025-02-13 23:00 ` [PATCH v15 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use Howard Chu
  2025-02-13 23:00 ` [PATCH v15 02/10] perf record --off-cpu: Parse off-cpu event Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 04/10] perf record --off-cpu: Dump off-cpu samples in BPF Howard Chu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Set the perf_event map in BPF for dumping off-cpu samples.

Set the offcpu_thresh to specify the threshold.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-5-howardchu95@gmail.com
[ Added some missing iteration variables to off_cpu_config() and fixed up
  a manually edited patch hunk line boundary line ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/bpf_off_cpu.c          | 23 +++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c |  9 +++++++++
 2 files changed, 32 insertions(+)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 2101aa2b7c42..de71ff7a80d0 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -13,6 +13,7 @@
 #include "util/cgroup.h"
 #include "util/strlist.h"
 #include <bpf/bpf.h>
+#include <internal/xyarray.h>
 
 #include "bpf_skel/off_cpu.skel.h"
 
@@ -60,6 +61,9 @@ static int off_cpu_config(struct evlist *evlist)
 static void off_cpu_start(void *arg)
 {
 	struct evlist *evlist = arg;
+	struct evsel *evsel;
+	struct perf_cpu pcpu;
+	int i;
 
 	/* update task filter for the given workload */
 	if (skel->rodata->has_task && skel->rodata->uses_tgid &&
@@ -73,6 +77,25 @@ static void off_cpu_start(void *arg)
 		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
 	}
 
+	/* update BPF perf_event map */
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT);
+		return;
+	}
+
+	perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
+		int err;
+
+		err = bpf_map__update_elem(skel->maps.offcpu_output, &pcpu.cpu, sizeof(__u32),
+					   xyarray__entry(evsel->core.fd, i, 0),
+					   sizeof(__u32), BPF_ANY);
+		if (err) {
+			pr_err("Failed to update perf event map for direct off-cpu dumping\n");
+			return;
+		}
+	}
+
 	skel->bss->enabled = 1;
 }
 
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index c152116df72f..1cdd4d63ea92 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -18,6 +18,8 @@
 #define MAX_STACKS   32
 #define MAX_ENTRIES  102400
 
+#define MAX_CPUS  4096
+
 struct tstamp_data {
 	__u32 stack_id;
 	__u32 state;
@@ -39,6 +41,13 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } stacks SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, MAX_CPUS);
+} offcpu_output SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
-- 
2.45.2



* [PATCH v15 04/10] perf record --off-cpu: Dump off-cpu samples in BPF
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (2 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 03/10] perf record --off-cpu: Preparation of off-cpu BPF program Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 05/10] perf evsel: Assemble offcpu samples Howard Chu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Collect the tid, period, callchain, and cgroup id, and dump them when
the off-cpu time threshold is reached.
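
The layout of the dumped payload can be sketched in user space. This is
an illustration only: model_build_payload() and MODEL_CONTEXT_USER are
hypothetical names, and the function takes plain arguments instead of
the BPF structures, but the word order follows off_cpu_dump() in this
patch. It also shows where MAX_OFFCPU_LEN = 37 comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STACKS	32
/* tid/pid + period + callchain nr + context marker + MAX_STACKS ips + cgroup id */
#define MAX_OFFCPU_LEN	(2 + 1 + 1 + MAX_STACKS + 1)
/* Same value as PERF_CONTEXT_USER in the perf UAPI header. */
#define MODEL_CONTEXT_USER ((uint64_t)-512)

_Static_assert(MAX_OFFCPU_LEN == 37, "worst-case payload is 37 u64 words");

/*
 * Fill 'out' (MAX_OFFCPU_LEN words) and return the number of words used.
 * 'stack' must hold MAX_STACKS entries; a zero entry ends the stack early.
 */
static size_t model_build_payload(uint64_t *out, uint32_t tgid, uint32_t pid,
				  uint64_t delta, const uint64_t *stack,
				  uint64_t cgroup_id)
{
	size_t n = 0, len = 0;

	assert(out && stack);

	out[n++] = (uint64_t)tgid << 32 | pid;
	out[n++] = delta;

	/* out[n] is the callchain length, filled in once the stack is copied */
	out[n + 1] = MODEL_CONTEXT_USER;
	for (size_t i = 0; i < MAX_STACKS && stack[i]; i++, len++)
		out[n + 2 + i] = stack[i];

	out[n] = len + 1;	/* copied ips plus the context marker */
	n += len + 2;

	out[n++] = cgroup_id;
	return n;
}
```

With a two-frame stack the payload uses 7 words; with a full stack of
MAX_STACKS frames it uses exactly MAX_OFFCPU_LEN words.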

The off-cpu time (the delta) is never counted twice: it goes either
into a direct sample or into the accumulated samples that are dumped at
the end of perf.data.
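
The either-or accounting can be sketched as follows. This is a
user-space model under stated assumptions: struct model_total and
model_account() are hypothetical, and a single counter stands in for
one entry of the off_cpu map. The default threshold is the value this
patch gives offcpu_thresh_ns.

```c
#include <assert.h>
#include <stdint.h>

/* Default from this patch: 500000000 ns, i.e. 500 ms. */
#define DEFAULT_THRESH_NS 500000000ull

enum model_action { MODEL_DUMP_DIRECT, MODEL_ACCUMULATE };

/* Hypothetical stand-in for one off_cpu map entry plus a direct-sample count. */
struct model_total {
	uint64_t total_ns;
	unsigned int direct_samples;
};

/*
 * A delta at or above the threshold becomes a direct sample and is not
 * added to the accumulated total; a shorter delta is only accumulated.
 */
static enum model_action model_account(struct model_total *t, uint64_t delta_ns,
				       uint64_t thresh_ns)
{
	assert(t);

	if (delta_ns >= thresh_ns) {
		t->direct_samples++;
		return MODEL_DUMP_DIRECT;
	}

	t->total_ns += delta_ns;
	return MODEL_ACCUMULATE;
}
```

Because the two branches are mutually exclusive, the sum of the direct
sample periods and the accumulated totals covers each delta exactly once.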

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-6-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 89 ++++++++++++++++++++++++--
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 1cdd4d63ea92..c15b69586723 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -19,11 +19,18 @@
 #define MAX_ENTRIES  102400
 
 #define MAX_CPUS  4096
+#define MAX_OFFCPU_LEN 37
+
+// We have a 'struct stack' in vmlinux.h when building with GEN_VMLINUX_H=1
+struct __stack {
+	u64 array[MAX_STACKS];
+};
 
 struct tstamp_data {
 	__u32 stack_id;
 	__u32 state;
 	__u64 timestamp;
+	struct __stack stack;
 };
 
 struct offcpu_key {
@@ -41,6 +48,10 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } stacks SEC(".maps");
 
+struct offcpu_data {
+	u64 array[MAX_OFFCPU_LEN];
+};
+
 struct {
 	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
 	__uint(key_size, sizeof(__u32));
@@ -48,6 +59,13 @@ struct {
 	__uint(max_entries, MAX_CPUS);
 } offcpu_output SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct offcpu_data));
+	__uint(max_entries, 1);
+} offcpu_payload SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
@@ -106,6 +124,8 @@ const volatile bool uses_cgroup_v1 = false;
 
 int perf_subsys_id = -1;
 
+__u64 offcpu_thresh_ns = 500000000ull;
+
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
  * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
@@ -192,6 +212,47 @@ static inline int can_record(struct task_struct *t, int state)
 	return 1;
 }
 
+static inline int copy_stack(struct __stack *from, struct offcpu_data *to, int n)
+{
+	int len = 0;
+
+	for (int i = 0; i < MAX_STACKS && from->array[i]; ++i, ++len)
+		to->array[n + 2 + i] = from->array[i];
+
+	return len;
+}
+
+/**
+ * off_cpu_dump - dump off-cpu samples to ring buffer
+ * @data: payload for dumping off-cpu samples
+ * @key: off-cpu data
+ * @stack: stack trace of the task before being scheduled out
+ *
+ * If the threshold of off-cpu time is reached, acquire tid, period, callchain, and cgroup id
+ * information of the task, and dump it as a raw sample to perf ring buffer
+ */
+static int off_cpu_dump(void *ctx, struct offcpu_data *data, struct offcpu_key *key,
+			struct __stack *stack, __u64 delta)
+{
+	int n = 0, len = 0;
+
+	data->array[n++] = (u64)key->tgid << 32 | key->pid;
+	data->array[n++] = delta;
+
+	/* data->array[n] is callchain->nr (updated later) */
+	data->array[n + 1] = PERF_CONTEXT_USER;
+	data->array[n + 2] = 0;
+	len = copy_stack(stack, data, n);
+
+	/* update length of callchain */
+	data->array[n] = len + 1;
+	n += len + 2;
+
+	data->array[n++] = key->cgroup_id;
+
+	return bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU, data, n * sizeof(u64));
+}
+
 static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 			struct task_struct *next, int state)
 {
@@ -216,6 +277,16 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 	pelem->state = state;
 	pelem->stack_id = stack_id;
 
+	/*
+	 * If stacks are successfully collected by bpf_get_stackid(), collect them once more
+	 * in task_storage for direct off-cpu sample dumping
+	 */
+	if (stack_id > 0 && bpf_get_stack(ctx, &pelem->stack, MAX_STACKS * sizeof(u64), BPF_F_USER_STACK)) {
+		/*
+		 * This empty if block is used to avoid 'result unused warning' from bpf_get_stack().
+		 * If the collection fails, continue with the logic for the next task.
+		 */
+	}
 next:
 	pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
 
@@ -230,11 +301,19 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 		__u64 delta = ts - pelem->timestamp;
 		__u64 *total;
 
-		total = bpf_map_lookup_elem(&off_cpu, &key);
-		if (total)
-			*total += delta;
-		else
-			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+		if (delta >= offcpu_thresh_ns) {
+			int zero = 0;
+			struct offcpu_data *data = bpf_map_lookup_elem(&offcpu_payload, &zero);
+
+			if (data)
+				off_cpu_dump(ctx, data, &key, &pelem->stack, delta);
+		} else {
+			total = bpf_map_lookup_elem(&off_cpu, &key);
+			if (total)
+				*total += delta;
+			else
+				bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+		}
 
 		/* prevent to reuse the timestamp later */
 		pelem->timestamp = 0;
-- 
2.45.2



* [PATCH v15 05/10] perf evsel: Assemble offcpu samples
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (3 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 04/10] perf record --off-cpu: Dump off-cpu samples in BPF Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 06/10] perf record --off-cpu: Disable perf_event's callchain collection Howard Chu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Use the data in bpf-output samples to assemble off-cpu samples. In
evsel__is_offcpu_event(), check whether sample_type includes
PERF_SAMPLE_RAW, so that off-cpu sample data created by an older
version of perf is still supported.
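
How the raw payload is taken apart can be sketched in user space. This
is a simplified model, not the perf implementation: struct model_sample
and model_parse_payload() are hypothetical, explicit size checks replace
the OVERFLOW_CHECK macros, and reporting ip = 0 when no frame follows
the context marker is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the fields filled in for an off-cpu sample. */
struct model_sample {
	uint32_t pid, tid;
	uint64_t period, ip, cgroup, nr;
	const uint64_t *ips;	/* ips[0] is the context marker */
};

static int model_parse_payload(const uint64_t *raw, size_t raw_size,
			       struct model_sample *s)
{
	size_t words = raw_size / sizeof(uint64_t), pos = 0;
	uint32_t halves[2];

	assert(s);
	if (raw == NULL || words < 3)
		return -EFAULT;

	/* first word: two 32-bit halves, read in memory order */
	memcpy(halves, &raw[pos++], sizeof(halves));
	s->pid = halves[0];
	s->tid = halves[1];

	s->period = raw[pos++];
	s->nr = raw[pos++];

	/* nr callchain words must be followed by one word for the cgroup id */
	if (words - pos < 1 || s->nr > words - pos - 1)
		return -EFAULT;

	s->ips = &raw[pos];
	s->ip = s->nr >= 2 ? s->ips[1] : 0;
	pos += s->nr;

	s->cgroup = raw[pos];
	return 0;
}
```

A payload whose callchain length points past the end of the raw data is
rejected instead of being read out of bounds.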

Testing compatibility with off-cpu samples collected by perf before this patch series:

As shown below, the sample_type still includes PERF_SAMPLE_CALLCHAIN:

$ perf script --header -i ./perf.data.ptn | grep "event : name = offcpu-time"
 # event : name = offcpu-time, , id = { 237917, 237918, 237919, 237920 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1

The output is correct.

$ perf script -i ./perf.data.ptn | grep offcpu-time
gmain    2173 [000] 18446744069.414584:  100102015 offcpu-time:
NetworkManager     901 [000] 18446744069.414584:    5603579 offcpu-time:
Web Content 1183550 [000] 18446744069.414584:      46278 offcpu-time:
gnome-control-c 2200559 [000] 18446744069.414584: 11998247014 offcpu-time:

And after this patch series:

$ perf script --header -i ./perf.data.off-cpu-v9 | grep "event : name = offcpu-time"
 # event : name = offcpu-time, , id = { 237959, 237960, 237961, 237962 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1

perf $ ./perf script -i ./perf.data.off-cpu-v9 | grep offcpu-time
     gnome-shell    1875 [001] 4789616.361225:  100097057 offcpu-time:
     gnome-shell    1875 [001] 4789616.461419:  100107463 offcpu-time:
         firefox 2206821 [002] 4789616.475690:  255257245 offcpu-time:

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-7-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/evsel.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f45d4b44d70d..fcd49f407f52 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1263,7 +1263,8 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
 
 bool evsel__is_offcpu_event(struct evsel *evsel)
 {
-	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
+	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT) &&
+	       evsel->core.attr.sample_type & PERF_SAMPLE_RAW;
 }
 
 /*
@@ -2933,6 +2934,35 @@ static inline bool evsel__has_branch_counters(const struct evsel *evsel)
 	return false;
 }
 
+static int __set_offcpu_sample(struct perf_sample *data)
+{
+	u64 *array = data->raw_data;
+	u32 max_size = data->raw_size, *p32;
+	const void *endp = (void *)array + max_size;
+
+	if (array == NULL)
+		return -EFAULT;
+
+	OVERFLOW_CHECK_u64(array);
+	p32 = (void *)array++;
+	data->pid = p32[0];
+	data->tid = p32[1];
+
+	OVERFLOW_CHECK_u64(array);
+	data->period = *array++;
+
+	OVERFLOW_CHECK_u64(array);
+	data->callchain = (struct ip_callchain *)array++;
+	OVERFLOW_CHECK(array, data->callchain->nr * sizeof(u64), max_size);
+	data->ip = data->callchain->ips[1];
+	array += data->callchain->nr;
+
+	OVERFLOW_CHECK_u64(array);
+	data->cgroup = *array;
+
+	return 0;
+}
+
 int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 			struct perf_sample *data)
 {
@@ -3287,6 +3317,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		array = (void *)array + sz;
 	}
 
+	if (evsel__is_offcpu_event(evsel))
+		return __set_offcpu_sample(data);
+
 	return 0;
 }
 
-- 
2.45.2



* [PATCH v15 06/10] perf record --off-cpu: Disable perf_event's callchain collection
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (4 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 05/10] perf evsel: Assemble offcpu samples Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 07/10] perf script: Display off-cpu samples correctly Howard Chu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

There is a check in evsel.c that does this:

if (evsel__is_offcpu_event(evsel))
	evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;

This along with:

 #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
			      PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
			      PERF_SAMPLE_CGROUP)

will tell perf_event to collect the callchain.

We don't need the callchain from perf_event when collecting off-cpu
samples, because it's prev's callchain, not next's callchain.

   (perf_event)     (task_storage) (needed)
   prev             next
   |                  |
   ---sched_switch---->
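
As a rough illustration of the masking described above (a standalone
sketch, not part of this patch): the bit values are copied from the
perf_event UAPI header, while OFFCPU_COMMON_TYPES, the _OLD/_NEW macro
names and offcpu_mask_sample_type() are hypothetical names introduced
only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions copied from the perf_event UAPI header so the sketch builds standalone. */
enum {
	PERF_SAMPLE_IP         = 1 << 0,
	PERF_SAMPLE_TID        = 1 << 1,
	PERF_SAMPLE_TIME       = 1 << 2,
	PERF_SAMPLE_CALLCHAIN  = 1 << 5,
	PERF_SAMPLE_ID         = 1 << 6,
	PERF_SAMPLE_CPU        = 1 << 7,
	PERF_SAMPLE_PERIOD     = 1 << 8,
	PERF_SAMPLE_RAW        = 1 << 10,
	PERF_SAMPLE_IDENTIFIER = 1 << 16,
	PERF_SAMPLE_CGROUP     = 1 << 21,
};

/* Hypothetical helper macro: the bits shared by the old and new definitions. */
#define OFFCPU_COMMON_TYPES (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
			     PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
			     PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
			     PERF_SAMPLE_PERIOD | PERF_SAMPLE_CGROUP)

/* Before this patch: the callchain bit survives the mask. */
#define OFFCPU_SAMPLE_TYPES_OLD (OFFCPU_COMMON_TYPES | PERF_SAMPLE_CALLCHAIN)
/* After this patch: the raw bit survives instead. */
#define OFFCPU_SAMPLE_TYPES_NEW (OFFCPU_COMMON_TYPES | PERF_SAMPLE_RAW)

static_assert(!(OFFCPU_SAMPLE_TYPES_NEW & PERF_SAMPLE_CALLCHAIN),
	      "the new mask must not request a perf_event callchain");

/* Mirrors `sample_type &= OFFCPU_SAMPLE_TYPES`: only bits in `allowed` are kept. */
uint64_t offcpu_mask_sample_type(uint64_t requested, uint64_t allowed)
{
	return requested & allowed;
}
```

Note that the mask can only clear bits, so PERF_SAMPLE_RAW has to be
set on the event already for it to remain after masking.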

Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-8-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/off_cpu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2dd67c60f211..2a4b7f9b2c4c 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -13,7 +13,7 @@ struct record_opts;
 #define OFFCPU_SAMPLE_TYPES  (PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_IP | \
 			      PERF_SAMPLE_TID | PERF_SAMPLE_TIME | \
 			      PERF_SAMPLE_ID | PERF_SAMPLE_CPU | \
-			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_CALLCHAIN | \
+			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
 			      PERF_SAMPLE_CGROUP)
 
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v15 07/10] perf script: Display off-cpu samples correctly
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (5 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 06/10] perf record --off-cpu: Disable perf_event's callchain collection Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map Howard Chu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Off-cpu samples no longer have PERF_SAMPLE_CALLCHAIN in sample_type, but
perf script still needs to display a callchain for them, so enable
callchain output for off-cpu events explicitly.

Also, prefer displaying a callchain:

 gvfs-afc-volume    2267 [001] 3829232.955656: 1001115340 offcpu-time:
            77f05292603f __pselect+0xbf (/usr/lib/x86_64-linux-gnu/libc.so.6)
            77f052a1801c [unknown] (/usr/lib/x86_64-linux-gnu/libusbmuxd-2.0.so.6.0.0)
            77f052a18d45 [unknown] (/usr/lib/x86_64-linux-gnu/libusbmuxd-2.0.so.6.0.0)
            77f05289ca94 start_thread+0x384 (/usr/lib/x86_64-linux-gnu/libc.so.6)
            77f052929c3c clone3+0x2c (/usr/lib/x86_64-linux-gnu/libc.so.6)

to a raw binary BPF output:

BPF output: 0000: dd 08 00 00 db 08 00 00  <DD>...<DB>...
	  0008: cc ce ab 3b 00 00 00 00  <CC>Ϋ;....
	  0010: 06 00 00 00 00 00 00 00  ........
	  0018: 00 fe ff ff ff ff ff ff  .<FE><FF><FF><FF><FF><FF><FF>
	  0020: 3f 60 92 52 f0 77 00 00  ?`.R<F0>w..
	  0028: 1c 80 a1 52 f0 77 00 00  ..<A1>R<F0>w..
	  0030: 45 8d a1 52 f0 77 00 00  E.<A1>R<F0>w..
	  0038: 94 ca 89 52 f0 77 00 00  .<CA>.R<F0>w..
	  0040: 3c 9c 92 52 f0 77 00 00  <..R<F0>w..
	  0048: 00 00 00 00 00 00 00 00  ........
	  0050: 00 00 00 00              ....
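
For readers who want to map the dump above onto the callchain, here is
a minimal decoding sketch. It assumes the layout that
__set_offcpu_sample() walks (two u32 ids, period, callchain nr,
callchain entries, cgroup) and a little-endian recording host; struct
offcpu_raw, offcpu_raw_decode() and offcpu_raw_ip() are hypothetical
names used only here, not perf APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields; not a perf structure. */
struct offcpu_raw {
	uint32_t pid, tid;
	uint64_t period;     /* off-cpu time in nanoseconds */
	uint64_t nr;         /* callchain entries, context marker included */
	const uint8_t *ips;  /* first callchain entry inside the buffer */
	uint64_t cgroup;
};

static uint32_t le32(const uint8_t *p)
{
	return p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Returns 0 on success, -1 if the buffer is too short for what it claims to hold. */
int offcpu_raw_decode(const uint8_t *buf, size_t len, struct offcpu_raw *out)
{
	size_t off = 0;

	assert(buf && out);
	if (len < 3 * 8)
		return -1;
	out->pid = le32(buf);
	out->tid = le32(buf + 4);
	off += 8;
	out->period = le64(buf + off);
	off += 8;
	out->nr = le64(buf + off);
	off += 8;
	if (out->nr > (len - off) / 8)
		return -1;
	out->ips = buf + off;
	off += out->nr * 8;
	if (len - off < 8)
		return -1;
	out->cgroup = le64(buf + off);
	return 0;
}

/* i-th callchain entry; entry 0 is the context marker in the dump above. */
uint64_t offcpu_raw_ip(const struct offcpu_raw *r, uint64_t i)
{
	return le64(r->ips + i * 8);
}

/* The 84 bytes of the hex dump above. */
const uint8_t offcpu_example[84] = {
	0xdd, 0x08, 0x00, 0x00, 0xdb, 0x08, 0x00, 0x00,
	0xcc, 0xce, 0xab, 0x3b, 0x00, 0x00, 0x00, 0x00,
	0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0xfe, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
	0x3f, 0x60, 0x92, 0x52, 0xf0, 0x77, 0x00, 0x00,
	0x1c, 0x80, 0xa1, 0x52, 0xf0, 0x77, 0x00, 0x00,
	0x45, 0x8d, 0xa1, 0x52, 0xf0, 0x77, 0x00, 0x00,
	0x94, 0xca, 0x89, 0x52, 0xf0, 0x77, 0x00, 0x00,
	0x3c, 0x9c, 0x92, 0x52, 0xf0, 0x77, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,
};
```

Decoding offcpu_example this way gives tid 2267, period 1001115340 and
a first user frame of 0x77f05292603f, matching the perf script line
shown earlier.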

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-9-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-script.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index d797cec4f054..16b500b23417 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -670,7 +670,7 @@ static int perf_session__check_output_opt(struct perf_session *session)
 
 		evlist__for_each_entry(session->evlist, evsel) {
 			not_pipe = true;
-			if (evsel__has_callchain(evsel)) {
+			if (evsel__has_callchain(evsel) || evsel__is_offcpu_event(evsel)) {
 				use_callchain = true;
 				break;
 			}
@@ -2274,7 +2274,7 @@ static void process_event(struct perf_script *script,
 	else if (PRINT_FIELD(BRSTACKOFF))
 		perf_sample__fprintf_brstackoff(sample, thread, evsel, fp);
 
-	if (evsel__is_bpf_output(evsel) && PRINT_FIELD(BPF_OUTPUT))
+	if (evsel__is_bpf_output(evsel) && !evsel__is_offcpu_event(evsel) && PRINT_FIELD(BPF_OUTPUT))
 		perf_sample__fprintf_bpf_output(sample, fp);
 	perf_sample__fprintf_insn(sample, evsel, attr, thread, machine, fp, al);
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v15 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (6 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 07/10] perf script: Display off-cpu samples correctly Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 09/10] perf record --off-cpu: Add --off-cpu-thresh option Howard Chu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Dump the remaining samples in the same format as a direct sample.

Put the stack trace, tid, off-cpu time and cgroup id into the raw_data
section, just like a direct off-cpu sample coming from BPF's
bpf_perf_event_output().

This ensures that evsel__parse_sample() correctly parses both direct
samples and accumulated samples.
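
The packing can be sketched roughly as below. This is a simplified
standalone version, not the code in off_cpu_write(): SKETCH_MAX_STACKS
(assumed to be 32), SKETCH_CONTEXT_USER, offcpu_raw_pack() and
offcpu_raw_size() are hypothetical names, and the stack is passed in
as an array instead of being looked up in the BPF map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_STACKS   32                  /* assumed stack depth limit */
#define SKETCH_CONTEXT_USER ((uint64_t)-512)    /* value of PERF_CONTEXT_USER */

/*
 * Fill `raw`, which must hold at least depth + 5 u64 words, and return
 * the number of words written:
 *   [pid:tgid][period][nr][context marker][frame 0..depth-1][cgroup id]
 */
size_t offcpu_raw_pack(uint64_t *raw, uint32_t pid, uint32_t tgid,
		       uint64_t period, const uint64_t *stack, size_t depth,
		       uint64_t cgroup_id)
{
	size_t i = 0;

	assert(raw && depth <= SKETCH_MAX_STACKS);

	raw[i++] = (uint64_t)pid << 32 | tgid;
	raw[i++] = period;
	raw[i++] = depth + 1;             /* callchain->nr counts the marker too */
	raw[i++] = SKETCH_CONTEXT_USER;
	if (depth)
		memcpy(&raw[i], stack, depth * sizeof(uint64_t));
	i += depth;
	raw[i++] = cgroup_id;

	return i;
}

/*
 * Value stored in the u32 size field that precedes the payload: the
 * payload plus 4 bytes of padding, so that size field and payload
 * together stay a multiple of 8 bytes.
 */
uint32_t offcpu_raw_size(size_t words)
{
	return (uint32_t)(words * sizeof(uint64_t) + sizeof(uint32_t));
}
```

For a five-frame stack this yields 10 words and a raw size of 84
bytes, which is the length of the BPF output dump quoted in the
perf script patch of this series.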

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-10-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/bpf_off_cpu.c | 59 +++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index de71ff7a80d0..e55693bcbf08 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -37,6 +37,8 @@ union off_cpu_data {
 	u64 array[1024 / sizeof(u64)];
 };
 
+u64 off_cpu_raw[MAX_STACKS + 5];
+
 static int off_cpu_config(struct evlist *evlist)
 {
 	char off_cpu_event[64];
@@ -312,6 +314,7 @@ int off_cpu_write(struct perf_session *session)
 {
 	int bytes = 0, size;
 	int fd, stack;
+	u32 raw_size;
 	u64 sample_type, val, sid = 0;
 	struct evsel *evsel;
 	struct perf_data_file *file = &session->data->file;
@@ -351,46 +354,54 @@ int off_cpu_write(struct perf_session *session)
 
 	while (!bpf_map_get_next_key(fd, &prev, &key)) {
 		int n = 1;  /* start from perf_event_header */
-		int ip_pos = -1;
 
 		bpf_map_lookup_elem(fd, &key, &val);
 
+		/* zero-fill some of the fields, will be overwritten by raw_data when parsing */
 		if (sample_type & PERF_SAMPLE_IDENTIFIER)
 			data.array[n++] = sid;
-		if (sample_type & PERF_SAMPLE_IP) {
-			ip_pos = n;
+		if (sample_type & PERF_SAMPLE_IP)
 			data.array[n++] = 0;  /* will be updated */
-		}
 		if (sample_type & PERF_SAMPLE_TID)
-			data.array[n++] = (u64)key.pid << 32 | key.tgid;
+			data.array[n++] = 0;
 		if (sample_type & PERF_SAMPLE_TIME)
 			data.array[n++] = tstamp;
-		if (sample_type & PERF_SAMPLE_ID)
-			data.array[n++] = sid;
 		if (sample_type & PERF_SAMPLE_CPU)
 			data.array[n++] = 0;
 		if (sample_type & PERF_SAMPLE_PERIOD)
-			data.array[n++] = val;
-		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
-			int len = 0;
-
-			/* data.array[n] is callchain->nr (updated later) */
-			data.array[n + 1] = PERF_CONTEXT_USER;
-			data.array[n + 2] = 0;
-
-			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
-			while (data.array[n + 2 + len])
+			data.array[n++] = 0;
+		if (sample_type & PERF_SAMPLE_RAW) {
+			/*
+			 *  [ size ][ data ]
+			 *  [     data     ]
+			 *  [     data     ]
+			 *  [     data     ]
+			 *  [ data ][ empty]
+			 */
+			int len = 0, i = 0;
+			void *raw_data = (void *)data.array + n * sizeof(u64);
+
+			off_cpu_raw[i++] = (u64)key.pid << 32 | key.tgid;
+			off_cpu_raw[i++] = val;
+
+			/* off_cpu_raw[i] is callchain->nr (updated later) */
+			off_cpu_raw[i + 1] = PERF_CONTEXT_USER;
+			off_cpu_raw[i + 2] = 0;
+
+			bpf_map_lookup_elem(stack, &key.stack_id, &off_cpu_raw[i + 2]);
+			while (off_cpu_raw[i + 2 + len])
 				len++;
 
-			/* update length of callchain */
-			data.array[n] = len + 1;
+			off_cpu_raw[i] = len + 1;
+			i += len + 2;
+
+			off_cpu_raw[i++] = key.cgroup_id;
 
-			/* update sample ip with the first callchain entry */
-			if (ip_pos >= 0)
-				data.array[ip_pos] = data.array[n + 2];
+			raw_size = i * sizeof(u64) + sizeof(u32); /* 4 bytes for alignment */
+			memcpy(raw_data, &raw_size, sizeof(raw_size));
+			memcpy(raw_data + sizeof(u32), off_cpu_raw, i * sizeof(u64));
 
-			/* calculate sample callchain data array length */
-			n += len + 2;
+			n += i + 1;
 		}
 		if (sample_type & PERF_SAMPLE_CGROUP)
 			data.array[n++] = key.cgroup_id;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v15 09/10] perf record --off-cpu: Add --off-cpu-thresh option
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (7 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:00 ` [PATCH v15 10/10] perf test: Add direct off-cpu test Howard Chu
  2025-02-28 16:36 ` [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Gautam Menghani
  10 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Ingo Molnar, James Clark, Kan Liang,
	Mark Rutland, Peter Zijlstra

Specify the threshold for dumping off-cpu samples with --off-cpu-thresh;
the unit is milliseconds. The default value is 500ms.

Example:

  perf record --off-cpu --off-cpu-thresh 824

The example above collects direct off-cpu samples where the off-cpu time
is longer than 824ms.
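
The millisecond-to-nanosecond handling can be tried outside of perf
with a sketch like the following. It mirrors the validation done in
record__parse_off_cpu_thresh() but is not that function:
offcpu_thresh_parse_ms() and SKETCH_NSEC_PER_MSEC are hypothetical
names, and the result is written through a plain pointer instead of
struct record_opts.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_NSEC_PER_MSEC 1000000ULL

/*
 * Parse a decimal millisecond string and store the threshold in
 * nanoseconds. Returns 0 on success or -EINVAL on a malformed string.
 */
int offcpu_thresh_parse_ms(const char *str, uint64_t *thresh_ns)
{
	char *endptr;
	unsigned long long ms;

	assert(thresh_ns);
	if (!str)
		return -EINVAL;

	ms = strtoull(str, &endptr, 10);

	/* Reject trailing characters, and a zero result from anything but "0". */
	if (*endptr || (ms == 0 && strcmp(str, "0")))
		return -EINVAL;

	*thresh_ns = ms * SKETCH_NSEC_PER_MSEC;
	return 0;
}
```

With this sketch "824" becomes 824000000ns, while "abc", "12ms" and the
empty string are rejected. Like the check it mirrors, it does not guard
against overflow of the multiplication.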

Suggested-by: Ian Rogers <irogers@google.com>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-2-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-record.txt |  9 ++++++++
 tools/perf/builtin-record.c              | 26 ++++++++++++++++++++++++
 tools/perf/util/bpf_off_cpu.c            |  3 +++
 tools/perf/util/bpf_skel/off_cpu.bpf.c   |  2 +-
 tools/perf/util/off_cpu.h                |  1 +
 tools/perf/util/record.h                 |  1 +
 6 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 80686d590de2..3a87e635f52c 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -833,6 +833,15 @@ filtered through the mask provided by -C option.
 	only, as of now.  So the applications built without the frame
 	pointer might see bogus addresses.
 
+	off-cpu profiling consists of two types of samples: direct samples, which
+	share the same behavior as regular samples, and the accumulated
+	samples, stored in BPF stack trace map, presented after all the regular
+	samples.
+
+--off-cpu-thresh::
+	Once a task's off-cpu time reaches this threshold (in milliseconds), it
+	generates a direct off-cpu sample. The default is 500ms.
+
 --setup-filter=<action>::
 	Prepare BPF filter to be used by regular users.  The action should be
 	either "pin" or "unpin".  The filter can be used after it's pinned.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f3e5f856f4a4..4bdc7a0111ef 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3158,6 +3158,28 @@ static int record__parse_mmap_pages(const struct option *opt,
 	return ret;
 }
 
+static int record__parse_off_cpu_thresh(const struct option *opt,
+					const char *str,
+					int unset __maybe_unused)
+{
+	struct record_opts *opts = opt->value;
+	char *endptr;
+	u64 off_cpu_thresh_ms;
+
+	if (!str)
+		return -EINVAL;
+
+	off_cpu_thresh_ms = strtoull(str, &endptr, 10);
+
+	/* the threshold isn't string "0", yet strtoull() returns 0, parsing failed */
+	if (*endptr || (off_cpu_thresh_ms == 0 && strcmp(str, "0")))
+		return -EINVAL;
+	else
+		opts->off_cpu_thresh_ns = off_cpu_thresh_ms * NSEC_PER_MSEC;
+
+	return 0;
+}
+
 void __weak arch__add_leaf_frame_record_opts(struct record_opts *opts __maybe_unused)
 {
 }
@@ -3351,6 +3373,7 @@ static struct record record = {
 		.ctl_fd              = -1,
 		.ctl_fd_ack          = -1,
 		.synth               = PERF_SYNTH_ALL,
+		.off_cpu_thresh_ns   = OFFCPU_THRESH,
 	},
 };
 
@@ -3573,6 +3596,9 @@ static struct option __record_options[] = {
 	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
 	OPT_STRING(0, "setup-filter", &record.filter_action, "pin|unpin",
 		   "BPF filter action"),
+	OPT_CALLBACK(0, "off-cpu-thresh", &record.opts, "ms",
+		     "Dump off-cpu samples if off-cpu time exceeds this threshold (in milliseconds). (Default: 500ms)",
+		     record__parse_off_cpu_thresh),
 	OPT_END()
 };
 
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index e55693bcbf08..00736fb2678c 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -14,6 +14,7 @@
 #include "util/strlist.h"
 #include <bpf/bpf.h>
 #include <internal/xyarray.h>
+#include <linux/time64.h>
 
 #include "bpf_skel/off_cpu.skel.h"
 
@@ -291,6 +292,8 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		}
 	}
 
+	skel->bss->offcpu_thresh_ns = opts->off_cpu_thresh_ns;
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu BPF skeleton\n");
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index c15b69586723..8df35541141b 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -124,7 +124,7 @@ const volatile bool uses_cgroup_v1 = false;
 
 int perf_subsys_id = -1;
 
-__u64 offcpu_thresh_ns = 500000000ull;
+__u64 offcpu_thresh_ns;
 
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2a4b7f9b2c4c..64bf763ddf50 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -16,6 +16,7 @@ struct record_opts;
 			      PERF_SAMPLE_PERIOD | PERF_SAMPLE_RAW | \
 			      PERF_SAMPLE_CGROUP)
 
+#define OFFCPU_THRESH 500000000ULL
 
 #ifdef HAVE_BPF_SKEL
 int off_cpu_prepare(struct evlist *evlist, struct target *target,
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..c82db4833b0a 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -79,6 +79,7 @@ struct record_opts {
 	int	      synth;
 	int	      threads_spec;
 	const char    *threads_user_spec;
+	u64	      off_cpu_thresh_ns;
 };
 
 extern const char * const *record_usage;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v15 10/10] perf test: Add direct off-cpu test
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (8 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 09/10] perf record --off-cpu: Add --off-cpu-thresh option Howard Chu
@ 2025-02-13 23:00 ` Howard Chu
  2025-02-13 23:04   ` Howard Chu
  2025-02-28 16:36 ` [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Gautam Menghani
  10 siblings, 1 reply; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:00 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Howard Chu, Alexander Shishkin, Ingo Molnar,
	James Clark, Kan Liang, Mark Rutland, Peter Zijlstra,
	Arnaldo Carvalho de Melo

Why is there a --off-cpu-thresh 2000?

We collect an off-cpu period __ONLY ONCE__, either in direct sample form,
or in accumulated form (in BPF stack trace map).

If I don't add --off-cpu-thresh 2000, the sample in the original test
goes into the ring buffer instead of the BPF stack trace map.

Additionally, when using -e dummy, the ring buffer is not open, causing
us to lose a sample.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241108204137.2444151-11-howardchu95@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/tests/shell/record_offcpu.sh | 71 +++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
index 678947fe69ee..c5d6cae94c65 100755
--- a/tools/perf/tests/shell/record_offcpu.sh
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -7,6 +7,9 @@ set -e
 err=0
 perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
 
+ts=$(printf "%u" $((~0 << 32))) # OFF_CPU_TIMESTAMP
+dummy_timestamp=${ts%???} # remove the last 3 digits to match perf script
+
 cleanup() {
   rm -f ${perfdata}
   rm -f ${perfdata}.old
@@ -19,6 +22,9 @@ trap_cleanup() {
 }
 trap trap_cleanup EXIT TERM INT
 
+test_over_thresh="Threshold test (over threshold)"
+test_below_thresh="Threshold test (below threshold)"
+
 test_offcpu_priv() {
   echo "Checking off-cpu privilege"
 
@@ -88,6 +94,63 @@ test_offcpu_child() {
   echo "Child task off-cpu test [Success]"
 }
 
+# task blocks longer than the --off-cpu-thresh, perf should collect a direct sample
+test_offcpu_over_thresh() {
+  echo "${test_over_thresh}"
+
+  # collect direct off-cpu samples for tasks blocked for more than 999ms
+  if ! perf record -e dummy --off-cpu --off-cpu-thresh 999 -o ${perfdata} -- sleep 1 2> /dev/null
+  then
+    echo "${test_over_thresh} [Failed record]"
+    err=1
+    return
+  fi
+  # direct sample's timestamp should be lower than the dummy_timestamp of the at-the-end sample
+  # check if a direct sample exists
+  if ! perf script --time "0, ${dummy_timestamp}" -i ${perfdata} -F event | grep -q "offcpu-time"
+  then
+    echo "${test_over_thresh} [Failed missing direct samples]"
+    err=1
+    return
+  fi
+  # there should only be one direct sample, and its period should be higher than off-cpu-thresh
+  if ! perf script --time "0, ${dummy_timestamp}" -i ${perfdata} -F period | \
+       awk '{ if (int($1) > 999000000) exit 0; else exit 1; }'
+  then
+    echo "${test_over_thresh} [Failed off-cpu time too short]"
+    err=1
+    return
+  fi
+  echo "${test_over_thresh} [Success]"
+}
+
+# task blocks shorter than the --off-cpu-thresh, perf should collect an at-the-end sample
+test_offcpu_below_thresh() {
+  echo "${test_below_thresh}"
+
+  # collect direct off-cpu samples for tasks blocked for more than 1.2s
+  if ! perf record -e dummy --off-cpu --off-cpu-thresh 12000 -o ${perfdata} -- sleep 1 2> /dev/null
+  then
+    echo "${test_below_thresh} [Failed record]"
+    err=1
+    return
+  fi
+  # see if there's an at-the-end sample
+  if ! perf script --time "${dummy_timestamp}," -i ${perfdata} -F event | grep -q 'offcpu-time'
+  then
+    echo "${test_below_thresh} [Failed at-the-end samples cannot be found]"
+    err=1
+    return
+  fi
+  # plus there shouldn't be any direct samples
+  if perf script --time "0, ${dummy_timestamp}" -i ${perfdata} -F event | grep -q 'offcpu-time'
+  then
+    echo "${test_below_thresh} [Failed direct samples are found when they shouldn't be]"
+    err=1
+    return
+  fi
+  echo "${test_below_thresh} [Success]"
+}
 
 test_offcpu_priv
 
@@ -99,5 +162,13 @@ if [ $err = 0 ]; then
   test_offcpu_child
 fi
 
+if [ $err = 0 ]; then
+  test_offcpu_over_thresh
+fi
+
+if [ $err = 0 ]; then
+  test_offcpu_below_thresh
+fi
+
 cleanup
 exit $err
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v15 10/10] perf test: Add direct off-cpu test
  2025-02-13 23:00 ` [PATCH v15 10/10] perf test: Add direct off-cpu test Howard Chu
@ 2025-02-13 23:04   ` Howard Chu
  2025-02-18 19:32     ` Ian Rogers
  0 siblings, 1 reply; 16+ messages in thread
From: Howard Chu @ 2025-02-13 23:04 UTC (permalink / raw)
  To: acme
  Cc: namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel, Alexander Shishkin, Ingo Molnar, James Clark,
	Kan Liang, Mark Rutland, Peter Zijlstra, Arnaldo Carvalho de Melo

Hello,

On Thu, Feb 13, 2025 at 3:00 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> Why is there a --off-cpu-thresh 2000?
>
> We collect an off-cpu period __ONLY ONCE__, either in direct sample form,
> or in accumulated form (in BPF stack trace map).
>
> If I don't add --off-cpu-thresh 2000, the sample in the original test
> goes into the ring buffer instead of the BPF stack trace map.
>
> Additionally, when using -e dummy, the ring buffer is not open, causing
> us to lose a sample.

Just noticed that this commit message is wrong, should be:
"""
Add tests for direct off-cpu samples and --off-cpu-thresh option.
"""

Sorry.

Thanks,
Howard

>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Ian Rogers <irogers@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Clark <james.clark@linaro.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Kan Liang <kan.liang@linux.intel.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20241108204137.2444151-11-howardchu95@gmail.com
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  tools/perf/tests/shell/record_offcpu.sh | 71 +++++++++++++++++++++++++
>  1 file changed, 71 insertions(+)
>
> diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
> index 678947fe69ee..c5d6cae94c65 100755
> --- a/tools/perf/tests/shell/record_offcpu.sh
> +++ b/tools/perf/tests/shell/record_offcpu.sh
> @@ -7,6 +7,9 @@ set -e
>  err=0
>  perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
>
> +ts=$(printf "%u" $((~0 << 32))) # OFF_CPU_TIMESTAMP
> +dummy_timestamp=${ts%???} # remove the last 3 digits to match perf script
> +
>  cleanup() {
>    rm -f ${perfdata}
>    rm -f ${perfdata}.old
> @@ -19,6 +22,9 @@ trap_cleanup() {
>  }
>  trap trap_cleanup EXIT TERM INT
>
> +test_over_thresh="Threshold test (over threshold)"
> +test_below_thresh="Threshold test (below threshold)"
> +
>  test_offcpu_priv() {
>    echo "Checking off-cpu privilege"
>
> @@ -88,6 +94,63 @@ test_offcpu_child() {
>    echo "Child task off-cpu test [Success]"
>  }
>
> +# task blocks longer than the --off-cpu-thresh, perf should collect a direct sample
> +test_offcpu_over_thresh() {
> +  echo "${test_over_thresh}"
> +
> +  # collect direct off-cpu samples for tasks blocked for more than 999ms
> +  if ! perf record -e dummy --off-cpu --off-cpu-thresh 999 -o ${perfdata} -- sleep 1 2> /dev/null
> +  then
> +    echo "${test_over_thresh} [Failed record]"
> +    err=1
> +    return
> +  fi
> +  # direct sample's timestamp should be lower than the dummy_timestamp of the at-the-end sample
> +  # check if a direct sample exists
> +  if ! perf script --time "0, ${dummy_timestamp}" -i ${perfdata} -F event | grep -q "offcpu-time"
> +  then
> +    echo "${test_over_thresh} [Failed missing direct samples]"
> +    err=1
> +    return
> +  fi
> +  # there should only be one direct sample, and its period should be higher than off-cpu-thresh
> +  if ! perf script --time "0, ${dummy_timestamp}" -i ${perfdata} -F period | \
> +       awk '{ if (int($1) > 999000000) exit 0; else exit 1; }'
> +  then
> +    echo "${test_over_thresh} [Failed off-cpu time too short]"
> +    err=1
> +    return
> +  fi
> +  echo "${test_over_thresh} [Success]"
> +}
> +
> +# task blocks shorter than the --off-cpu-thresh, perf should collect an at-the-end sample
> +test_offcpu_below_thresh() {
> +  echo "${test_below_thresh}"
> +
> +  # collect direct off-cpu samples for tasks blocked for more than 1.2s
> +  if ! perf record -e dummy --off-cpu --off-cpu-thresh 12000 -o ${perfdata} -- sleep 1 2> /dev/null
> +  then
> +    echo "${test_below_thresh} [Failed record]"
> +    err=1
> +    return
> +  fi
> +  # see if there's an at-the-end sample
> +  if ! perf script --time "${dummy_timestamp}," -i ${perfdata} -F event | grep -q 'offcpu-time'
> +  then
> +    echo "${test_below_thresh} [Failed at-the-end samples cannot be found]"
> +    err=1
> +    return
> +  fi
> +  # plus there shouldn't be any direct samples
> +  if perf script --time "0, ${dummy_timestamp}" -i ${perfdata} -F event | grep -q 'offcpu-time'
> +  then
> +    echo "${test_below_thresh} [Failed direct samples are found when they shouldn't be]"
> +    err=1
> +    return
> +  fi
> +  echo "${test_below_thresh} [Success]"
> +}
>
>  test_offcpu_priv
>
> @@ -99,5 +162,13 @@ if [ $err = 0 ]; then
>    test_offcpu_child
>  fi
>
> +if [ $err = 0 ]; then
> +  test_offcpu_over_thresh
> +fi
> +
> +if [ $err = 0 ]; then
> +  test_offcpu_below_thresh
> +fi
> +
>  cleanup
>  exit $err
> --
> 2.45.2
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v15 10/10] perf test: Add direct off-cpu test
  2025-02-13 23:04   ` Howard Chu
@ 2025-02-18 19:32     ` Ian Rogers
  2025-02-19  0:58       ` Howard Chu
  0 siblings, 1 reply; 16+ messages in thread
From: Ian Rogers @ 2025-02-18 19:32 UTC (permalink / raw)
  To: Howard Chu
  Cc: acme, namhyung, jolsa, adrian.hunter, linux-perf-users,
	linux-kernel, Alexander Shishkin, Ingo Molnar, James Clark,
	Kan Liang, Mark Rutland, Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, Feb 13, 2025 at 3:04 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> Hello,
>
> On Thu, Feb 13, 2025 at 3:00 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Why is there a --off-cpu-thresh 2000?
> >
> > We collect an off-cpu period __ONLY ONCE__, either in direct sample form,
> > or in accumulated form (in BPF stack trace map).
> >
> > If I don't add --off-cpu-thresh 2000, the sample in the original test
> > goes into the ring buffer instead of the BPF stack trace map.
> >
> > Additionally, when using -e dummy, the ring buffer is not open, causing
> > us to lose a sample.
>
> Just noticed that this commit message is wrong, should be:
> """
> Add tests for direct off-cpu samples and --off-cpu-thresh option.
> """

Tested-by: Ian Rogers <irogers@google.com>
```
121: perf record offcpu profiling tests                              : Ok
```
I'd be tempted to keep the comments about why 2000 next to the actual
code rather than in the commit message. In the code the value is 12000
and not 2000 though?

Thanks,
Ian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v15 10/10] perf test: Add direct off-cpu test
  2025-02-18 19:32     ` Ian Rogers
@ 2025-02-19  0:58       ` Howard Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-19  0:58 UTC (permalink / raw)
  To: Ian Rogers
  Cc: acme, namhyung, jolsa, adrian.hunter, linux-perf-users,
	linux-kernel, Alexander Shishkin, Ingo Molnar, James Clark,
	Kan Liang, Mark Rutland, Peter Zijlstra, Arnaldo Carvalho de Melo

Hello Ian,

Thanks for testing this patch :).

On Tue, Feb 18, 2025 at 11:32 AM Ian Rogers <irogers@google.com> wrote:
>
> On Thu, Feb 13, 2025 at 3:04 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Hello,
> >
> > On Thu, Feb 13, 2025 at 3:00 PM Howard Chu <howardchu95@gmail.com> wrote:
> > >
> > > Why is there a --off-cpu-thresh 2000?
> > >
> > > We collect an off-cpu period __ONLY ONCE__, either in direct sample form,
> > > or in accumulated form (in BPF stack trace map).
> > >
> > > If I don't add --off-cpu-thresh 2000, the sample in the original test
> > > goes into the ring buffer instead of the BPF stack trace map.
> > >
> > > Additionally, when using -e dummy, the ring buffer is not open, causing
> > > us to lose a sample.
> >
> > Just noticed that this commit message is wrong, should be:
> > """
> > Add tests for direct off-cpu samples and --off-cpu-thresh option.
> > """
>
> Tested-by: Ian Rogers <irogers@google.com>
> ```
> 121: perf record offcpu profiling tests                              : Ok
> ```
> I'd be tempted to keep the comments about why 2000 next to the actual
> code rather than in the commit message. In the code the value is 12000
> and not 2000 though?

I actually deleted the --off-cpu-thresh 2000. It was intended to fix
Namhyung's original test because I forgot to enable the off-cpu event.
Now that recording the off-cpu time of a task is fixed, that
workaround is no longer necessary. The --off-cpu-thresh 12000 option
forces 'sleep 1' to produce an at-the-end sample, because only tasks
that have been off the CPU for more than 1.2 seconds can emit direct
samples. Since this records an off-cpu period below the threshold, I
think it makes sense to keep it in test_offcpu_below_thresh(). That
being said, if you'd like me to add a comment or two, I'd be glad to
do so.

Thanks,
Howard

>
> Thanks,
> Ian
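
The threshold behaviour discussed in this reply can be illustrated with
a minimal sketch. It is not the code from the series: the helper names
thresh_ms_to_ns() and emits_direct_sample() are hypothetical, and the
strict "greater than" comparison is an assumption taken from the
"more than" wording above. The only things borrowed from the thread are
that --off-cpu-thresh is given in milliseconds and that the internal
threshold is kept in nanoseconds.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical helper: convert an --off-cpu-thresh value (ms) to ns. */
static uint64_t thresh_ms_to_ns(uint64_t thresh_ms)
{
	return thresh_ms * NSEC_PER_MSEC;
}

/*
 * Hypothetical helper: decide how one off-cpu period is reported.
 * Returns true if it would be dumped as a direct sample, false if it
 * would be accumulated and reported at the end of the session.
 * Assumption: the comparison is strict, following "more than" above.
 */
static bool emits_direct_sample(uint64_t off_cpu_ns, uint64_t thresh_ns)
{
	return off_cpu_ns > thresh_ns;
}
```

With a threshold of 12000, a task that sleeps for one second
(1000 * NSEC_PER_MSEC) stays below the threshold, so
emits_direct_sample() returns false and the period would only show up
as an at-the-end sample. Each period takes exactly one of the two
paths, which matches the "only once" rule quoted earlier in the thread.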


* Re: [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly
  2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
                   ` (9 preceding siblings ...)
  2025-02-13 23:00 ` [PATCH v15 10/10] perf test: Add direct off-cpu test Howard Chu
@ 2025-02-28 16:36 ` Gautam Menghani
  2025-02-28 17:40   ` Howard Chu
  10 siblings, 1 reply; 16+ messages in thread
From: Gautam Menghani @ 2025-02-28 16:36 UTC (permalink / raw)
  To: Howard Chu
  Cc: acme, namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel

Tested this series on IBM Power machines (both pseries and PowerNV)

Tested-by: Gautam Menghani <gautam@linux.ibm.com>

Thanks,
Gautam


* Re: [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly
  2025-02-28 16:36 ` [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Gautam Menghani
@ 2025-02-28 17:40   ` Howard Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Howard Chu @ 2025-02-28 17:40 UTC (permalink / raw)
  To: Gautam Menghani
  Cc: acme, namhyung, jolsa, irogers, adrian.hunter, linux-perf-users,
	linux-kernel

Hello Gautam,

On Fri, Feb 28, 2025 at 8:37 AM Gautam Menghani <gautam@linux.ibm.com> wrote:
>
> Tested this series on IBM Power machines (both pseries and PowerNV)
>
> Tested-by: Gautam Menghani <gautam@linux.ibm.com>

Thanks for testing this series. :)

Thanks,
Howard

>
> Thanks,
> Gautam


Thread overview: 16+ messages
2025-02-13 22:59 [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Howard Chu
2025-02-13 23:00 ` [PATCH v15 01/10] perf evsel: Expose evsel__is_offcpu_event() for future use Howard Chu
2025-02-13 23:00 ` [PATCH v15 02/10] perf record --off-cpu: Parse off-cpu event Howard Chu
2025-02-13 23:00 ` [PATCH v15 03/10] perf record --off-cpu: Preparation of off-cpu BPF program Howard Chu
2025-02-13 23:00 ` [PATCH v15 04/10] perf record --off-cpu: Dump off-cpu samples in BPF Howard Chu
2025-02-13 23:00 ` [PATCH v15 05/10] perf evsel: Assemble offcpu samples Howard Chu
2025-02-13 23:00 ` [PATCH v15 06/10] perf record --off-cpu: Disable perf_event's callchain collection Howard Chu
2025-02-13 23:00 ` [PATCH v15 07/10] perf script: Display off-cpu samples correctly Howard Chu
2025-02-13 23:00 ` [PATCH v15 08/10] perf record --off-cpu: Dump the remaining samples in BPF's stack trace map Howard Chu
2025-02-13 23:00 ` [PATCH v15 09/10] perf record --off-cpu: Add --off-cpu-thresh option Howard Chu
2025-02-13 23:00 ` [PATCH v15 10/10] perf test: Add direct off-cpu test Howard Chu
2025-02-13 23:04   ` Howard Chu
2025-02-18 19:32     ` Ian Rogers
2025-02-19  0:58       ` Howard Chu
2025-02-28 16:36 ` [PATCH v15 00/10] perf record --off-cpu: Dump off-cpu samples directly Gautam Menghani
2025-02-28 17:40   ` Howard Chu
