* [PATCH RFC 0/3] perf: add logic to collect off-cpu samples
@ 2024-07-11 12:16 Ajay Kaher
2024-07-11 12:16 ` [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample Ajay Kaher
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Ajay Kaher @ 2024-07-11 12:16 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung
Cc: mark.rutland, rostedt, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, yangjihong1, zegao2021, leo.yan,
asmadeus, siyanteng, sunhaiyong, linux-perf-users, linux-kernel,
ajay.kaher, alexey.makhalov, vasavi.sirnapalli
Add the --off-cpu-kernel option to capture off-cpu samples along with
on-cpu samples.
Off-cpu samples represent the time a task spent on a wait queue
(scheduled out while waiting for events, blocked on I/O, locks, timers,
paging/swapping, etc.).
Refer to the following links for more details:
https://lpc.events/event/17/contributions/1556/
https://www.youtube.com/watch?v=sF2faKGRnjs
Ajay Kaher (3):
perf/core: add logic to collect off-cpu sample
perf/record: add options --off-cpu-kernel
perf/report: add off-cpu samples
include/linux/perf_event.h | 16 ++++++++++++++
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 27 ++++++++++++++++++-----
tools/include/uapi/linux/perf_event.h | 3 ++-
tools/perf/builtin-record.c | 2 ++
tools/perf/util/events_stats.h | 2 ++
tools/perf/util/evsel.c | 4 ++++
tools/perf/util/hist.c | 31 ++++++++++++++++++++++++---
tools/perf/util/hist.h | 1 +
tools/perf/util/record.h | 1 +
tools/perf/util/sample.h | 1 +
11 files changed, 81 insertions(+), 10 deletions(-)
--
2.39.0
* [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample
2024-07-11 12:16 [PATCH RFC 0/3] perf: add logic to collect off-cpu samples Ajay Kaher
@ 2024-07-11 12:16 ` Ajay Kaher
2024-07-11 21:49 ` Peter Zijlstra
2024-07-11 12:16 ` [PATCH RFC 2/3] perf/record: add options --off-cpu-kernel Ajay Kaher
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Ajay Kaher @ 2024-07-11 12:16 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung
Cc: mark.rutland, rostedt, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, yangjihong1, zegao2021, leo.yan,
asmadeus, siyanteng, sunhaiyong, linux-perf-users, linux-kernel,
ajay.kaher, alexey.makhalov, vasavi.sirnapalli
The following logic has been added to collect off-cpu samples:
- 'task_pt_regs(current)' is used to capture the register state
for the off-cpu sample.
- Off-cpu time is the period during which the target process is
not running on the cpu. It is calculated as:
off-cpu time = swap-in time - swap-out time
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
include/linux/perf_event.h | 16 ++++++++++++++++
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 27 ++++++++++++++++++++++-----
tools/include/uapi/linux/perf_event.h | 3 ++-
4 files changed, 42 insertions(+), 7 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a5304ae8c654..09dc3f695974 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -599,6 +599,11 @@ enum perf_event_state {
PERF_EVENT_STATE_ACTIVE = 1,
};
+enum perf_sample_cpu {
+ PERF_SAMPLE_ON_CPU,
+ PERF_SAMPLE_OFF_CPU,
+};
+
struct file;
struct perf_sample_data;
@@ -828,6 +833,7 @@ struct perf_event {
void *security;
#endif
struct list_head sb_list;
+ u64 stop_time;
/*
* Certain events gets forwarded to another pmu internally by over-
@@ -1301,6 +1307,16 @@ static inline u32 perf_sample_data_size(struct perf_sample_data *data,
return size;
}
+static inline void perf_sample_set_off_cpu(struct perf_sample_data *data)
+{
+ data->cpu_entry.reserved = PERF_SAMPLE_OFF_CPU;
+}
+
+static inline void perf_sample_set_on_cpu(struct perf_sample_data *data)
+{
+ data->cpu_entry.reserved = PERF_SAMPLE_ON_CPU;
+}
+
/*
* Clear all bitfields in the perf_branch_entry.
* The to and from fields are not cleared because they are
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 3a64499b0f5d..9d0a23ab2549 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ off_cpu : 1, /* include off_cpu sample */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f908f077935..e71eb7936134 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7168,10 +7168,8 @@ static void __perf_event_header__init_id(struct perf_sample_data *data,
if (sample_type & PERF_SAMPLE_STREAM_ID)
data->stream_id = event->id;
- if (sample_type & PERF_SAMPLE_CPU) {
- data->cpu_entry.cpu = raw_smp_processor_id();
- data->cpu_entry.reserved = 0;
- }
+ if (sample_type & PERF_SAMPLE_CPU)
+ data->cpu_entry.cpu = raw_smp_processor_id();
}
void perf_event_header__init_id(struct perf_event_header *header,
@@ -11083,8 +11081,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
return HRTIMER_NORESTART;
event->pmu->read(event);
-
perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_set_on_cpu(&data);
regs = get_irq_regs();
if (regs && !perf_exclude_event(event, regs)) {
@@ -11099,6 +11097,18 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
return ret;
}
+static void perf_swevent_offCPUSample(struct perf_event *event, u64 period)
+{
+ struct perf_sample_data data;
+ struct pt_regs *regs;
+
+ event->pmu->read(event);
+ perf_sample_data_init(&data, 0, period);
+ perf_sample_set_off_cpu(&data);
+ regs = task_pt_regs(current);
+ __perf_event_overflow(event, 1, &data, regs);
+}
+
static void perf_swevent_start_hrtimer(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -11107,6 +11117,11 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
if (!is_sampling_event(event))
return;
+ if (event->attr.off_cpu && event->stop_time && hwc->sample_period) {
+ perf_swevent_offCPUSample(event, perf_clock() - event->stop_time);
+ event->stop_time = 0;
+ }
+
period = local64_read(&hwc->period_left);
if (period) {
if (period < 0)
@@ -11128,6 +11143,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
local64_set(&hwc->period_left, ktime_to_ns(remaining));
+ event->stop_time = perf_clock();
hrtimer_cancel(&hwc->hrtimer);
}
}
@@ -11139,6 +11155,7 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
if (!is_sampling_event(event))
return;
+ event->stop_time = 0;
hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
hwc->hrtimer.function = perf_swevent_hrtimer;
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 3a64499b0f5d..9d0a23ab2549 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ off_cpu : 1, /* include off_cpu sample */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
--
2.39.0
* [PATCH RFC 2/3] perf/record: add options --off-cpu-kernel
2024-07-11 12:16 [PATCH RFC 0/3] perf: add logic to collect off-cpu samples Ajay Kaher
2024-07-11 12:16 ` [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample Ajay Kaher
@ 2024-07-11 12:16 ` Ajay Kaher
2024-07-11 12:16 ` [PATCH RFC 3/3] perf/report: add off-cpu samples Ajay Kaher
2024-07-11 21:58 ` [PATCH RFC 0/3] perf: add logic to collect " Ian Rogers
3 siblings, 0 replies; 10+ messages in thread
From: Ajay Kaher @ 2024-07-11 12:16 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung
Cc: mark.rutland, rostedt, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, yangjihong1, zegao2021, leo.yan,
asmadeus, siyanteng, sunhaiyong, linux-perf-users, linux-kernel,
ajay.kaher, alexey.makhalov, vasavi.sirnapalli
Add the --off-cpu-kernel option to collect off-cpu samples using on/off cpu time.
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
tools/perf/builtin-record.c | 2 ++
tools/perf/util/evsel.c | 2 ++
tools/perf/util/record.h | 1 +
3 files changed, 5 insertions(+)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0a8ba1323d64..5be172537330 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3571,6 +3571,8 @@ static struct option __record_options[] = {
"write collected trace data into several data files using parallel threads",
record__parse_threads),
OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
+ OPT_BOOLEAN(0, "off-cpu-kernel", &record.opts.off_cpu_kernel,
+ "Enable kernel based off-cpu analysis"),
OPT_END()
};
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 4f818ab6b662..8ba890a5ac6e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -862,6 +862,8 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
attr->exclude_callchain_user = 1;
if (opts->user_callchains)
attr->exclude_callchain_kernel = 1;
+ if (opts->off_cpu_kernel)
+ attr->off_cpu = 1;
if (param->record_mode == CALLCHAIN_LBR) {
if (!opts->branch_stack) {
if (attr->exclude_user) {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..cfa5e34b78ad 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -52,6 +52,7 @@ struct record_opts {
bool kcore;
bool text_poke;
bool build_id;
+ bool off_cpu_kernel;
unsigned int freq;
unsigned int mmap_pages;
unsigned int auxtrace_mmap_pages;
--
2.39.0
* [PATCH RFC 3/3] perf/report: add off-cpu samples
2024-07-11 12:16 [PATCH RFC 0/3] perf: add logic to collect off-cpu samples Ajay Kaher
2024-07-11 12:16 ` [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample Ajay Kaher
2024-07-11 12:16 ` [PATCH RFC 2/3] perf/record: add options --off-cpu-kernel Ajay Kaher
@ 2024-07-11 12:16 ` Ajay Kaher
2024-07-11 21:58 ` [PATCH RFC 0/3] perf: add logic to collect " Ian Rogers
3 siblings, 0 replies; 10+ messages in thread
From: Ajay Kaher @ 2024-07-11 12:16 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung
Cc: mark.rutland, rostedt, alexander.shishkin, jolsa, irogers,
adrian.hunter, kan.liang, yangjihong1, zegao2021, leo.yan,
asmadeus, siyanteng, sunhaiyong, linux-perf-users, linux-kernel,
ajay.kaher, alexey.makhalov, vasavi.sirnapalli
Off-cpu samples represent the time period during which the target
process is not running on the cpu.
In the following example, perf has collected 15 off-cpu samples and
the program was running on the cpu for 27% of the time:
Samples: 24 of 'task-clock:ppp', 15 of 'offcpu', Event count: ~9150831908 (73% offcpu)
+73.77% 73.77% a.out libc.so.6 [.] clock_nanosleep <-- off-cpu sample
+24.04% 24.04% a.out [vdso] [.] __vdso_gettimeofday <-- on-cpu sample
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
---
tools/perf/util/events_stats.h | 2 ++
tools/perf/util/evsel.c | 2 ++
tools/perf/util/hist.c | 31 ++++++++++++++++++++++++++++---
tools/perf/util/hist.h | 1 +
tools/perf/util/sample.h | 1 +
5 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/tools/perf/util/events_stats.h b/tools/perf/util/events_stats.h
index 8fecc9fbaecc..7bb3cf1ab835 100644
--- a/tools/perf/util/events_stats.h
+++ b/tools/perf/util/events_stats.h
@@ -44,8 +44,10 @@ struct events_stats {
struct hists_stats {
u64 total_period;
+ u64 total_period_off_cpu;
u64 total_non_filtered_period;
u32 nr_samples;
+ u64 nr_samples_off_cpu;
u32 nr_non_filtered_samples;
u32 nr_lost_samples;
};
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 8ba890a5ac6e..ea41586474e3 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1146,6 +1146,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
attr->write_backward = opts->overwrite ? 1 : 0;
attr->read_format = PERF_FORMAT_LOST;
+ evsel__set_sample_bit(evsel, CPU);
evsel__set_sample_bit(evsel, IP);
evsel__set_sample_bit(evsel, TID);
@@ -2438,6 +2439,7 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
u.val32[0] = bswap_32(u.val32[0]);
}
+ data->off_cpu = u.val32[1];
data->cpu = u.val32[0];
array++;
}
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 2e9e193179dd..251333e0b021 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -23,6 +23,7 @@
#include "thread.h"
#include "block-info.h"
#include "ui/progress.h"
+#include "ui/util.h"
#include <errno.h>
#include <math.h>
#include <inttypes.h>
@@ -725,6 +726,7 @@ __hists__add_entry(struct hists *hists,
.socket = al->socket,
.cpu = al->cpu,
.cpumode = al->cpumode,
+ .off_cpu = sample->off_cpu,
.ip = al->addr,
.level = al->level,
.code_page_size = sample->code_page_size,
@@ -1076,6 +1078,8 @@ iter_add_single_cumulative_entry(struct hist_entry_iter *iter,
callchain_cursor_commit(get_tls_callchain_cursor());
hists__inc_nr_samples(hists, he->filtered);
+ if (sample->off_cpu)
+ ++hists->stats.nr_samples_off_cpu;
return err;
}
@@ -1740,6 +1744,7 @@ void hists__reset_stats(struct hists *hists)
{
hists->nr_entries = 0;
hists->stats.total_period = 0;
+ hists->stats.total_period_off_cpu = 0;
hists__reset_filter_stats(hists);
}
@@ -1757,6 +1762,9 @@ void hists__inc_stats(struct hists *hists, struct hist_entry *h)
hists->nr_entries++;
hists->stats.total_period += h->stat.period;
+
+ if (h->off_cpu)
+ hists->stats.total_period_off_cpu += h->stat.period;
}
static void hierarchy_recalc_total_periods(struct hists *hists)
@@ -2745,14 +2753,20 @@ int __hists__scnprintf_title(struct hists *hists, char *bf, size_t size, bool sh
struct thread *thread = hists->thread_filter;
int socket_id = hists->socket_filter;
unsigned long nr_samples = hists->stats.nr_samples;
+ unsigned long nr_samples_off_cpu = hists->stats.nr_samples_off_cpu;
u64 nr_events = hists->stats.total_period;
+ int nr_events_off_cpu_percentage = (hists->stats.total_period_off_cpu * 100) / nr_events;
struct evsel *evsel = hists_to_evsel(hists);
const char *ev_name = evsel__name(evsel);
char buf[512], sample_freq_str[64] = "";
+ char oncpu_str[128] = "";
+ char offcpu_str[128] = "";
+ char offcpu_percentage_str[128] = "";
size_t buflen = sizeof(buf);
char ref[30] = " show reference callgraph, ";
bool enable_ref = false;
+
if (symbol_conf.filter_relative) {
nr_samples = hists->stats.nr_non_filtered_samples;
nr_events = hists->stats.total_non_filtered_period;
@@ -2785,10 +2799,21 @@ int __hists__scnprintf_title(struct hists *hists, char *bf, size_t size, bool sh
scnprintf(sample_freq_str, sizeof(sample_freq_str), " %d Hz,", evsel->core.attr.sample_freq);
nr_samples = convert_unit(nr_samples, &unit);
+
+ scnprintf(oncpu_str, sizeof(oncpu_str), "%lu%c of '%s',",
+ nr_samples - nr_samples_off_cpu, unit, ev_name);
+
+ if (evsel->core.attr.off_cpu) {
+ scnprintf(offcpu_str, sizeof(offcpu_str), "%lu%c of '%s',",
+ nr_samples_off_cpu, unit, "offcpu");
+ scnprintf(offcpu_percentage_str, sizeof(offcpu_percentage_str),
+ "(%d%% offcpu)", nr_events_off_cpu_percentage);
+ }
+
printed = scnprintf(bf, size,
- "Samples: %lu%c of event%s '%s',%s%sEvent count (approx.): %" PRIu64,
- nr_samples, unit, evsel->core.nr_members > 1 ? "s" : "",
- ev_name, sample_freq_str, enable_ref ? ref : " ", nr_events);
+ "Samples: %s %s %s%sEvent count: ~%" PRIu64 " %s",
+ oncpu_str, offcpu_str, sample_freq_str, enable_ref ? ref : " ",
+ nr_events, offcpu_percentage_str);
if (hists->uid_filter_str)
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 8fb3bdd29188..c64a07ce92fb 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -236,6 +236,7 @@ struct hist_entry {
/* We are added by hists__add_dummy_entry. */
bool dummy;
bool leaf;
+ bool off_cpu;
char level;
u8 filtered;
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 70b2c3135555..59b0951f4718 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -109,6 +109,7 @@ struct perf_sample {
u16 retire_lat;
};
bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool off_cpu;
char insn[MAX_INSN];
void *raw_data;
struct ip_callchain *callchain;
--
2.39.0
* Re: [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample
2024-07-11 12:16 ` [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample Ajay Kaher
@ 2024-07-11 21:49 ` Peter Zijlstra
2024-07-14 16:23 ` Ajay Kaher
0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2024-07-11 21:49 UTC (permalink / raw)
To: Ajay Kaher
Cc: mingo, acme, namhyung, mark.rutland, rostedt, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, yangjihong1, zegao2021,
leo.yan, asmadeus, siyanteng, sunhaiyong, linux-perf-users,
linux-kernel, alexey.makhalov, vasavi.sirnapalli
On Thu, Jul 11, 2024 at 05:46:17PM +0530, Ajay Kaher wrote:
> following logics has been added to collect the off-cpu sample:
>
> - 'task_pt_regs(current)' has been used to capture registers
> status off-cpu sample.
>
> - off-cpu time represent the time period for which the target
> process not occupying the cpu cycles. And calculate as:
>
> off-cpu time = swap-in time - swap-out time
>
I have absolutely no idea what you're trying to do :/ The above does not
constitute a comprehensible Changelog.
* Re: [PATCH RFC 0/3] perf: add logic to collect off-cpu samples
2024-07-11 12:16 [PATCH RFC 0/3] perf: add logic to collect off-cpu samples Ajay Kaher
` (2 preceding siblings ...)
2024-07-11 12:16 ` [PATCH RFC 3/3] perf/report: add off-cpu samples Ajay Kaher
@ 2024-07-11 21:58 ` Ian Rogers
2024-07-13 7:42 ` Ajay Kaher
3 siblings, 1 reply; 10+ messages in thread
From: Ian Rogers @ 2024-07-11 21:58 UTC (permalink / raw)
To: Ajay Kaher, chu howard
Cc: peterz, mingo, acme, namhyung, mark.rutland, rostedt,
alexander.shishkin, jolsa, adrian.hunter, kan.liang, yangjihong1,
zegao2021, leo.yan, asmadeus, siyanteng, sunhaiyong,
linux-perf-users, linux-kernel, alexey.makhalov,
vasavi.sirnapalli
On Thu, Jul 11, 2024 at 5:16 AM Ajay Kaher <ajay.kaher@broadcom.com> wrote:
>
> Add --off-cpu-kernel option to capture off-cpu sample alongwith on-cpu
> samples.
>
> off-cpu samples represent time spent by task when it was on wait queue
> (schedule out to waiting for events, blocked on I/O, locks, timers,
> paging/swapping, etc)
>
> Refer following links for more details:
> https://lpc.events/event/17/contributions/1556/
> https://www.youtube.com/watch?v=sF2faKGRnjs
Hi Ajay,
I wonder if Howard's improvements (not landed) for `perf record
--off-cpu` would solve this problem for you?
https://lore.kernel.org/lkml/20240424024805.144759-1-howardchu95@gmail.com/
Or is that approach problematic due to the use of BPF?
Thanks,
Ian
> Ajay Kaher (3):
> perf/core: add logic to collect off-cpu sample
> perf/record: add options --off-cpu-kernel
> perf/report: add off-cpu samples
>
> include/linux/perf_event.h | 16 ++++++++++++++
> include/uapi/linux/perf_event.h | 3 ++-
> kernel/events/core.c | 27 ++++++++++++++++++-----
> tools/include/uapi/linux/perf_event.h | 3 ++-
> tools/perf/builtin-record.c | 2 ++
> tools/perf/util/events_stats.h | 2 ++
> tools/perf/util/evsel.c | 4 ++++
> tools/perf/util/hist.c | 31 ++++++++++++++++++++++++---
> tools/perf/util/hist.h | 1 +
> tools/perf/util/record.h | 1 +
> tools/perf/util/sample.h | 1 +
> 11 files changed, 81 insertions(+), 10 deletions(-)
>
> --
> 2.39.0
>
* Re: [PATCH RFC 0/3] perf: add logic to collect off-cpu samples
2024-07-11 21:58 ` [PATCH RFC 0/3] perf: add logic to collect " Ian Rogers
@ 2024-07-13 7:42 ` Ajay Kaher
2024-07-14 18:32 ` Ian Rogers
0 siblings, 1 reply; 10+ messages in thread
From: Ajay Kaher @ 2024-07-13 7:42 UTC (permalink / raw)
To: Ian Rogers
Cc: chu howard, peterz, mingo, acme, namhyung, mark.rutland, rostedt,
alexander.shishkin, jolsa, adrian.hunter, kan.liang, yangjihong1,
zegao2021, leo.yan, asmadeus, siyanteng, sunhaiyong,
linux-perf-users, linux-kernel, alexey.makhalov,
vasavi.sirnapalli, Vamsi Krishna Brahmajosyula, nadav.amit
On Fri, Jul 12, 2024 at 3:28 AM Ian Rogers <irogers@google.com> wrote:
>
> On Thu, Jul 11, 2024 at 5:16 AM Ajay Kaher <ajay.kaher@broadcom.com> wrote:
> >
> > Add --off-cpu-kernel option to capture off-cpu sample alongwith on-cpu
> > samples.
> >
> > off-cpu samples represent time spent by task when it was on wait queue
> > (schedule out to waiting for events, blocked on I/O, locks, timers,
> > paging/swapping, etc)
> >
> > Refer following links for more details:
> > https://lpc.events/event/17/contributions/1556/
> > https://www.youtube.com/watch?v=sF2faKGRnjs
>
> Hi Ajay,
>
> I wonder if Howard's improvements (not landed) for `perf record
> --off-cpu` would solve this problem for you?
> https://lore.kernel.org/lkml/20240424024805.144759-1-howardchu95@gmail.com/
> Or is that approach problematic due to the use of BPF?
>
Thanks Ian for your response and sharing Howard's improvements.
Yes, perf --off-cpu is based on BPF and has the following restrictions:
- The target binary should be compiled with frame pointers, as mentioned
in tools/perf/Documentation/perf-record.txt:
Note that BPF can collect stack traces using frame pointer ("fp") only,
as of now. So the applications built without the frame pointer might see
bogus addresses.
- perf should be compiled with BUILD_BPF_SKEL=1:
Warning: option `off-cpu' is being ignored because no BUILD_BPF_SKEL=1
- off-cpu and on-cpu samples are not on the same result page.
(I guess Howard has improved this; I have not tried his patches.)
I have tried to collect off-cpu samples the same way as on-cpu samples, with
the help of kernel/events/core.c. We get one off-cpu sample for each interval
from the target task's sched-out to its next sched-in. In other words, off-cpu
samples do not depend on the frequency the user passes to perf record.
I am also worried about generating too many samples if the sched-in/out
frequency is high.
I am thinking of merging samples whose attributes (i.e. stack trace) are the
same, adding the off-cpu period to the previous sample with the same
attributes.
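To illustrate the merging idea, here is a minimal userspace sketch, not code from this series. It assumes each off-cpu sample can be reduced to a single `stack_id` (a stand-in for a hash of the callchain); `struct merge_table`, `merge_off_cpu_sample()` and `MAX_MERGED` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_MERGED 64	/* arbitrary table size for this sketch */

struct merged_sample {
	uint64_t stack_id;	/* assumed: hash identifying the stack trace */
	uint64_t period_ns;	/* accumulated off-cpu time */
	uint32_t count;		/* number of sched-out/in intervals merged */
};

struct merge_table {
	struct merged_sample slot[MAX_MERGED];
	size_t used;
};

/*
 * Add one off-cpu interval. If a sample with the same stack_id already
 * exists, its period is extended instead of creating a new sample.
 * Returns 0 on success, -1 if the table is full.
 */
int merge_off_cpu_sample(struct merge_table *t, uint64_t stack_id,
			 uint64_t period_ns)
{
	for (size_t i = 0; i < t->used; i++) {
		if (t->slot[i].stack_id == stack_id) {
			t->slot[i].period_ns += period_ns;
			t->slot[i].count++;
			return 0;
		}
	}

	if (t->used == MAX_MERGED)
		return -1;

	t->slot[t->used].stack_id = stack_id;
	t->slot[t->used].period_ns = period_ns;
	t->slot[t->used].count = 1;
	t->used++;
	return 0;
}
```

A linear scan keeps the sketch short; a real implementation would more likely use a hash table and would need to decide where the merging happens (kernel side or in the perf tool).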
-Ajay
> Thanks,
> Ian
>
> > Ajay Kaher (3):
> > perf/core: add logic to collect off-cpu sample
> > perf/record: add options --off-cpu-kernel
> > perf/report: add off-cpu samples
> >
> > include/linux/perf_event.h | 16 ++++++++++++++
> > include/uapi/linux/perf_event.h | 3 ++-
> > kernel/events/core.c | 27 ++++++++++++++++++-----
> > tools/include/uapi/linux/perf_event.h | 3 ++-
> > tools/perf/builtin-record.c | 2 ++
> > tools/perf/util/events_stats.h | 2 ++
> > tools/perf/util/evsel.c | 4 ++++
> > tools/perf/util/hist.c | 31 ++++++++++++++++++++++++---
> > tools/perf/util/hist.h | 1 +
> > tools/perf/util/record.h | 1 +
> > tools/perf/util/sample.h | 1 +
> > 11 files changed, 81 insertions(+), 10 deletions(-)
> >
> > --
> > 2.39.0
> >
* Re: [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample
2024-07-11 21:49 ` Peter Zijlstra
@ 2024-07-14 16:23 ` Ajay Kaher
2024-07-15 11:48 ` Peter Zijlstra
0 siblings, 1 reply; 10+ messages in thread
From: Ajay Kaher @ 2024-07-14 16:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, mark.rutland, rostedt, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, yangjihong1, zegao2021,
leo.yan, asmadeus, siyanteng, sunhaiyong, linux-perf-users,
linux-kernel, alexey.makhalov, vasavi.sirnapalli,
Vamsi Krishna Brahmajosyula, nadav.amit
On Fri, Jul 12, 2024 at 3:19 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 11, 2024 at 05:46:17PM +0530, Ajay Kaher wrote:
> > following logics has been added to collect the off-cpu sample:
> >
> > - 'task_pt_regs(current)' has been used to capture registers
> > status off-cpu sample.
> >
> > - off-cpu time represent the time period for which the target
> > process not occupying the cpu cycles. And calculate as:
> >
> > off-cpu time = swap-in time - swap-out time
> >
>
> I have absolutely no idea what you're trying to do :/ The above does not
> constitute a comprehensible Changelog.
Sorry Peter, it’s sched-in/out (not swap-in/out).
'perf record' captures on-cpu samples at the frequency specified by
the user (i.e. the sampling period is NSEC_PER_SEC / freq).
This patch collects off-cpu samples. The period of an off-cpu
sample is the time between the target task's sched-out and its next
sched-in:
off-cpu time period = sched-in time - sched-out time
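The two periods can be written out as a small userspace sketch. This is only an illustration of the arithmetic, not the kernel code; `on_cpu_period_ns()` and `off_cpu_period_ns()` are hypothetical helper names, and the guards against a zero frequency and a reversed timestamp pair are assumptions added for safety.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* On-cpu sampling period in ns for a user-supplied frequency in Hz. */
uint64_t on_cpu_period_ns(uint64_t freq_hz)
{
	/* A zero frequency has no meaningful period. */
	return freq_hz ? NSEC_PER_SEC / freq_hz : 0;
}

/* Off-cpu period: time between sched-out and the next sched-in. */
uint64_t off_cpu_period_ns(uint64_t sched_out_ns, uint64_t sched_in_ns)
{
	/* Assumed guard: never report a negative interval. */
	return sched_in_ns > sched_out_ns ? sched_in_ns - sched_out_ns : 0;
}
```

So the on-cpu period is fixed by the requested frequency, while each off-cpu period is whatever the scheduler happened to produce for that interval.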
-Ajay
* Re: [PATCH RFC 0/3] perf: add logic to collect off-cpu samples
2024-07-13 7:42 ` Ajay Kaher
@ 2024-07-14 18:32 ` Ian Rogers
0 siblings, 0 replies; 10+ messages in thread
From: Ian Rogers @ 2024-07-14 18:32 UTC (permalink / raw)
To: Ajay Kaher
Cc: chu howard, peterz, mingo, acme, namhyung, mark.rutland, rostedt,
alexander.shishkin, jolsa, adrian.hunter, kan.liang, yangjihong1,
zegao2021, leo.yan, asmadeus, siyanteng, sunhaiyong,
linux-perf-users, linux-kernel, alexey.makhalov,
vasavi.sirnapalli, Vamsi Krishna Brahmajosyula, nadav.amit
On Sat, Jul 13, 2024 at 12:43 AM Ajay Kaher <ajay.kaher@broadcom.com> wrote:
>
> On Fri, Jul 12, 2024 at 3:28 AM Ian Rogers <irogers@google.com> wrote:
> >
> > On Thu, Jul 11, 2024 at 5:16 AM Ajay Kaher <ajay.kaher@broadcom.com> wrote:
> > >
> > > Add --off-cpu-kernel option to capture off-cpu sample alongwith on-cpu
> > > samples.
> > >
> > > off-cpu samples represent time spent by task when it was on wait queue
> > > (schedule out to waiting for events, blocked on I/O, locks, timers,
> > > paging/swapping, etc)
> > >
> > > Refer following links for more details:
> > > https://lpc.events/event/17/contributions/1556/
> > > https://www.youtube.com/watch?v=sF2faKGRnjs
> >
> > Hi Ajay,
> >
> > I wonder if Howard's improvements (not landed) for `perf record
> > --off-cpu` would solve this problem for you?
> > https://lore.kernel.org/lkml/20240424024805.144759-1-howardchu95@gmail.com/
> > Or is that approach problematic due to the use of BPF?
> >
>
> Thanks Ian for your response and sharing Howard's improvements.
>
> Yes, perf --off-cpu is based upon BPF and having following restrictions:
>
> - target binary should be compiled with frame pointer, same mentioned
> in tools/perf/Documentation/perf-record.txt:
> Note that BPF can collect stack traces using frame pointer ("fp") only,
> as of now. So the applications built without the frame pointer might see
> bogus addresses.
Agreed, if you want more than the leaf function. There's some evidence
that frame pointers may become the default:
https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
> - perf should be complied with BUILD_BPF_SKEL=1:
> Warning: option `off-cpu' is being ignored because no BUILD_BPF_SKEL=1
We've made this the default build behavior but probably a bigger issue
is that generally to have permissions you need to run as root.
> - off-cpu, on-cpu samples are not on the same result page.
> (I guess Howard has improve this, not tried his patches)
>
> I have tried to collect the off-cpu sample same as on-cpu sample with the help
> of kernel/events/core.c. We will get one off-cpu sample from the target task
> sched-out to sched-in. Or we can say off-cpu samples are not dependent on
> frequency provided by the user to perf record.
>
> I am also worried about having so many samples if sched-in/out
> frequency is high.
> Thinking to merge samples if attributes are the same (i.e. stacktrace)
> and add the
> off-cpu period to previous samples with the same attribute.
Right, this would be useful to look at in the context of Howard's
patches. The proposal in his changes is that short off-CPU times are
aggregated in a BPF map and dumped at the end of perf record - this
matches the existing perf record off-CPU behavior. Longer off-CPU
times, where "long" is open for debate and will be a parameter (perhaps
100 microseconds), cause a sample to be created. BPF programs
create BPF output events; Howard's last patch series would rewrite the
events in perf record to be off-CPU samples. At the last office hours
it was discussed that we should dump the BPF output events directly,
to keep perf record overhead minimal, and add the ability to rewrite
them into samples in tools like perf report, etc.
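The threshold split described above can be sketched in plain C. This is not Howard's BPF code; `struct off_cpu_acc` and `off_cpu_account()` are hypothetical names, and treating an interval equal to the threshold as "long" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

struct off_cpu_acc {
	uint64_t threshold_ns;	/* tunable, e.g. 100 us as floated above */
	uint64_t aggregated_ns;	/* sum of short intervals, dumped at the end */
	uint64_t nr_aggregated;	/* number of short intervals folded in */
	uint64_t nr_emitted;	/* number of long intervals emitted as samples */
};

/*
 * Account one off-CPU interval. Returns true when the interval is long
 * enough to be emitted as its own sample, false when it was only added
 * to the aggregate.
 */
bool off_cpu_account(struct off_cpu_acc *acc, uint64_t off_ns)
{
	if (off_ns >= acc->threshold_ns) {
		acc->nr_emitted++;
		return true;
	}

	acc->aggregated_ns += off_ns;
	acc->nr_aggregated++;
	return false;
}
```

With a 100 microsecond threshold, a 5 microsecond sleep only grows the aggregate, while a 250 microsecond block would produce a sample.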
Thanks,
Ian
> -Ajay
>
> > Thanks,
> > Ian
> >
> > > Ajay Kaher (3):
> > > perf/core: add logic to collect off-cpu sample
> > > perf/record: add options --off-cpu-kernel
> > > perf/report: add off-cpu samples
> > >
> > > include/linux/perf_event.h | 16 ++++++++++++++
> > > include/uapi/linux/perf_event.h | 3 ++-
> > > kernel/events/core.c | 27 ++++++++++++++++++-----
> > > tools/include/uapi/linux/perf_event.h | 3 ++-
> > > tools/perf/builtin-record.c | 2 ++
> > > tools/perf/util/events_stats.h | 2 ++
> > > tools/perf/util/evsel.c | 4 ++++
> > > tools/perf/util/hist.c | 31 ++++++++++++++++++++++++---
> > > tools/perf/util/hist.h | 1 +
> > > tools/perf/util/record.h | 1 +
> > > tools/perf/util/sample.h | 1 +
> > > 11 files changed, 81 insertions(+), 10 deletions(-)
> > >
> > > --
> > > 2.39.0
> > >
* Re: [PATCH RFC 1/3] perf/core: add logic to collect off-cpu sample
2024-07-14 16:23 ` Ajay Kaher
@ 2024-07-15 11:48 ` Peter Zijlstra
0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2024-07-15 11:48 UTC (permalink / raw)
To: Ajay Kaher
Cc: mingo, acme, namhyung, mark.rutland, rostedt, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, yangjihong1, zegao2021,
leo.yan, asmadeus, siyanteng, sunhaiyong, linux-perf-users,
linux-kernel, alexey.makhalov, vasavi.sirnapalli,
Vamsi Krishna Brahmajosyula, nadav.amit
On Sun, Jul 14, 2024 at 09:53:10PM +0530, Ajay Kaher wrote:
> On Fri, Jul 12, 2024 at 3:19 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Jul 11, 2024 at 05:46:17PM +0530, Ajay Kaher wrote:
> > > following logics has been added to collect the off-cpu sample:
> > >
> > > - 'task_pt_regs(current)' has been used to capture registers
> > > status off-cpu sample.
> > >
> > > - off-cpu time represent the time period for which the target
> > > process not occupying the cpu cycles. And calculate as:
> > >
> > > off-cpu time = swap-in time - swap-out time
> > >
> >
> > I have absolutely no idea what you're trying to do :/ The above does not
> > constitute a comprehensible Changelog.
>
> Sorry Peter, it’s sched-in/out (not swap-in/out).
>
> 'Perf record' captures on-cpu samples at frequency which is specified by
> the user ( i.e. time period to collect sample is NSEC_PER_SEC / freq).
>
> This patch is to collect the off_cpu sample and time period of off_cpu
> sample is calculated based upon the time when target task was sched_out
> to sched_in, as:
>
> off-cpu time period = sched_in time - sched-out time
But but but... you don't need anything new for that. The sched_wakeup
tracepoint should generate an event in both the task/cpu that does the
wakeup *AND* the task being woken.
So sched_switch + sched_wakeup should get you all this already. What am
I missing?
https://lore.kernel.org/all/1342016098-213063-1-git-send-email-avagin@openvz.org/T/#u
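As a rough illustration of deriving off-cpu time from existing events, the sketch below post-processes a list of sched_switch-like records. It is a simplified model, not perf code: `struct sched_switch_ev` keeps only a timestamp plus `prev_pid` and `next_pid`, `total_off_cpu_ns()` is a hypothetical helper, and sched_wakeup (which would further split an interval into blocked time and run-queue wait) is deliberately ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a sched_switch record (assumed layout). */
struct sched_switch_ev {
	uint64_t ts_ns;		/* event timestamp */
	int prev_pid;		/* task being switched out */
	int next_pid;		/* task being switched in */
};

/*
 * Sum the time @pid spent off-cpu, given events sorted by timestamp.
 * An interval starts when @pid is prev_pid and ends when it next
 * appears as next_pid.
 */
uint64_t total_off_cpu_ns(const struct sched_switch_ev *ev, size_t n, int pid)
{
	uint64_t total = 0, out_ts = 0;
	int is_off = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i].prev_pid == pid) {
			out_ts = ev[i].ts_ns;
			is_off = 1;
		} else if (ev[i].next_pid == pid && is_off) {
			total += ev[i].ts_ns - out_ts;
			is_off = 0;
		}
	}
	return total;
}
```

An interval that is still open when the trace ends is not counted in this sketch.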