* Profiling sleep times?
From: Arun Sharma @ 2011-10-03 19:38 UTC
To: linux-perf-users
Cc: acme, Peter Zijlstra, mingo, Stephane Eranian

Some of our users want to use perf to profile not just the code that
consumes cycles, but also the code that ends up waiting for I/O -
otherwise known as wall clock profiling. I could not find a way of
getting this info from the perf tool as-is, and am wondering whether a
software event such as PERF_COUNT_SW_SLEEP_CLOCK (below) makes sense.

The idea is that if a task sleeps for 1 ms, it should generate 1000x
more samples than a task that sleeps for 1 us. Also, the callchain
emitted should be the user stack.

If such an event is useful to a larger set of users, I could try to
work out the details of how to get to event->attr.freq in the context
switch path with low overhead, and run some tests to verify that the
profile that comes out of "perf report" looks sane. We'll also need a
way of combining PERF_COUNT_SW_TASK_CLOCK and PERF_COUNT_SW_SLEEP_CLOCK
(in userspace?) to get the full picture.

 -Arun

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c2da40d..a3e2fb4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -106,6 +106,7 @@ enum perf_sw_ids {
 	PERF_COUNT_SW_PAGE_FAULTS_MAJ	= 6,
 	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
 	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
+	PERF_COUNT_SW_SLEEP_CLOCK	= 9,
 	PERF_COUNT_SW_MAX,		/* non-ABI */
 };

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7406f36..e973862 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -877,8 +877,10 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		se->statistics.sum_sleep_runtime += delta;

 		if (tsk) {
+			u64 freq = 1000000; /* XXX: Use event->attr.freq ? */
 			account_scheduler_latency(tsk, delta >> 10, 1);
 			trace_sched_stat_sleep(tsk, delta);
+			perf_sw_event(PERF_COUNT_SW_SLEEP_CLOCK, delta/freq, 0, NULL, 0);
 		}
 	}
 	if (se->statistics.block_start) {
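For reference, a minimal sketch of how the proposed counter might be
opened from user space. PERF_COUNT_SW_SLEEP_CLOCK is the value from the
patch above, not a mainline constant, so this assumes a kernel and
headers carrying that patch:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#ifndef PERF_COUNT_SW_SLEEP_CLOCK
#define PERF_COUNT_SW_SLEEP_CLOCK 9	/* proposed id from the patch */
#endif

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_SLEEP_CLOCK;
	attr.freq = 1;			/* sample at a fixed frequency */
	attr.sample_freq = 1000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;

	/* profile self on any CPU; glibc has no wrapper for this call */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	/* ... mmap the fd and read samples, as perf record does ... */
	close(fd);
	return 0;
}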
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-03 20:17 UTC
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian

On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
> 			trace_sched_stat_sleep(tsk, delta);

That one should be accessible as a tracepoint, and will add the delay
to the count on each occurrence.
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-03 21:53 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian

On 10/3/11 1:17 PM, Peter Zijlstra wrote:
> On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
>> trace_sched_stat_sleep(tsk, delta);
>
> That one should be accessible as a tracepoint, and will add the delay
> to the count on each occurrence.

Right - the tracepoint seems to work. I was looking for the user stack
trace as well. For example:

# cat foo.c
#include <unistd.h>
#include <time.h>
#include <sys/select.h>

int main(void)
{
	struct timespec ts1;
	struct timeval tv1;
	int i;

	for (i = 0; i < 1000; i++) {
		ts1.tv_sec = 0;
		ts1.tv_nsec = 1000000;	/* 1 ms */
		nanosleep(&ts1, NULL);

		tv1.tv_sec = 0;
		tv1.tv_usec = 4000;	/* 4 ms */
		select(0, NULL, NULL, NULL, &tv1);
	}
	return 0;
}

I want something that gives me four times as many samples on select()
as on nanosleep().

perf record -g -e sched:sched_stat_sleep -- ./foo

doesn't seem to do it. Additionally, if I have a burn_1us_cpu() loop in
there, I'd expect to see as many samples from it as from nanosleep() in
the wall clock profile.

 -Arun
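burn_1us_cpu() above is not defined in the mail; a minimal sketch of
such a busy-wait helper (a hypothetical name, spinning on
clock_gettime(); link with -lrt on older glibc) might be:

#include <time.h>

static void burn_1us_cpu(void)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000LL +
		 (now.tv_nsec - start.tv_nsec) < 1000);
}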
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-04 8:34 UTC
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian

On Mon, 2011-10-03 at 14:53 -0700, Arun Sharma wrote:
> Right - the tracepoint seems to work. I was looking for the user stack
> trace as well. For example:
>
> # cat foo.c
> #include <unistd.h>
> #include <time.h>
> #include <sys/select.h>
>
> int main(void)
> {
> 	struct timespec ts1;
> 	struct timeval tv1;
> 	int i;
>
> 	for (i = 0; i < 1000; i++) {
> 		ts1.tv_sec = 0;
> 		ts1.tv_nsec = 1000000;
> 		nanosleep(&ts1, NULL);
>
> 		tv1.tv_sec = 0;
> 		tv1.tv_usec = 4000;
> 		select(0, NULL, NULL, NULL, &tv1);
> 	}
> 	return 0;
> }
>
> I want something that gives me four times as many samples on select()
> as on nanosleep().
>
> perf record -g -e sched:sched_stat_sleep -- ./foo
>
> doesn't seem to do it. Additionally, if I have a burn_1us_cpu() loop in
> there, I'd expect to see as many samples from it as from nanosleep() in
> the wall clock profile.

Would you per-chance be suffering from this:

http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-06 21:56 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

On 10/4/11 1:34 AM, Peter Zijlstra wrote:
> Would you per-chance be suffering from this:
>
> http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org

Thanks for the pointer. Yes - Andrew's patches help. But it looks like
we need more user-space plumbing, as Frederic noted.

perf record -ge sched:sched_stat_sleep -- ./foo

doesn't quite work.

perf record -age sched:sched_stat_sleep -- ./foo

gives me:

    58.62%  foo  [unknown]  [k] 0
            |
            --- schedule
               |
               |--54.99%-- schedule_hrtimeout_range_clock
               |           schedule_hrtimeout_range
               |           poll_schedule_timeout
               |           do_select
               |           core_sys_select
               |           sys_select
               |           system_call_fastpath
               |
               |--44.81%-- do_nanosleep
               |           hrtimer_nanosleep
               |           sys_nanosleep
               |           system_call_fastpath
                --0.20%-- [...]

i.e. select() should be weighted 4x vs nanosleep(), as confirmed by the
raw delays:

perf script | grep comm=foo

foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]

Andrew, are you already working on user space patches?

 -Arun
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-07 0:05 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

[-- Attachment #1: Type: text/plain, Size: 1979 bytes --]

On 10/6/11 2:56 PM, Arun Sharma wrote:
> i.e. select() should be weighted 4x vs nanosleep(), as confirmed by
> the raw delays:
>
> perf script | grep comm=foo
>
> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
>
> Andrew, are you already working on user space patches?

Attached is a quick hack (mainly meant as an RFC to get the discussion
going). After the patch I get the expected result:

     0.16%  foo  [unknown]  [k] 0
            |
            --- schedule
               |
               |--79.26%-- schedule_hrtimeout_range_clock
               |           schedule_hrtimeout_range
               |           poll_schedule_timeout
               |           do_select
               |           core_sys_select
               |           sys_select
               |           system_call_fastpath
               |
               |--16.87%-- do_nanosleep
               |           hrtimer_nanosleep
               |           sys_nanosleep
               |           system_call_fastpath
               |
                --3.88%-- pipe_wait
                          pipe_read
                          do_sync_read
                          vfs_read
                          sys_read
                          system_call_fastpath

We still have the issue that per-process profiling doesn't work.

 -Arun

[-- Attachment #2: 0001-Use-delay-instead-of-sample-period.patch --]
[-- Type: text/plain, Size: 3309 bytes --]

From a2d6e0698eebb038b2d8e845137008dc3478cc51 Mon Sep 17 00:00:00 2001
From: Arun Sharma <asharma@fb.com>
Date: Thu, 6 Oct 2011 16:48:44 -0700
Subject: [PATCH] Use delay instead of sample->period

---
 tools/perf/builtin-report.c         |    6 ++++++
 tools/perf/util/probe-finder.c      |    6 ++++--
 tools/perf/util/trace-event-parse.c |   20 ++++++++++++++++++++
 tools/perf/util/trace-event.h       |    1 +
 4 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d7ff277..6821597 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -25,6 +25,7 @@
 #include "util/evsel.h"
 #include "util/header.h"
 #include "util/session.h"
+#include "util/trace-event.h"

 #include "util/parse-options.h"
 #include "util/parse-events.h"
@@ -111,6 +112,11 @@ static int process_sample_event(union perf_event *event,
 				struct perf_session *session)
 {
 	struct addr_location al;
+	struct perf_event_attr *attr = &evsel->attr;
+
+	if (attr->type == PERF_TYPE_TRACEPOINT) {
+		sample->period = trace_get_delay(sample->raw_data, sample->raw_size);
+	}

 	if (perf_event__preprocess_sample(event, session, &al, sample,
 					  annotate_init) < 0) {
diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index 5d73262..e7179ff 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -19,9 +19,11 @@
  *
  */

+#define _GNU_SOURCE
 #include <sys/utsname.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <ctype.h>
 #include <fcntl.h>
 #include <errno.h>
 #include <stdio.h>
@@ -30,13 +32,13 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdarg.h>
-#include <ctype.h>
 #include <dwarf-regs.h>
+#undef _GNU_SOURCE

 #include <linux/bitops.h>
+#include "util.h"
 #include "event.h"
 #include "debug.h"
-#include "util.h"
 #include "symbol.h"
 #include "probe-finder.h"
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 0a7ed5b..78b7194 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -2944,6 +2944,26 @@ void print_trace_event(int cpu, void *data, int size)
 	pretty_print(data, size, event);
 }

+size_t trace_get_delay(void *data, int size __unused)
+{
+	struct event *event;
+	int type;
+	struct format_field *field;
+
+	type = trace_parse_common_type(data);
+
+	event = trace_find_event(type);
+	if (!event) {
+		warning("ug! no event found for type %d", type);
+		return 0;
+	}
+
+	field = find_field(event, "delay");
+	if (!field)
+		die("can't find delay field in trace");
+	return read_size(data + field->offset, field->size);
+}
+
 static void print_fields(struct print_flag_sym *field)
 {
 	printf("{ %s, %s }", field->value, field->str);
diff --git a/tools/perf/util/trace-event.h b/tools/perf/util/trace-event.h
index f674dda..587b553 100644
--- a/tools/perf/util/trace-event.h
+++ b/tools/perf/util/trace-event.h
@@ -264,6 +264,7 @@ unsigned long long eval_flag(const char *flag);

 int read_tracing_data(int fd, struct list_head *pattrs);
 ssize_t read_tracing_data_size(int fd, struct list_head *pattrs);
+size_t trace_get_delay(void *data, int size);

 /* taken from kernel/trace/trace.h */
 enum trace_flag_type {
--
1.7.4
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-07 1:30 UTC
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

On Thu, 2011-10-06 at 14:56 -0700, Arun Sharma wrote:
> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]

But the idea of the __perf_count() thing:

DECLARE_EVENT_CLASS(sched_stat_template,

	TP_PROTO(struct task_struct *tsk, u64 delay),

	TP_ARGS(tsk, delay),

	TP_STRUCT__entry(
		__array( char,	comm,	TASK_COMM_LEN	)
		__field( pid_t,	pid			)
		__field( u64,	delay			)
	),

	TP_fast_assign(
		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
		__entry->pid	= tsk->pid;
		__entry->delay	= delay;
	)
	TP_perf_assign(
		__perf_count(delay);
	),

	TP_printk("comm=%s pid=%d delay=%Lu [ns]",
			__entry->comm, __entry->pid,
			(unsigned long long)__entry->delay)
);

is that the counter is incremented with the delay, so the event should
get weighted right.

So having to get the delay out of the raw tracepoint data shouldn't be
needed.
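To make that weighting concrete, here is a self-contained toy model of
the accounting (an illustration of the idea only, not the kernel's
actual code; the 1 ms period and the 1 ms/4 ms sleeps mirror the foo
test program above):

#include <stdio.h>

/* Each tracepoint hit adds 'delay' to the count instead of 1, so with
 * a sample period P, a sleep that is 4x longer crosses the period 4x
 * as often and yields 4x the samples. */
struct sw_event {
	unsigned long long period, period_left, samples;
};

static void hit(struct sw_event *e, unsigned long long delay)
{
	while (delay >= e->period_left) {	/* may overflow repeatedly */
		delay -= e->period_left;
		e->period_left = e->period;
		e->samples++;
	}
	e->period_left -= delay;
}

int main(void)
{
	struct sw_event e = { 1000000, 1000000, 0 };	/* P = 1 ms */
	int i;

	for (i = 0; i < 1000; i++) {
		hit(&e, 1000000);	/* nanosleep: 1 ms */
		hit(&e, 4000000);	/* select:    4 ms */
	}
	/* prints 5000: 1000 from nanosleep, 4000 from select (4:1) */
	printf("samples=%llu\n", e.samples);
	return 0;
}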
* Re: Profiling sleep times?
From: avagin @ 2011-10-07 5:42 UTC
To: Peter Zijlstra
Cc: Arun Sharma, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

On 10/07/2011 05:30 AM, Peter Zijlstra wrote:
> But the idea of the __perf_count() thing:
> [...]
> is that the counter is incremented with the delay, so the event should
> get weighted right.
>
> So having to get the delay out of the raw tracepoint data shouldn't be
> needed.

__perf_count() doesn't work right now, and I have sent the patch which
fixes it. Could you commit it? Its subject is:

[PATCH 1/4] perf: fix counter of ftrace events

But that's not all. By default perf doesn't use SAMPLE_PERIOD; it bases
everything on the event count. Each "trace" event is split into several
"perf" events in perf_swevent_overflow(). The number of perf events
should be proportional to __perf_count(), but it is capped at
sysctl_perf_event_sample_rate / HZ. In my case it's 100 events.

	....
	if (hwc->interrupts != MAX_INTERRUPTS) {
		hwc->interrupts++;
		if (HZ * hwc->interrupts >
				(u64)sysctl_perf_event_sample_rate) {
	....

So Arun probably has a perf.data which contains batches of events, and
each batch contains 100 events irrespective of the "delay".
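To put numbers on that limit (assuming the then-default
sysctl_perf_event_sample_rate of 100000 and HZ = 1000; both values are
configuration-dependent):

	max events per tick = sysctl_perf_event_sample_rate / HZ
	                    = 100000 / 1000
	                    = 100

So a 1 ms sleep and a 4 ms sleep can both end up capped at the same
number of synthesized events, which flattens the intended 4:1 weighting
in the report.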
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-07 9:33 UTC
To: avagin
Cc: Arun Sharma, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

On Fri, 2011-10-07 at 09:42 +0400, avagin@gmail.com wrote:
> Could you commit it? Its subject is:
> [PATCH 1/4] perf: fix counter of ftrace events

tip/master:

commit 92e51938f5d005026ba4bb5b1fae5a86dc195b86
Author: Andrew Vagin <avagin@openvz.org>
Date:   Mon Sep 26 19:55:32 2011 +0400

    perf: Fix counter of ftrace events
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-07 17:58 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

On 10/6/11 6:30 PM, Peter Zijlstra wrote:
>
> But the idea of the __perf_count() thing:
> [..]
> 	TP_perf_assign(
> 		__perf_count(delay);
> 	),
>
> 	TP_printk("comm=%s pid=%d delay=%Lu [ns]",
> 			__entry->comm, __entry->pid,
> 			(unsigned long long)__entry->delay)
> );
>
> is that the counter is incremented with the delay, so the event should
> get weighted right.
>
> So having to get the delay out of the raw tracepoint data shouldn't be
> needed.

How does this work for dynamic tracepoints? Or for tracepoints with
multiple dimensions, where the user might want to query each dimension
separately?

Andrew: what do you think about generalizing my patch to accept a
command line option(s) to specify which fields to use for the purpose
of computing the histogram?

For static tracepoints, TP_perf_assign() could act as a hint on which
field to default to.

 -Arun
* Re: Profiling sleep times?
From: avagin @ 2011-10-07 23:16 UTC
To: Arun Sharma
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

Hello Arun,

> Andrew: what do you think about generalizing my patch to accept a
> command line option(s) to specify which fields to use for the purpose
> of computing the histogram?

I have no objection, but I don't think we really need that. We don't
have a real use case yet. All the events I've seen have at most one
parameter that can be used as a weight, and for that one we already
have the "period" parameter.

As I already said, the trouble you're having with sched_stat_sleep is
due to two issues in the kernel. The first issue is that __perf_count()
doesn't work, and I sent the patch which fixes it ([PATCH] perf: fix
counter of ftrace events). The second issue is that each trace event is
split into several perf events, and their number is restricted by
sysctl_perf_event_sample_rate / HZ.

I'm not sure we should generate more than one "perf" event for each
"trace" event. I think the better way is to use SAMPLE_PERIOD, and I'm
now thinking in that direction. If you want to fix the bug with
sched_stat_sleep, you need to fix the second issue.

I found a workaround for your case:

# ./perf record -age sched:sched_stat_sleep --filter="comm == foo" \
	-c 100000 -F 100000 -- ~/foo
# ./perf report
# Events: 5K sched:sched_stat_sleep
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
    99.98%  foo  [unknown]  [k] 0
            |
            --- schedule
               |
               |--80.20%-- schedule_hrtimeout_range_clock
               |           schedule_hrtimeout_range
               |           poll_schedule_timeout
               |           do_select
               |           core_sys_select
               |           sys_select
               |           system_call_fastpath
               |
                --19.80%-- do_nanosleep
                           hrtimer_nanosleep
                           sys_nanosleep
                           system_call_fastpath

> For static tracepoints, TP_perf_assign() could act as a hint on which
> field to default to.
>
> -Arun
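As a sanity check on that workaround (assuming the __perf_count() fix
is applied, so each sample accounts for 100000 ns of sleep):

	nanosleep: 1000000 ns / 100000 ns = 10 samples per iteration
	select:    4000000 ns / 100000 ns = 40 samples per iteration

	40 / (40 + 10) = 80%    10 / (40 + 10) = 20%

which matches the 80.20% / 19.80% split in the report above.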
* Re: Profiling sleep times?
From: avagin @ 2011-10-08 1:45 UTC
To: Arun Sharma
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 416 bytes --]

> Andrew, are you already working on user space patches?

Yes, I'm working on it. I've attached a draft version. Example usage:

# ./perf record -ag -e sched:sched_switch --filter "prev_state == 1" \
	-e sched:sched_process_exit \
	-e sched:sched_stat_sleep --filter "comm == foo" ~/foo
# ./perf inject -s -i perf.data -o perf.data.d
# ./perf report -i perf.data.d

I would be glad to receive any comments.

> -Arun

[-- Attachment #2: 0001-perf-teach-perf-inject-to-work-with-files.patch --]
[-- Type: text/plain, Size: 2745 bytes --]

From c2cd0d73c11687fbc67884aba337f497e13890f9 Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Tue, 4 Oct 2011 16:10:23 +0400
Subject: [PATCH 1/3] perf: teach "perf inject" to work with files

Before this patch "perf inject" could only handle data from a pipe.
I want to use "perf inject" for reworking events; see the following
patch.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 tools/perf/builtin-inject.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 8dfc12b..8df8b71 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -13,7 +13,12 @@

 #include "util/parse-options.h"

-static char const *input_name = "-";
+static char const *input_name = "-";
+static const char *output_name = "-";
+static int pipe_output = 0;
+static int output;
+static u64 bytes_written = 0;
+
 static bool inject_build_ids;

 static int perf_event__repipe_synth(union perf_event *event,
@@ -25,12 +30,14 @@ static int perf_event__repipe_synth(union perf_event *event,
 	size = event->header.size;

 	while (size) {
-		int ret = write(STDOUT_FILENO, buf, size);
+		int ret = write(output, buf, size);
 		if (ret < 0)
 			return -errno;

 		size -= ret;
 		buf += ret;
+
+		bytes_written += ret;
 	}

 	return 0;
@@ -213,8 +220,14 @@ static int __cmd_inject(void)
 	if (session == NULL)
 		return -ENOMEM;

+	if (!pipe_output)
+		lseek(output, session->header.data_offset, SEEK_SET);
 	ret = perf_session__process_events(session, &inject_ops);

+	if (!pipe_output) {
+		session->header.data_size += bytes_written;
+		perf_session__write_header(session, session->evlist, output, true);
+	}
 	perf_session__delete(session);

 	return ret;
@@ -228,6 +241,10 @@ static const char * const report_usage[] = {
 static const struct option options[] = {
 	OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
 		    "Inject build-ids into the output stream"),
+	OPT_STRING('i', "input", &input_name, "file",
+		   "input file name"),
+	OPT_STRING('o', "output", &output_name, "file",
+		   "output file name"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show build ids, etc)"),
 	OPT_END()
@@ -243,6 +260,18 @@ int cmd_inject(int argc, const char **argv, const char *prefix __used)
 	if (argc)
 		usage_with_options(report_usage, options);

+	if (!strcmp(output_name, "-")) {
+		pipe_output = 1;
+		output = STDOUT_FILENO;
+	} else {
+		output = open(output_name, O_CREAT | O_WRONLY | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+		if (output < 0) {
+			perror("failed to create output file");
+			exit(-1);
+		}
+	}
+
 	if (symbol__init() < 0)
 		return -1;
--
1.7.1

[-- Attachment #3: 0003-perf-add-scripts-for-collecting-D-state-statistics.patch --]
[-- Type: text/plain, Size: 1420 bytes --]

From e08809ba075d92e60f669fc62c48128e06c834fb Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Thu, 6 Oct 2011 12:18:44 +0400
Subject: [PATCH 3/3] perf: add scripts for collecting D-state statistics

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 .../perf/scripts/python/bin/task-in-d-state-record |    2 ++
 .../perf/scripts/python/bin/task-in-d-state-report |    6 ++++++
 2 files changed, 8 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/scripts/python/bin/task-in-d-state-record
 create mode 100644 tools/perf/scripts/python/bin/task-in-d-state-report

diff --git a/tools/perf/scripts/python/bin/task-in-d-state-record b/tools/perf/scripts/python/bin/task-in-d-state-record
new file mode 100644
index 0000000..d70bed0
--- /dev/null
+++ b/tools/perf/scripts/python/bin/task-in-d-state-record
@@ -0,0 +1,2 @@
+#!/bin/bash
+perf record -ag -e sched:sched_switch --filter "prev_state == 2" -e sched:sched_stat_iowait -e sched:sched_process_exit $@
diff --git a/tools/perf/scripts/python/bin/task-in-d-state-report b/tools/perf/scripts/python/bin/task-in-d-state-report
new file mode 100644
index 0000000..f1ab71e
--- /dev/null
+++ b/tools/perf/scripts/python/bin/task-in-d-state-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: D-state statistics
+# args:
+perf inject -s -i perf.data -o perf.data.d || exit
+perf report -i perf.data.d || exit
+unlink perf.data.d
--
1.7.1

[-- Attachment #4: 0002-perf-teach-perf-inject-to-merge-sched_stat_-and-sche.patch --]
[-- Type: text/plain, Size: 4355 bytes --]

From 7a3152c03d07d30f47ab4fe9295212330bf821e4 Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Tue, 4 Oct 2011 16:54:15 +0400
Subject: [PATCH 2/3] perf: teach perf inject to merge sched_stat_* and sched_switch events

You may want to know where and for how long a task is sleeping. A
callchain may be found in sched_switch and a time slice in
sched_stat_iowait, so I added a handler to perf inject for merging
these events.

The code saves the sched_switch event for each process, and when it
meets a stat_iowait event it reports the saved sched_switch event
instead, because that one contains the correct callchain. In other
words, it replaces all stat_iowait events with the proper sched_switch
events. The code doesn't modify the events themselves.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 tools/perf/builtin-inject.c |   87 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 8df8b71..10bdd65 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -12,6 +12,8 @@
 #include "util/debug.h"

 #include "util/parse-options.h"
+
+#include "util/trace-event.h"

 static char const *input_name = "-";
 static const char *output_name = "-";
@@ -20,6 +22,7 @@ static int output;
 static u64 bytes_written = 0;

 static bool inject_build_ids;
+static bool inject_sched_stat;

 static int perf_event__repipe_synth(union perf_event *event,
 				    struct perf_session *session __used)
@@ -179,6 +182,85 @@ repipe:
 	return 0;
 }

+struct event_entry
+{
+	struct list_head list;
+	u32 pid;
+	struct perf_event_header header;
+};
+
+#define ENT_SIZE(size) ((size) + offsetof(event_entry, header))
+
+static LIST_HEAD(samples);
+
+static int perf_event__sched_stat(union perf_event *event,
+				  struct perf_sample *sample,
+				  struct perf_evsel *evsel __used,
+				  struct perf_session *session)
+{
+	int type;
+	struct event *e;
+	const char *evname = NULL;
+	uint32_t size;
+	void *buf = event;
+	struct event_entry *ent;
+	union perf_event *event_sw = NULL;
+	struct perf_sample sample_sw;
+	int sched_process_exit;
+
+	size = event->header.size;
+
+	type = trace_parse_common_type(sample->raw_data);
+	e = trace_find_event(type);
+	if (e)
+		evname = e->name;
+
+	sched_process_exit = !strcmp(evname, "sched_process_exit");
+
+	if (!strcmp(evname, "sched_switch") || sched_process_exit) {
+		list_for_each_entry(ent, &samples, list)
+			if (sample->pid == ent->pid)
+				break;
+
+		if (&ent->list != &samples) {
+			list_del(&ent->list);
+			free(ent);
+		}
+
+		if (sched_process_exit)
+			return 0;
+
+		ent = malloc(size + offsetof(struct event_entry, header));
+		ent->pid = sample->pid;
+		memcpy(&ent->header, buf, size);
+		list_add(&ent->list, &samples);
+		return 0;
+
+	} else if (!strncmp(evname, "sched_stat_", 11)) {
+		u32 pid;
+
+		pid = raw_field_value(e, "pid", sample->raw_data);
+
+		list_for_each_entry(ent, &samples, list) {
+			if (pid == ent->pid)
+				break;
+		}
+
+		if (&ent->list == &samples) {
+			pr_warning("Could not find sched_switch for pid %u\n", pid);
+			return 0;
+		}
+
+		event_sw = (union perf_event *) &ent->header;
+		perf_session__parse_sample(session, event_sw, &sample_sw);
+		perf_event__repipe(event_sw, &sample_sw, session);
+		return 0;
+	}
+
+	perf_event__repipe(event, sample, session);
+
+	return 0;
+}
 struct perf_event_ops inject_ops = {
 	.sample		= perf_event__repipe_sample,
 	.mmap		= perf_event__repipe,
@@ -214,6 +296,9 @@ static int __cmd_inject(void)
 		inject_ops.mmap	= perf_event__repipe_mmap;
 		inject_ops.fork	= perf_event__repipe_task;
 		inject_ops.tracing_data = perf_event__repipe_tracing_data;
+	} else if (inject_sched_stat) {
+		inject_ops.sample = perf_event__sched_stat;
+		inject_ops.ordered_samples = true;
 	}

 	session = perf_session__new(input_name, O_RDONLY, false, true, &inject_ops);
@@ -241,6 +326,8 @@ static const char * const report_usage[] = {
 static const struct option options[] = {
 	OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
 		    "Inject build-ids into the output stream"),
+	OPT_BOOLEAN('s', "sched-stat", &inject_sched_stat,
+		    "correct call-chains for shed-stat-*"),
 	OPT_STRING('i', "input", &input_name, "file",
 		   "input file name"),
 	OPT_STRING('o', "output", &output_name, "file",
--
1.7.1
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-10 18:50 UTC
To: avagin
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

On 10/7/11 6:45 PM, avagin@gmail.com wrote:
>> Andrew, are you already working on user space patches?
>
> Yes, I'm working on it. I've attached a draft version.

Generally I'm happy with the direction of these patches. For doing
wall-time-based profiling, we'll need support in perf inject to merge
sleep + cycles events.

Re: examples of why one might need histograms based on various fields
in a dynamic kernel probe: I might want to add probes such as

perf probe --add 'sys_read fd=%di'
perf probe --add 'sys_read%return bytes=$retval'

and expect to find the stack traces responsible for most I/O, keyed by
bytes vs fd vs number of syscalls.

 -Arun
* Re: Profiling sleep times?
From: Ingo Molnar @ 2011-10-12 7:41 UTC
To: Arun Sharma
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

* Arun Sharma <asharma@fb.com> wrote:

> On 10/7/11 6:45 PM, avagin@gmail.com wrote:
> >> Andrew, are you already working on user space patches?
> >
> > Yes, I'm working on it. I've attached a draft version.
>
> Generally I'm happy with the direction of these patches. [...]

Ok, will someone please send an updated series of all missing patches,
with all acks included, for potential v3.2 pickup?

Sleep time profiling is obviously a very hot feature ...

Thanks,

	Ingo
* Re: Profiling sleep times?
From: Andrew Vagin @ 2011-10-13 5:39 UTC
To: Ingo Molnar
Cc: Arun Sharma, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

> Ok, will someone please send an updated series of all missing
> patches, with all acks included, for potential v3.2 pickup?

I'm on vacation now, so I'm going to send my patches in two to three
weeks. Sorry for the delay.

> Sleep time profiling is obviously a very hot feature ...
>
> Thanks,
>
> 	Ingo
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-14 21:19 UTC
To: Ingo Molnar
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On 10/12/11 12:41 AM, Ingo Molnar wrote:
>> Generally I'm happy with the direction of these patches. [...]
>
> Ok, will someone please send an updated series of all missing
> patches, with all acks included, for potential v3.2 pickup?
>
> Sleep time profiling is obviously a very hot feature ...

I acked two of Andrew's patches. The third one is the most important
one. I haven't had a chance to review that one carefully yet -
although it's working well for me.

We still need to figure out the best way to munge all this data in
user space to implement wall clock profiling.

 -Arun
* Re: Profiling sleep times?
From: Ingo Molnar @ 2011-10-15 17:00 UTC
To: Arun Sharma
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

* Arun Sharma <asharma@fb.com> wrote:

> I acked two of Andrew's patches. The third one is the most important
> one. I haven't had a chance to review that one carefully yet -
> although it's working well for me.
>
> We still need to figure out the best way to munge all this data in
> user space to implement wall clock profiling.

Well, if 'perf report' can show sleep-time-ordered entries just fine
by default (if a proper perf record was done), and that also works in
call-graph recording mode, then i'm a happy camper.

Sleep time should really just be a different notion of 'cost of the
function/callchain' and fit into the existing scheme, right?

Thanks,

	Ingo
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-15 19:22 UTC
To: Ingo Molnar
Cc: Arun Sharma, avagin, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On Sat, 2011-10-15 at 19:00 +0200, Ingo Molnar wrote:
> Well, if 'perf report' can show sleep-time-ordered entries just fine
> by default (if a proper perf record was done), and that also works in
> call-graph recording mode, then i'm a happy camper.
>
> Sleep time should really just be a different notion of 'cost of the
> function/callchain' and fit into the existing scheme, right?

The problem with Andrew's patches is that they wreck the callchain
semantics. The waittime tracepoint is in the wakeup path (and hence
generates the waker's callchain), whereas they really want the
callchain of the woken task, to show where it spent its time.
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-15 19:29 UTC
To: Ingo Molnar
Cc: Arun Sharma, avagin, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On Sat, 2011-10-15 at 21:22 +0200, Peter Zijlstra wrote:
> The problem with Andrew's patches is that they wreck the callchain
> semantics. The waittime tracepoint is in the wakeup path (and hence
> generates the waker's callchain), whereas they really want the
> callchain of the woken task, to show where it spent its time.

We could of course try to move the tracepoint into the schedule path,
so we issue it the first time the task gets scheduled after the
wakeup, but I suspect that will just add more overhead, and we really
could do without that.
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-18 1:07 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On 10/15/11 12:29 PM, Peter Zijlstra wrote:
> We could of course try to move the tracepoint into the schedule path,
> so we issue it the first time the task gets scheduled after the
> wakeup, but I suspect that will just add more overhead, and we really
> could do without that.

Do we need to define new tracepoints? I suspect we could make the
existing ones:

trace_sched_stat_wait()
trace_sched_stat_sleep()

work for this purpose. The length of time the task was off the CPU
could then be computed as sleep + wait. The downside is that the
complexity moves to user space:

perf record -e sched:sched_stat_sleep,sched:sched_stat_wait,...

Re: changing the semantics of tracepoint callchains

Yeah - this could be surprising. Luckily, most tracepoints retain
their semantics, but a few special ones don't. I guess we just need to
document the new behavior.

 -Arun
* Re: Profiling sleep times?
From: Frederic Weisbecker @ 2011-10-22 10:49 UTC
To: Arun Sharma
Cc: Peter Zijlstra, Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian

On Mon, Oct 17, 2011 at 06:07:00PM -0700, Arun Sharma wrote:
> Re: changing the semantics of tracepoint callchains
>
> Yeah - this could be surprising. Luckily, most tracepoints retain
> their semantics, but a few special ones don't. I guess we just need to
> document the new behavior.

That's not only a problem of semantics, although that alone is a
problem; people will seldom read the documentation for corner cases.
We should really stay consistent here: if remote callchains are really
needed, we want a specific interface for that, not an abuse of the
existing one that would only confuse people.

Now, I still think doing remote callchains is asking for trouble: we
need to ensure the target is really sleeping and is not going to be
scheduled concurrently, otherwise you might get weird or stale
results. So the user needs to know which tracepoints are safe for
this.

Then comes the problem of dealing with remote callchains in user
space: the event comes from one task but the callchain is from
another. You need the perf tools to handle remote dsos/mappings/syms
etc...

That's a lot of unnecessary complication.

I think we should use something like a perf report plugin: perhaps
something that can create a virtual event on top of real ones: process
the sched:sched_switch events, find the time tasks are sleeping, and
create virtual sleep events on top of that with a period weighted by
the sleep time. Just a thought.
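A sketch of that virtual-event idea: pair each switch-out of a task
with its next switch-in and emit one synthetic sample weighted by the
time slept. This is illustrative user-space code, not perf tool
internals; emit_sleep_sample() and the fixed-size pid table are
stand-ins, and a real pass would also check prev_state to separate
sleeping from mere preemption:

#include <stdio.h>

#define MAX_PIDS 32768

static unsigned long long switched_out_at[MAX_PIDS];	/* 0 = running */

static void emit_sleep_sample(int pid, unsigned long long weight_ns)
{
	printf("pid %d slept %llu ns\n", pid, weight_ns);
}

static void on_sched_switch(int prev_pid, int next_pid,
			    unsigned long long now_ns)
{
	switched_out_at[prev_pid] = now_ns;	/* prev goes off-CPU */

	if (switched_out_at[next_pid]) {	/* next was off-CPU */
		emit_sleep_sample(next_pid,
				  now_ns - switched_out_at[next_pid]);
		switched_out_at[next_pid] = 0;
	}
}

int main(void)
{
	/* Toy trace: pid 42 switches out at t=1000, back in at t=5000. */
	on_sched_switch(42, 1, 1000);
	on_sched_switch(1, 42, 5000);	/* prints: pid 42 slept 4000 ns */
	return 0;
}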
* Re: Profiling sleep times?
From: Andrew Wagin @ 2011-10-22 16:22 UTC
To: Frederic Weisbecker
Cc: Arun Sharma, Peter Zijlstra, Ingo Molnar, linux-perf-users, acme, Stephane Eranian

Hi, all. Sorry for the late response; I was on vacation.

I see the miscommunication. You explained to me that remote callchains
were not a good idea, and now I think the same. I added a plugin to
"perf inject" instead. Example usage:

# ./perf record -ag -e sched:sched_switch --filter "prev_state == 1" \
	-e sched:sched_process_exit \
	-e sched:sched_stat_sleep --filter "comm == foo" ~/foo
# ./perf inject -s -i perf.data -o perf.data.d
# ./perf report -i perf.data.d

I'm going to send the patches soon.

2011/10/22 Frederic Weisbecker <fweisbec@gmail.com>:
> I think we should use something like a perf report plugin: perhaps
> something that can create a virtual event on top of real ones: process
> the sched:sched_switch events, find the time tasks are sleeping, and
> create virtual sleep events on top of that with a period weighted by
> the sleep time. Just a thought.
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-23 0:27 UTC
To: Frederic Weisbecker
Cc: Peter Zijlstra, Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian

On 10/22/11 3:49 AM, Frederic Weisbecker wrote:
> That's not only a problem of semantics, although that alone is a
> problem; people will seldom read the documentation for corner cases.
> We should really stay consistent here: if remote callchains are really
> needed, we want a specific interface for that, not an abuse of the
> existing one that would only confuse people.

A separate interface sounds good.

> Now, I still think doing remote callchains is asking for trouble: we
> need to ensure the target is really sleeping and is not going to be
> scheduled concurrently, otherwise you might get weird or stale
> results. So the user needs to know which tracepoints are safe for
> this.

I expect this interface to be used in a small number of well-known
places in the kernel.

[...]

> I think we should use something like a perf report plugin: perhaps
> something that can create a virtual event on top of real ones: process
> the sched:sched_switch events, find the time tasks are sleeping, and
> create virtual sleep events on top of that with a period weighted by
> the sleep time. Just a thought.

Right - whether we're doing wall-clock profiling or sleep profiling,
it will involve looking at multiple real events.

I still see one problem with doing

perf record -ag -e <bunch of events>

and trying to sort through what happened: unprivileged users who don't
have permission to do system-wide profiling, but do have privileges to
profile their own processes, get locked out of this feature.

 -Arun