* Profiling sleep times?
From: Arun Sharma @ 2011-10-03 19:38 UTC
To: linux-perf-users
Cc: acme, Peter Zijlstra, mingo, Stephane Eranian

Some of our users want to use perf to profile not just the code that
consumes cycles, but also the code that ends up waiting for I/O -
otherwise known as wall clock profiling. I could not find a way of
getting this info from the perf tool as-is, and am wondering whether a
software event such as PERF_COUNT_SW_SLEEP_CLOCK (below) makes sense.

The idea is that if a task sleeps for 1 ms, it should generate 1000x
more samples than a task that sleeps for 1 us. Also, the callchain
emitted should be the user stack.

If such an event is useful to a larger set of users, I could try to
work out the details of how to get to event->attr.freq in the context
switch path with low overhead, and run some tests to verify that the
profile that comes out of "perf report" looks sane. We'll also need a
way of combining PERF_COUNT_SW_TASK_CLOCK and PERF_COUNT_SW_SLEEP_CLOCK
(in userspace?) to get the full picture.

 -Arun

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c2da40d..a3e2fb4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -106,6 +106,7 @@ enum perf_sw_ids {
 	PERF_COUNT_SW_PAGE_FAULTS_MAJ	= 6,
 	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
 	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
+	PERF_COUNT_SW_SLEEP_CLOCK	= 9,
 	PERF_COUNT_SW_MAX,		/* non-ABI */
 };

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7406f36..e973862 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -877,8 +877,10 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		se->statistics.sum_sleep_runtime += delta;

 		if (tsk) {
+			u64 freq = 1000000; /* XXX: Use event->attr.freq ? */
 			account_scheduler_latency(tsk, delta >> 10, 1);
 			trace_sched_stat_sleep(tsk, delta);
+			perf_sw_event(PERF_COUNT_SW_SLEEP_CLOCK, delta/freq, 0, NULL, 0);
 		}
 	}
 	if (se->statistics.block_start) {
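For reference, a minimal sketch of how the proposed counter might be
opened from user space. PERF_COUNT_SW_SLEEP_CLOCK is the value from the
patch above, not a mainline constant, so this assumes a kernel and
headers carrying that patch:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#ifndef PERF_COUNT_SW_SLEEP_CLOCK
#define PERF_COUNT_SW_SLEEP_CLOCK 9	/* proposed id from the patch */
#endif

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_SLEEP_CLOCK;
	attr.freq = 1;			/* sample at a fixed frequency */
	attr.sample_freq = 1000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;

	/* profile self on any CPU; glibc has no wrapper for this call */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	/* ... mmap the fd and read samples, as perf record does ... */
	close(fd);
	return 0;
}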
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-03 20:17 UTC
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian

On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
> 			trace_sched_stat_sleep(tsk, delta);

That one should be accessible as a tracepoint, and will add the delay
to the count on each occurrence.
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-03 21:53 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian

On 10/3/11 1:17 PM, Peter Zijlstra wrote:
> On Mon, 2011-10-03 at 12:38 -0700, Arun Sharma wrote:
>> trace_sched_stat_sleep(tsk, delta);
>
> That one should be accessible as a tracepoint, and will add the delay
> to the count on each occurrence.

Right - the tracepoint seems to work. I was looking for the user stack
trace as well. For example:

# cat foo.c
#include <unistd.h>
#include <time.h>
#include <sys/select.h>

int main(void)
{
	struct timespec ts1;
	struct timeval tv1;
	int i;

	for (i = 0; i < 1000; i++) {
		ts1.tv_sec = 0;
		ts1.tv_nsec = 1000000;	/* 1 ms */
		nanosleep(&ts1, NULL);

		tv1.tv_sec = 0;
		tv1.tv_usec = 4000;	/* 4 ms */
		select(0, NULL, NULL, NULL, &tv1);
	}
	return 0;
}

I want something that gives me four times as many samples on select()
as on nanosleep().

perf record -g -e sched:sched_stat_sleep -- ./foo

doesn't seem to do it. Additionally, if I have a burn_1us_cpu() loop in
there, I'd expect to see as many samples from it as from nanosleep() in
the wall clock profile.

 -Arun
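burn_1us_cpu() above is not defined in the mail; a minimal sketch of
such a busy-wait helper (a hypothetical name, spinning on
clock_gettime(); link with -lrt on older glibc) might be:

#include <time.h>

static void burn_1us_cpu(void)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000LL +
		 (now.tv_nsec - start.tv_nsec) < 1000);
}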
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-04 8:34 UTC
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian

On Mon, 2011-10-03 at 14:53 -0700, Arun Sharma wrote:
> Right - the tracepoint seems to work. I was looking for the user stack
> trace as well. For example:
>
> # cat foo.c
> #include <unistd.h>
> #include <time.h>
> #include <sys/select.h>
>
> int main(void)
> {
> 	struct timespec ts1;
> 	struct timeval tv1;
> 	int i;
>
> 	for (i = 0; i < 1000; i++) {
> 		ts1.tv_sec = 0;
> 		ts1.tv_nsec = 1000000;
> 		nanosleep(&ts1, NULL);
>
> 		tv1.tv_sec = 0;
> 		tv1.tv_usec = 4000;
> 		select(0, NULL, NULL, NULL, &tv1);
> 	}
> 	return 0;
> }
>
> I want something that gives me four times as many samples on select()
> as on nanosleep().
>
> perf record -g -e sched:sched_stat_sleep -- ./foo
>
> doesn't seem to do it. Additionally, if I have a burn_1us_cpu() loop in
> there, I'd expect to see as many samples from it as from nanosleep() in
> the wall clock profile.

Would you per-chance be suffering from this:

http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-06 21:56 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

On 10/4/11 1:34 AM, Peter Zijlstra wrote:
> Would you per-chance be suffering from this:
>
> http://lkml.kernel.org/r/1317052535-1765247-2-git-send-email-avagin@openvz.org

Thanks for the pointer. Yes - Andrew's patches help. But it looks like
we need more user-space plumbing, as Frederic noted.

perf record -ge sched:sched_stat_sleep -- ./foo

doesn't quite work.

perf record -age sched:sched_stat_sleep -- ./foo

gives me:

    58.62%  foo  [unknown]  [k] 0
            |
            --- schedule
               |
               |--54.99%-- schedule_hrtimeout_range_clock
               |           schedule_hrtimeout_range
               |           poll_schedule_timeout
               |           do_select
               |           core_sys_select
               |           sys_select
               |           system_call_fastpath
               |
               |--44.81%-- do_nanosleep
               |           hrtimer_nanosleep
               |           sys_nanosleep
               |           system_call_fastpath
                --0.20%-- [...]

i.e. select() should be weighted 4x vs nanosleep(), as confirmed by the
raw delays:

perf script | grep comm=foo

foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]

Andrew, are you already working on user space patches?

 -Arun
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-07 0:05 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

[-- Attachment #1: Type: text/plain, Size: 1979 bytes --]

On 10/6/11 2:56 PM, Arun Sharma wrote:
> i.e. select() should be weighted 4x vs nanosleep(), as confirmed by
> the raw delays:
>
> perf script | grep comm=foo
>
> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
>
> Andrew, are you already working on user space patches?

Attached is a quick hack (mainly meant as an RFC to get the discussion
going). After the patch I get the expected result:

     0.16%  foo  [unknown]  [k] 0
            |
            --- schedule
               |
               |--79.26%-- schedule_hrtimeout_range_clock
               |           schedule_hrtimeout_range
               |           poll_schedule_timeout
               |           do_select
               |           core_sys_select
               |           sys_select
               |           system_call_fastpath
               |
               |--16.87%-- do_nanosleep
               |           hrtimer_nanosleep
               |           sys_nanosleep
               |           system_call_fastpath
               |
                --3.88%-- pipe_wait
                          pipe_read
                          do_sync_read
                          vfs_read
                          sys_read
                          system_call_fastpath

We still have the issue that per-process profiling doesn't work.

 -Arun

[-- Attachment #2: 0001-Use-delay-instead-of-sample-period.patch --]
[-- Type: text/plain, Size: 3309 bytes --]

From a2d6e0698eebb038b2d8e845137008dc3478cc51 Mon Sep 17 00:00:00 2001
From: Arun Sharma <asharma@fb.com>
Date: Thu, 6 Oct 2011 16:48:44 -0700
Subject: [PATCH] Use delay instead of sample->period

---
 tools/perf/builtin-report.c         |    6 ++++++
 tools/perf/util/probe-finder.c      |    6 ++++--
 tools/perf/util/trace-event-parse.c |   20 ++++++++++++++++++++
 tools/perf/util/trace-event.h       |    1 +
 4 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d7ff277..6821597 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -25,6 +25,7 @@
 #include "util/evsel.h"
 #include "util/header.h"
 #include "util/session.h"
+#include "util/trace-event.h"

 #include "util/parse-options.h"
 #include "util/parse-events.h"
@@ -111,6 +112,11 @@ static int process_sample_event(union perf_event *event,
 				struct perf_session *session)
 {
 	struct addr_location al;
+	struct perf_event_attr *attr = &evsel->attr;
+
+	if (attr->type == PERF_TYPE_TRACEPOINT) {
+		sample->period = trace_get_delay(sample->raw_data, sample->raw_size);
+	}

 	if (perf_event__preprocess_sample(event, session, &al, sample,
 					  annotate_init) < 0) {
diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index 5d73262..e7179ff 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -19,9 +19,11 @@
  *
  */

+#define _GNU_SOURCE
 #include <sys/utsname.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <ctype.h>
 #include <fcntl.h>
 #include <errno.h>
 #include <stdio.h>
@@ -30,13 +32,13 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdarg.h>
-#include <ctype.h>
 #include <dwarf-regs.h>
+#undef _GNU_SOURCE

 #include <linux/bitops.h>
+#include "util.h"
 #include "event.h"
 #include "debug.h"
-#include "util.h"
 #include "symbol.h"
 #include "probe-finder.h"
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 0a7ed5b..78b7194 100644
--- a/tools/perf/util/trace-event-parse.c
+++ b/tools/perf/util/trace-event-parse.c
@@ -2944,6 +2944,26 @@ void print_trace_event(int cpu, void *data, int size)
 	pretty_print(data, size, event);
 }

+size_t trace_get_delay(void *data, int size __unused)
+{
+	struct event *event;
+	int type;
+	struct format_field *field;
+
+	type = trace_parse_common_type(data);
+
+	event = trace_find_event(type);
+	if (!event) {
+		warning("ug! no event found for type %d", type);
+		return 0;
+	}
+
+	field = find_field(event, "delay");
+	if (!field)
+		die("can't find delay field in trace");
+	return read_size(data + field->offset, field->size);
+}
+
 static void print_fields(struct print_flag_sym *field)
 {
 	printf("{ %s, %s }", field->value, field->str);
diff --git a/tools/perf/util/trace-event.h b/tools/perf/util/trace-event.h
index f674dda..587b553 100644
--- a/tools/perf/util/trace-event.h
+++ b/tools/perf/util/trace-event.h
@@ -264,6 +264,7 @@ unsigned long long eval_flag(const char *flag);

 int read_tracing_data(int fd, struct list_head *pattrs);
 ssize_t read_tracing_data_size(int fd, struct list_head *pattrs);
+size_t trace_get_delay(void *data, int size);

 /* taken from kernel/trace/trace.h */
 enum trace_flag_type {
--
1.7.4
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-07 1:30 UTC
To: Arun Sharma
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

On Thu, 2011-10-06 at 14:56 -0700, Arun Sharma wrote:
> foo 15516 [006] 2291.187831: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.187832: sched_stat_sleep: comm=foo pid=15516 delay=4054262 [ns]
> foo 15516 [006] 2291.188895: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188896: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]
> foo 15516 [006] 2291.188897: sched_stat_sleep: comm=foo pid=15516 delay=1053565 [ns]

But the idea of the __perf_count() thing:

DECLARE_EVENT_CLASS(sched_stat_template,

	TP_PROTO(struct task_struct *tsk, u64 delay),

	TP_ARGS(tsk, delay),

	TP_STRUCT__entry(
		__array( char,	comm,	TASK_COMM_LEN	)
		__field( pid_t,	pid			)
		__field( u64,	delay			)
	),

	TP_fast_assign(
		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
		__entry->pid	= tsk->pid;
		__entry->delay	= delay;
	)
	TP_perf_assign(
		__perf_count(delay);
	),

	TP_printk("comm=%s pid=%d delay=%Lu [ns]",
			__entry->comm, __entry->pid,
			(unsigned long long)__entry->delay)
);

is that the counter is incremented with the delay, so the event should
get weighted right.

So having to get the delay out of the raw tracepoint data shouldn't be
needed.
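To make that weighting concrete, here is a self-contained toy model of
the accounting (an illustration of the idea only, not the kernel's
actual code; the 1 ms period and the 1 ms/4 ms sleeps mirror the foo
test program above):

#include <stdio.h>

/* Each tracepoint hit adds 'delay' to the count instead of 1, so with
 * a sample period P, a sleep that is 4x longer crosses the period 4x
 * as often and yields 4x the samples. */
struct sw_event {
	unsigned long long period, period_left, samples;
};

static void hit(struct sw_event *e, unsigned long long delay)
{
	while (delay >= e->period_left) {	/* may overflow repeatedly */
		delay -= e->period_left;
		e->period_left = e->period;
		e->samples++;
	}
	e->period_left -= delay;
}

int main(void)
{
	struct sw_event e = { 1000000, 1000000, 0 };	/* P = 1 ms */
	int i;

	for (i = 0; i < 1000; i++) {
		hit(&e, 1000000);	/* nanosleep: 1 ms */
		hit(&e, 4000000);	/* select:    4 ms */
	}
	/* prints 5000: 1000 from nanosleep, 4000 from select (4:1) */
	printf("samples=%llu\n", e.samples);
	return 0;
}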
* Re: Profiling sleep times?
From: avagin @ 2011-10-07 5:42 UTC
To: Peter Zijlstra
Cc: Arun Sharma, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

On 10/07/2011 05:30 AM, Peter Zijlstra wrote:
> But the idea of the __perf_count() thing:
> [...]
> is that the counter is incremented with the delay, so the event should
> get weighted right.
>
> So having to get the delay out of the raw tracepoint data shouldn't be
> needed.

__perf_count() doesn't work right now, and I have sent the patch which
fixes it. Could you commit it? Its subject is:

[PATCH 1/4] perf: fix counter of ftrace events

But that's not all. By default perf doesn't use SAMPLE_PERIOD; it bases
everything on the event count. Each "trace" event is split into several
"perf" events in perf_swevent_overflow(). The number of perf events
should be proportional to __perf_count(), but it is capped at
sysctl_perf_event_sample_rate / HZ. In my case it's 100 events.

	....
	if (hwc->interrupts != MAX_INTERRUPTS) {
		hwc->interrupts++;
		if (HZ * hwc->interrupts >
				(u64)sysctl_perf_event_sample_rate) {
	....

So Arun probably has a perf.data which contains batches of events, and
each batch contains 100 events irrespective of the "delay".
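To put numbers on that limit (assuming the then-default
sysctl_perf_event_sample_rate of 100000 and HZ = 1000; both values are
configuration-dependent):

	max events per tick = sysctl_perf_event_sample_rate / HZ
	                    = 100000 / 1000
	                    = 100

So a 1 ms sleep and a 4 ms sleep can both end up capped at the same
number of synthesized events, which flattens the intended 4:1 weighting
in the report.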
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-07 9:33 UTC
To: avagin
Cc: Arun Sharma, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

On Fri, 2011-10-07 at 09:42 +0400, avagin@gmail.com wrote:
> Could you commit it? Its subject is:
> [PATCH 1/4] perf: fix counter of ftrace events

tip/master:

commit 92e51938f5d005026ba4bb5b1fae5a86dc195b86
Author: Andrew Vagin <avagin@openvz.org>
Date:   Mon Sep 26 19:55:32 2011 +0400

    perf: Fix counter of ftrace events
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-07 17:58 UTC
To: Peter Zijlstra
Cc: linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker, Andrew Vagin

On 10/6/11 6:30 PM, Peter Zijlstra wrote:
>
> But the idea of the __perf_count() thing:
> [..]
> 	TP_perf_assign(
> 		__perf_count(delay);
> 	),
>
> 	TP_printk("comm=%s pid=%d delay=%Lu [ns]",
> 			__entry->comm, __entry->pid,
> 			(unsigned long long)__entry->delay)
> );
>
> is that the counter is incremented with the delay, so the event should
> get weighted right.
>
> So having to get the delay out of the raw tracepoint data shouldn't be
> needed.

How does this work for dynamic tracepoints? Or for tracepoints with
multiple dimensions, where the user might want to query each dimension
separately?

Andrew: what do you think about generalizing my patch to accept a
command line option(s) to specify which fields to use for the purpose
of computing the histogram?

For static tracepoints, TP_perf_assign() could act as a hint on which
field to default to.

 -Arun
* Re: Profiling sleep times?
From: avagin @ 2011-10-07 23:16 UTC
To: Arun Sharma
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

Hello Arun,

> Andrew: what do you think about generalizing my patch to accept a
> command line option(s) to specify which fields to use for the purpose
> of computing the histogram?

I have no objection, but I don't think we really need that. We don't
have a real use case yet. All the events I've seen have at most one
parameter that can be used as a weight, and for that one we already
have the "period" parameter.

As I already said, the trouble you're having with sched_stat_sleep is
due to two issues in the kernel. The first issue is that __perf_count()
doesn't work, and I sent the patch which fixes it ([PATCH] perf: fix
counter of ftrace events). The second issue is that each trace event is
split into several perf events, and their number is restricted by
sysctl_perf_event_sample_rate / HZ.

I'm not sure we should generate more than one "perf" event for each
"trace" event. I think the better way is to use SAMPLE_PERIOD, and I'm
now thinking in that direction. If you want to fix the bug with
sched_stat_sleep, you need to fix the second issue.

I found a workaround for your case:

# ./perf record -age sched:sched_stat_sleep --filter="comm == foo" \
	-c 100000 -F 100000 -- ~/foo
# ./perf report
# Events: 5K sched:sched_stat_sleep
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
    99.98%  foo  [unknown]  [k] 0
            |
            --- schedule
               |
               |--80.20%-- schedule_hrtimeout_range_clock
               |           schedule_hrtimeout_range
               |           poll_schedule_timeout
               |           do_select
               |           core_sys_select
               |           sys_select
               |           system_call_fastpath
               |
                --19.80%-- do_nanosleep
                           hrtimer_nanosleep
                           sys_nanosleep
                           system_call_fastpath

> For static tracepoints, TP_perf_assign() could act as a hint on which
> field to default to.
>
> -Arun
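As a sanity check on that workaround (assuming the __perf_count() fix
is applied, so each sample accounts for 100000 ns of sleep):

	nanosleep: 1000000 ns / 100000 ns = 10 samples per iteration
	select:    4000000 ns / 100000 ns = 40 samples per iteration

	40 / (40 + 10) = 80%    10 / (40 + 10) = 20%

which matches the 80.20% / 19.80% split in the report above.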
* Re: Profiling sleep times?
From: avagin @ 2011-10-08 1:45 UTC
To: Arun Sharma
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 416 bytes --]

> Andrew, are you already working on user space patches?

Yes, I'm working on it. I've attached a draft version. Example usage:

# ./perf record -ag -e sched:sched_switch --filter "prev_state == 1" \
	-e sched:sched_process_exit \
	-e sched:sched_stat_sleep --filter "comm == foo" ~/foo
# ./perf inject -s -i perf.data -o perf.data.d
# ./perf report -i perf.data.d

I would be glad to receive any comments.

> -Arun

[-- Attachment #2: 0001-perf-teach-perf-inject-to-work-with-files.patch --]
[-- Type: text/plain, Size: 2745 bytes --]

From c2cd0d73c11687fbc67884aba337f497e13890f9 Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Tue, 4 Oct 2011 16:10:23 +0400
Subject: [PATCH 1/3] perf: teach "perf inject" to work with files

Before this patch "perf inject" could only handle data from a pipe.
I want to use "perf inject" for reworking events; see the following
patch.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 tools/perf/builtin-inject.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 8dfc12b..8df8b71 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -13,7 +13,12 @@

 #include "util/parse-options.h"

-static char const *input_name = "-";
+static char const *input_name = "-";
+static const char *output_name = "-";
+static int pipe_output = 0;
+static int output;
+static u64 bytes_written = 0;
+
 static bool inject_build_ids;

 static int perf_event__repipe_synth(union perf_event *event,
@@ -25,12 +30,14 @@ static int perf_event__repipe_synth(union perf_event *event,
 	size = event->header.size;

 	while (size) {
-		int ret = write(STDOUT_FILENO, buf, size);
+		int ret = write(output, buf, size);
 		if (ret < 0)
 			return -errno;

 		size -= ret;
 		buf += ret;
+
+		bytes_written += ret;
 	}

 	return 0;
@@ -213,8 +220,14 @@ static int __cmd_inject(void)
 	if (session == NULL)
 		return -ENOMEM;

+	if (!pipe_output)
+		lseek(output, session->header.data_offset, SEEK_SET);
 	ret = perf_session__process_events(session, &inject_ops);

+	if (!pipe_output) {
+		session->header.data_size += bytes_written;
+		perf_session__write_header(session, session->evlist, output, true);
+	}
 	perf_session__delete(session);

 	return ret;
@@ -228,6 +241,10 @@ static const char * const report_usage[] = {
 static const struct option options[] = {
 	OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
 		    "Inject build-ids into the output stream"),
+	OPT_STRING('i', "input", &input_name, "file",
+		   "input file name"),
+	OPT_STRING('o', "output", &output_name, "file",
+		   "output file name"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show build ids, etc)"),
 	OPT_END()
@@ -243,6 +260,18 @@ int cmd_inject(int argc, const char **argv, const char *prefix __used)
 	if (argc)
 		usage_with_options(report_usage, options);

+	if (!strcmp(output_name, "-")) {
+		pipe_output = 1;
+		output = STDOUT_FILENO;
+	} else {
+		output = open(output_name, O_CREAT | O_WRONLY | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+		if (output < 0) {
+			perror("failed to create output file");
+			exit(-1);
+		}
+	}
+
 	if (symbol__init() < 0)
 		return -1;
--
1.7.1

[-- Attachment #3: 0003-perf-add-scripts-for-collecting-D-state-statistics.patch --]
[-- Type: text/plain, Size: 1420 bytes --]

From e08809ba075d92e60f669fc62c48128e06c834fb Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Thu, 6 Oct 2011 12:18:44 +0400
Subject: [PATCH 3/3] perf: add scripts for collecting D-state statistics

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 .../perf/scripts/python/bin/task-in-d-state-record |    2 ++
 .../perf/scripts/python/bin/task-in-d-state-report |    6 ++++++
 2 files changed, 8 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/scripts/python/bin/task-in-d-state-record
 create mode 100644 tools/perf/scripts/python/bin/task-in-d-state-report

diff --git a/tools/perf/scripts/python/bin/task-in-d-state-record b/tools/perf/scripts/python/bin/task-in-d-state-record
new file mode 100644
index 0000000..d70bed0
--- /dev/null
+++ b/tools/perf/scripts/python/bin/task-in-d-state-record
@@ -0,0 +1,2 @@
+#!/bin/bash
+perf record -ag -e sched:sched_switch --filter "prev_state == 2" -e sched:sched_stat_iowait -e sched:sched_process_exit $@
diff --git a/tools/perf/scripts/python/bin/task-in-d-state-report b/tools/perf/scripts/python/bin/task-in-d-state-report
new file mode 100644
index 0000000..f1ab71e
--- /dev/null
+++ b/tools/perf/scripts/python/bin/task-in-d-state-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: D-state statistics
+# args:
+perf inject -s -i perf.data -o perf.data.d || exit
+perf report -i perf.data.d || exit
+unlink perf.data.d
--
1.7.1

[-- Attachment #4: 0002-perf-teach-perf-inject-to-merge-sched_stat_-and-sche.patch --]
[-- Type: text/plain, Size: 4355 bytes --]

From 7a3152c03d07d30f47ab4fe9295212330bf821e4 Mon Sep 17 00:00:00 2001
From: Andrew Vagin <avagin@openvz.org>
Date: Tue, 4 Oct 2011 16:54:15 +0400
Subject: [PATCH 2/3] perf: teach perf inject to merge sched_stat_* and sched_switch events

You may want to know where and for how long a task is sleeping. A
callchain may be found in sched_switch and a time slice in
sched_stat_iowait, so I added a handler to perf inject for merging
these events.

The code saves the sched_switch event for each process, and when it
meets a stat_iowait event it reports the saved sched_switch event
instead, because that one contains the correct callchain. In other
words, it replaces all stat_iowait events with the proper sched_switch
events. The code doesn't modify the events themselves.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
---
 tools/perf/builtin-inject.c |   87 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 8df8b71..10bdd65 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -12,6 +12,8 @@
 #include "util/debug.h"

 #include "util/parse-options.h"
+
+#include "util/trace-event.h"

 static char const *input_name = "-";
 static const char *output_name = "-";
@@ -20,6 +22,7 @@ static int output;
 static u64 bytes_written = 0;

 static bool inject_build_ids;
+static bool inject_sched_stat;

 static int perf_event__repipe_synth(union perf_event *event,
 				    struct perf_session *session __used)
@@ -179,6 +182,85 @@ repipe:
 	return 0;
 }

+struct event_entry
+{
+	struct list_head list;
+	u32 pid;
+	struct perf_event_header header;
+};
+
+#define ENT_SIZE(size) ((size) + offsetof(event_entry, header))
+
+static LIST_HEAD(samples);
+
+static int perf_event__sched_stat(union perf_event *event,
+				  struct perf_sample *sample,
+				  struct perf_evsel *evsel __used,
+				  struct perf_session *session)
+{
+	int type;
+	struct event *e;
+	const char *evname = NULL;
+	uint32_t size;
+	void *buf = event;
+	struct event_entry *ent;
+	union perf_event *event_sw = NULL;
+	struct perf_sample sample_sw;
+	int sched_process_exit;
+
+	size = event->header.size;
+
+	type = trace_parse_common_type(sample->raw_data);
+	e = trace_find_event(type);
+	if (e)
+		evname = e->name;
+
+	sched_process_exit = !strcmp(evname, "sched_process_exit");
+
+	if (!strcmp(evname, "sched_switch") || sched_process_exit) {
+		list_for_each_entry(ent, &samples, list)
+			if (sample->pid == ent->pid)
+				break;
+
+		if (&ent->list != &samples) {
+			list_del(&ent->list);
+			free(ent);
+		}
+
+		if (sched_process_exit)
+			return 0;
+
+		ent = malloc(size + offsetof(struct event_entry, header));
+		ent->pid = sample->pid;
+		memcpy(&ent->header, buf, size);
+		list_add(&ent->list, &samples);
+		return 0;
+
+	} else if (!strncmp(evname, "sched_stat_", 11)) {
+		u32 pid;
+
+		pid = raw_field_value(e, "pid", sample->raw_data);
+
+		list_for_each_entry(ent, &samples, list) {
+			if (pid == ent->pid)
+				break;
+		}
+
+		if (&ent->list == &samples) {
+			pr_warning("Could not find sched_switch for pid %u\n", pid);
+			return 0;
+		}
+
+		event_sw = (union perf_event *) &ent->header;
+		perf_session__parse_sample(session, event_sw, &sample_sw);
+		perf_event__repipe(event_sw, &sample_sw, session);
+		return 0;
+	}
+
+	perf_event__repipe(event, sample, session);
+
+	return 0;
+}
 struct perf_event_ops inject_ops = {
 	.sample		= perf_event__repipe_sample,
 	.mmap		= perf_event__repipe,
@@ -214,6 +296,9 @@ static int __cmd_inject(void)
 		inject_ops.mmap	= perf_event__repipe_mmap;
 		inject_ops.fork	= perf_event__repipe_task;
 		inject_ops.tracing_data = perf_event__repipe_tracing_data;
+	} else if (inject_sched_stat) {
+		inject_ops.sample = perf_event__sched_stat;
+		inject_ops.ordered_samples = true;
 	}

 	session = perf_session__new(input_name, O_RDONLY, false, true, &inject_ops);
@@ -241,6 +326,8 @@ static const char * const report_usage[] = {
 static const struct option options[] = {
 	OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
 		    "Inject build-ids into the output stream"),
+	OPT_BOOLEAN('s', "sched-stat", &inject_sched_stat,
+		    "correct call-chains for shed-stat-*"),
 	OPT_STRING('i', "input", &input_name, "file",
 		   "input file name"),
 	OPT_STRING('o', "output", &output_name, "file",
--
1.7.1
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-10 18:50 UTC
To: avagin
Cc: Peter Zijlstra, linux-perf-users, acme, mingo, Stephane Eranian, Frederic Weisbecker

On 10/7/11 6:45 PM, avagin@gmail.com wrote:
>> Andrew, are you already working on user space patches?
>
> Yes, I'm working on it. I've attached a draft version.

Generally I'm happy with the direction of these patches. For doing
wall-time-based profiling, we'll need support in perf inject to merge
sleep + cycles events.

Re: examples of why one might need histograms based on various fields
in a dynamic kernel probe: I might want to add probes such as

perf probe --add 'sys_read fd=%di'
perf probe --add 'sys_read%return bytes=$retval'

and expect to find the stack traces responsible for most I/O, keyed by
bytes vs fd vs number of syscalls.

 -Arun
* Re: Profiling sleep times?
From: Ingo Molnar @ 2011-10-12 7:41 UTC
To: Arun Sharma
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

* Arun Sharma <asharma@fb.com> wrote:

> On 10/7/11 6:45 PM, avagin@gmail.com wrote:
> >> Andrew, are you already working on user space patches?
> >
> > Yes, I'm working on it. I've attached a draft version.
>
> Generally I'm happy with the direction of these patches. [...]

Ok, will someone please send an updated series of all missing patches,
with all acks included, for potential v3.2 pickup?

Sleep time profiling is obviously a very hot feature ...

Thanks,

	Ingo
* Re: Profiling sleep times?
From: Andrew Vagin @ 2011-10-13 5:39 UTC
To: Ingo Molnar
Cc: Arun Sharma, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

> Ok, will someone please send an updated series of all missing
> patches, with all acks included, for potential v3.2 pickup?

I'm on vacation now, so I'm going to send my patches in two to three
weeks. Sorry for the delay.

> Sleep time profiling is obviously a very hot feature ...
>
> Thanks,
>
> 	Ingo
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-14 21:19 UTC
To: Ingo Molnar
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On 10/12/11 12:41 AM, Ingo Molnar wrote:
>> Generally I'm happy with the direction of these patches. [...]
>
> Ok, will someone please send an updated series of all missing
> patches, with all acks included, for potential v3.2 pickup?
>
> Sleep time profiling is obviously a very hot feature ...

I acked two of Andrew's patches. The third one is the most important
one. I haven't had a chance to review that one carefully yet -
although it's working well for me.

We still need to figure out the best way to munge all this data in
user space to implement wall clock profiling.

 -Arun
* Re: Profiling sleep times?
From: Ingo Molnar @ 2011-10-15 17:00 UTC
To: Arun Sharma
Cc: avagin, Peter Zijlstra, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

* Arun Sharma <asharma@fb.com> wrote:

> I acked two of Andrew's patches. The third one is the most important
> one. I haven't had a chance to review that one carefully yet -
> although it's working well for me.
>
> We still need to figure out the best way to munge all this data in
> user space to implement wall clock profiling.

Well, if 'perf report' can show sleep-time-ordered entries just fine
by default (if a proper perf record was done), and that also works in
call-graph recording mode, then i'm a happy camper.

Sleep time should really just be a different notion of 'cost of the
function/callchain' and fit into the existing scheme, right?

Thanks,

	Ingo
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-15 19:22 UTC
To: Ingo Molnar
Cc: Arun Sharma, avagin, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On Sat, 2011-10-15 at 19:00 +0200, Ingo Molnar wrote:
> Well, if 'perf report' can show sleep-time-ordered entries just fine
> by default (if a proper perf record was done), and that also works in
> call-graph recording mode, then i'm a happy camper.
>
> Sleep time should really just be a different notion of 'cost of the
> function/callchain' and fit into the existing scheme, right?

The problem with Andrew's patches is that they wreck the callchain
semantics. The waittime tracepoint is in the wakeup path (and hence
generates the waker's callchain), whereas they really want the
callchain of the woken task, to show where it spent its time.
* Re: Profiling sleep times?
From: Peter Zijlstra @ 2011-10-15 19:29 UTC
To: Ingo Molnar
Cc: Arun Sharma, avagin, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On Sat, 2011-10-15 at 21:22 +0200, Peter Zijlstra wrote:
> The problem with Andrew's patches is that they wreck the callchain
> semantics. The waittime tracepoint is in the wakeup path (and hence
> generates the waker's callchain), whereas they really want the
> callchain of the woken task, to show where it spent its time.

We could of course try to move the tracepoint into the schedule path,
so we issue it the first time the task gets scheduled after the
wakeup, but I suspect that will just add more overhead, and we really
could do without that.
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-18 1:07 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian, Frederic Weisbecker

On 10/15/11 12:29 PM, Peter Zijlstra wrote:
> We could of course try to move the tracepoint into the schedule path,
> so we issue it the first time the task gets scheduled after the
> wakeup, but I suspect that will just add more overhead, and we really
> could do without that.

Do we need to define new tracepoints? I suspect we could make the
existing ones:

trace_sched_stat_wait()
trace_sched_stat_sleep()

work for this purpose. The length of time the task was off the CPU
could then be computed as sleep + wait. The downside is that the
complexity moves to user space:

perf record -e sched:sched_stat_sleep,sched:sched_stat_wait,...

Re: changing the semantics of tracepoint callchains

Yeah - this could be surprising. Luckily, most tracepoints retain
their semantics, but a few special ones don't. I guess we just need to
document the new behavior.

 -Arun
* Re: Profiling sleep times?
From: Frederic Weisbecker @ 2011-10-22 10:49 UTC
To: Arun Sharma
Cc: Peter Zijlstra, Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian

On Mon, Oct 17, 2011 at 06:07:00PM -0700, Arun Sharma wrote:
> Re: changing the semantics of tracepoint callchains
>
> Yeah - this could be surprising. Luckily, most tracepoints retain
> their semantics, but a few special ones don't. I guess we just need to
> document the new behavior.

That's not only a problem of semantics, although that alone is a
problem; people will seldom read the documentation for corner cases.
We should really stay consistent here: if remote callchains are really
needed, we want a specific interface for that, not an abuse of the
existing one that would only confuse people.

Now, I still think doing remote callchains is asking for trouble: we
need to ensure the target is really sleeping and is not going to be
scheduled concurrently, otherwise you might get weird or stale
results. So the user needs to know which tracepoints are safe for
this.

Then comes the problem of dealing with remote callchains in user
space: the event comes from one task but the callchain is from
another. You need the perf tools to handle remote dsos/mappings/syms
etc...

That's a lot of unnecessary complication.

I think we should use something like a perf report plugin: perhaps
something that can create a virtual event on top of real ones: process
the sched:sched_switch events, find the time tasks are sleeping, and
create virtual sleep events on top of that with a period weighted by
the sleep time. Just a thought.
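A sketch of that virtual-event idea: pair each switch-out of a task
with its next switch-in and emit one synthetic sample weighted by the
time slept. This is illustrative user-space code, not perf tool
internals; emit_sleep_sample() and the fixed-size pid table are
stand-ins, and a real pass would also check prev_state to separate
sleeping from mere preemption:

#include <stdio.h>

#define MAX_PIDS 32768

static unsigned long long switched_out_at[MAX_PIDS];	/* 0 = running */

static void emit_sleep_sample(int pid, unsigned long long weight_ns)
{
	printf("pid %d slept %llu ns\n", pid, weight_ns);
}

static void on_sched_switch(int prev_pid, int next_pid,
			    unsigned long long now_ns)
{
	switched_out_at[prev_pid] = now_ns;	/* prev goes off-CPU */

	if (switched_out_at[next_pid]) {	/* next was off-CPU */
		emit_sleep_sample(next_pid,
				  now_ns - switched_out_at[next_pid]);
		switched_out_at[next_pid] = 0;
	}
}

int main(void)
{
	/* Toy trace: pid 42 switches out at t=1000, back in at t=5000. */
	on_sched_switch(42, 1, 1000);
	on_sched_switch(1, 42, 5000);	/* prints: pid 42 slept 4000 ns */
	return 0;
}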
* Re: Profiling sleep times?
From: Andrew Wagin @ 2011-10-22 16:22 UTC
To: Frederic Weisbecker
Cc: Arun Sharma, Peter Zijlstra, Ingo Molnar, linux-perf-users, acme, Stephane Eranian

Hi, all. Sorry for the late response; I was on vacation.

I see the miscommunication. You explained to me that remote callchains
were not a good idea, and now I think the same. I added a plugin to
"perf inject" instead. Example usage:

# ./perf record -ag -e sched:sched_switch --filter "prev_state == 1" \
	-e sched:sched_process_exit \
	-e sched:sched_stat_sleep --filter "comm == foo" ~/foo
# ./perf inject -s -i perf.data -o perf.data.d
# ./perf report -i perf.data.d

I'm going to send the patches soon.

2011/10/22 Frederic Weisbecker <fweisbec@gmail.com>:
> I think we should use something like a perf report plugin: perhaps
> something that can create a virtual event on top of real ones: process
> the sched:sched_switch events, find the time tasks are sleeping, and
> create virtual sleep events on top of that with a period weighted by
> the sleep time. Just a thought.
* Re: Profiling sleep times?
From: Arun Sharma @ 2011-10-23 0:27 UTC
To: Frederic Weisbecker
Cc: Peter Zijlstra, Ingo Molnar, avagin, linux-perf-users, acme, Stephane Eranian

On 10/22/11 3:49 AM, Frederic Weisbecker wrote:
> That's not only a problem of semantics, although that alone is a
> problem; people will seldom read the documentation for corner cases.
> We should really stay consistent here: if remote callchains are really
> needed, we want a specific interface for that, not an abuse of the
> existing one that would only confuse people.

A separate interface sounds good.

> Now, I still think doing remote callchains is asking for trouble: we
> need to ensure the target is really sleeping and is not going to be
> scheduled concurrently, otherwise you might get weird or stale
> results. So the user needs to know which tracepoints are safe for
> this.

I expect this interface to be used in a small number of well-known
places in the kernel.

[...]

> I think we should use something like a perf report plugin: perhaps
> something that can create a virtual event on top of real ones: process
> the sched:sched_switch events, find the time tasks are sleeping, and
> create virtual sleep events on top of that with a period weighted by
> the sleep time. Just a thought.

Right - whether we're doing wall-clock profiling or sleep profiling,
it will involve looking at multiple real events.

I still see one problem with doing

perf record -ag -e <bunch of events>

and trying to sort through what happened: unprivileged users who don't
have permission to do system-wide profiling, but do have privileges to
profile their own processes, get locked out of this feature.

 -Arun