linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Sleep Profiling v2
@ 2011-12-20 18:09 Arun Sharma
  2011-12-20 18:09 ` [PATCH 1/2] sched: Retain sleep/block information Arun Sharma
  2011-12-20 18:09 ` [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime Arun Sharma
  0 siblings, 2 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-20 18:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arun Sharma

Changes since v1:

* Define a new sched_stat tracepoint
* Rebased to -tip

When debugging latency, users want to know where their code
is executing on the CPU (cycle profiling) as well as where 
it's waiting sleeping or waiting for IO. The following patch set
tries to address the latter.

Normal users can use this tracepoint without root privileges
and on a system with lots of context switches, load can be
reduced by filtering out uninteresting switches.

Sample command lines:

perf record -gPe sched:sched_stat_sleeptime --filter "sleeptime > 10000" -- ./test
perf report --stdio -g graph -G

Arun Sharma (2):
  sched: Retain sleep/block information
  tracing, sched: Add a new tracepoint for sleeptime

 include/trace/events/sched.h |   48 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |    1 +
 kernel/sched/fair.c          |    6 +++-
 3 files changed, 53 insertions(+), 2 deletions(-)

-- 
1.7.4


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] sched: Retain sleep/block information
  2011-12-20 18:09 [PATCH 0/2] Sleep Profiling v2 Arun Sharma
@ 2011-12-20 18:09 ` Arun Sharma
  2011-12-20 18:09 ` [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime Arun Sharma
  1 sibling, 0 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-20 18:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Arun Sharma, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
	Ingo Molnar

This is so that we can compute the sleep/block
delay at context switch time for the purpose
of sleep profiling.

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/fair.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..8ca98eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1003,7 +1003,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (unlikely(delta > se->statistics.sleep_max))
 			se->statistics.sleep_max = delta;
 
-		se->statistics.sleep_start = 0;
 		se->statistics.sum_sleep_runtime += delta;
 
 		if (tsk) {
@@ -1020,7 +1019,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (unlikely(delta > se->statistics.block_max))
 			se->statistics.block_max = delta;
 
-		se->statistics.block_start = 0;
 		se->statistics.sum_sleep_runtime += delta;
 
 		if (tsk) {
@@ -1193,8 +1191,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 			if (tsk->state & TASK_INTERRUPTIBLE)
 				se->statistics.sleep_start = rq_of(cfs_rq)->clock;
+			else
+				se->statistics.sleep_start = 0;
 			if (tsk->state & TASK_UNINTERRUPTIBLE)
 				se->statistics.block_start = rq_of(cfs_rq)->clock;
+			else
+				se->statistics.block_start = 0;
 		}
 #endif
 	}
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-20 18:09 [PATCH 0/2] Sleep Profiling v2 Arun Sharma
  2011-12-20 18:09 ` [PATCH 1/2] sched: Retain sleep/block information Arun Sharma
@ 2011-12-20 18:09 ` Arun Sharma
  2011-12-20 18:16   ` Steven Rostedt
  2011-12-21 10:43   ` Andrey Vagin
  1 sibling, 2 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-20 18:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Arun Sharma, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
	Ingo Molnar

If CONFIG_SCHEDSTATS is defined, the kernel maintains
information about how long the task was sleeping or
in the case of iowait, blocking in the kernel before
getting woken up.

Note: this information is only provided for sched_fair.
Other scheduling classes may choose to provide this in
the future.

Note: the delay includes the time spent on the runqueue
as well.

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org
---
 include/trace/events/sched.h |   48 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |    1 +
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 959ff18..5189220 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -363,6 +363,54 @@ TRACE_EVENT(sched_stat_runtime,
 			(unsigned long long)__entry->vruntime)
 );
 
+#ifdef CREATE_TRACE_POINTS
+static inline u64 trace_get_sleeptime(struct task_struct *tsk) 
+{
+#ifdef CONFIG_SCHEDSTATS
+	u64 block, sleep;
+
+	block = tsk->se.statistics.block_start;
+	sleep = tsk->se.statistics.sleep_start;
+
+	return block ? block : sleep ? sleep : 0;
+#else
+	return 0;
+#endif
+}
+#endif
+
+/*
+ * Tracepoint for accounting sleeptime (time the task is sleeping 
+ * or waiting for I/O).
+ */
+TRACE_EVENT(sched_stat_sleeptime,
+
+	TP_PROTO(struct task_struct *tsk, u64 now),
+
+	TP_ARGS(tsk, now),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( u64,	sleeptime		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		= tsk->pid;
+		__entry->sleeptime = trace_get_sleeptime(tsk);
+		__entry->sleeptime = __entry->sleeptime ? 
+				now - __entry->sleeptime : 0;
+	)
+	TP_perf_assign(
+		__perf_count(__entry->sleeptime);
+	),
+
+	TP_printk("comm=%s pid=%d sleeptime=%Lu [ns]",
+			__entry->comm, __entry->pid,
+			(unsigned long long)__entry->sleeptime)
+);
+
 /*
  * Tracepoint for showing priority inheritance modifying a tasks
  * priority.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca8fd44..172e6ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1899,6 +1899,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	local_irq_enable();
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
+	trace_sched_stat_sleeptime(current, rq->clock);
 
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-20 18:09 ` [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime Arun Sharma
@ 2011-12-20 18:16   ` Steven Rostedt
  2011-12-20 18:22     ` Arun Sharma
  2011-12-21 10:43   ` Andrey Vagin
  1 sibling, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2011-12-20 18:16 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Peter Zijlstra, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
	Ingo Molnar

On Tue, 2011-12-20 at 10:09 -0800, Arun Sharma wrote:
> If CONFIG_SCHEDSTATS is defined, the kernel maintains
> information about how long the task was sleeping or
> in the case of iowait, blocking in the kernel before
> getting woken up.
> 
> Note: this information is only provided for sched_fair.
> Other scheduling classes may choose to provide this in
> the future.
> 
> Note: the delay includes the time spent on the runqueue
> as well.
> 
> Signed-off-by: Arun Sharma <asharma@fb.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
> Cc: Andrew Vagin <avagin@openvz.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/trace/events/sched.h |   48 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c          |    1 +
>  2 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 959ff18..5189220 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -363,6 +363,54 @@ TRACE_EVENT(sched_stat_runtime,
>  			(unsigned long long)__entry->vruntime)
>  );
>  
> +#ifdef CREATE_TRACE_POINTS

Why the #ifdef? a static inline is fine to keep in the open. If nothing
uses it, then it just wont be used or defined.

-- Steve

> +static inline u64 trace_get_sleeptime(struct task_struct *tsk) 
> +{
> +#ifdef CONFIG_SCHEDSTATS
> +	u64 block, sleep;
> +
> +	block = tsk->se.statistics.block_start;
> +	sleep = tsk->se.statistics.sleep_start;
> +
> +	return block ? block : sleep ? sleep : 0;
> +#else
> +	return 0;
> +#endif
> +}
> +#endif
> +
> +


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-20 18:16   ` Steven Rostedt
@ 2011-12-20 18:22     ` Arun Sharma
  2011-12-20 19:19       ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Arun Sharma @ 2011-12-20 18:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Peter Zijlstra, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
	Ingo Molnar

On 12/20/11 10:16 AM, Steven Rostedt wrote:

>> +#ifdef CREATE_TRACE_POINTS
>
> Why the #ifdef? a static inline is fine to keep in the open. If nothing
> uses it, then it just wont be used or defined.
>

Without the ifdef I get:

In file included from include/trace/define_trace.h:80,
                  from include/trace/events/sched.h:445,
                  from kernel/sched/core.c:86:
include/trace/events/sched.h:366: error: redefinition of 
‘trace_get_sleeptime’
include/trace/events/sched.h:366: note: previous definition of 
‘trace_get_sleeptime’ was here

So I shamelessly copied what the other static inline function in the 
file does.

I also noticed that sched_stat_sleeptime() is very similar to 
sched_stat_sleep(). The latter covers sleeps but not iowait and doesn't 
include the scheduler latency. Let me know if you think this is a show 
stopper that needs a rename.

  -Arun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-20 18:22     ` Arun Sharma
@ 2011-12-20 19:19       ` Steven Rostedt
  0 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2011-12-20 19:19 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Peter Zijlstra, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
	Ingo Molnar

On Tue, 2011-12-20 at 10:22 -0800, Arun Sharma wrote:
> On 12/20/11 10:16 AM, Steven Rostedt wrote:
> 
> >> +#ifdef CREATE_TRACE_POINTS
> >
> > Why the #ifdef? a static inline is fine to keep in the open. If nothing
> > uses it, then it just wont be used or defined.
> >
> 
> Without the ifdef I get:
> 
> In file included from include/trace/define_trace.h:80,
>                   from include/trace/events/sched.h:445,
>                   from kernel/sched/core.c:86:
> include/trace/events/sched.h:366: error: redefinition of 
> ‘trace_get_sleeptime’
> include/trace/events/sched.h:366: note: previous definition of 
> ‘trace_get_sleeptime’ was here
> 
> So I shamelessly copied what the other static inline function in the 
> file does.

Ah, OK, it's because of the duplicate includes. Still ugly though. But
we can clean that up later.


> 
> I also noticed that sched_stat_sleeptime() is very similar to 
> sched_stat_sleep(). The latter covers sleeps but not iowait and doesn't 
> include the scheduler latency. Let me know if you think this is a show 
> stopper that needs a rename.

what about sched_stat_iowait() too. As for latency, I don't know. That's
more Peter's decision.

-- Steve



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-20 18:09 ` [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime Arun Sharma
  2011-12-20 18:16   ` Steven Rostedt
@ 2011-12-21 10:43   ` Andrey Vagin
  2011-12-22  0:18     ` Arun Sharma
  2012-01-19 22:30     ` Arun Sharma
  1 sibling, 2 replies; 9+ messages in thread
From: Andrey Vagin @ 2011-12-21 10:43 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
	Ingo Molnar


>
> +#ifdef CREATE_TRACE_POINTS
> +static inline u64 trace_get_sleeptime(struct task_struct *tsk)
> +{
> +#ifdef CONFIG_SCHEDSTATS
> +	u64 block, sleep;
> +
> +	block = tsk->se.statistics.block_start;
> +	sleep = tsk->se.statistics.sleep_start;
Arun, probably you have missed one of my comments.
block_start and sleep_start should be zeroized here.

tsk->se.statistics.block_start = 0;
tsk->se.statistics.sleep_start = 0;

The following program provokes this problem:

#include<unistd.h>
#include<time.h>
#include<sys/select.h>

int main()
{
           struct timespec ts1;
           struct timeval tv1;
           int i;
           long s;


           for (i = 0; i<  1000; i++) {
                   ts1.tv_sec = 0;
                   ts1.tv_nsec = 1000000;
                   nanosleep(&ts1, NULL);

                   tv1.tv_sec = 0;
                   tv1.tv_usec = 4000;
                   select(0, NULL, NULL, NULL,&tv1);
                   if (i % 100 == 0) {
                         s = time(NULL);
                         while (time(NULL) - s < 4);
                   }
           }
         return 1;
}

> +
> +	return block ? block : sleep ? sleep : 0;
> +#else
> +	return 0;
> +#endif
> +}
> +#endif
> +
> +/*
> + * Tracepoint for accounting sleeptime (time the task is sleeping
> + * or waiting for I/O).
> + */
> +TRACE_EVENT(sched_stat_sleeptime,
> +
> +	TP_PROTO(struct task_struct *tsk, u64 now),
> +
> +	TP_ARGS(tsk, now),
> +
> +	TP_STRUCT__entry(
> +		__array( char,	comm,	TASK_COMM_LEN	)
> +		__field( pid_t,	pid			)
> +		__field( u64,	sleeptime		)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> +		__entry->pid		= tsk->pid;
> +		__entry->sleeptime = trace_get_sleeptime(tsk);
> +		__entry->sleeptime = __entry->sleeptime ?
> +				now - __entry->sleeptime : 0;
> +	)
> +	TP_perf_assign(
> +		__perf_count(__entry->sleeptime);
> +	),
> +
> +	TP_printk("comm=%s pid=%d sleeptime=%Lu [ns]",
> +			__entry->comm, __entry->pid,
> +			(unsigned long long)__entry->sleeptime)
> +);
> +
>   /*
>    * Tracepoint for showing priority inheritance modifying a tasks
>    * priority.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca8fd44..172e6ee 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1899,6 +1899,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>   	local_irq_enable();
>   #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
>   	finish_lock_switch(rq, prev);
> +	trace_sched_stat_sleeptime(current, rq->clock);
>
>   	fire_sched_in_preempt_notifiers(current);
>   	if (mm)


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-21 10:43   ` Andrey Vagin
@ 2011-12-22  0:18     ` Arun Sharma
  2012-01-19 22:30     ` Arun Sharma
  1 sibling, 0 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-22  0:18 UTC (permalink / raw)
  To: avagin
  Cc: linux-kernel, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Frederic Weisbecker, Ingo Molnar

On 12/21/11 2:43 AM, Andrey Vagin wrote:

> Arun, probably you have missed one of my comments.
> block_start and sleep_start should be zeroized here.
>
> tsk->se.statistics.block_start = 0;
> tsk->se.statistics.sleep_start = 0;
>
> The following program provokes this problem:

Yes - I overlooked the comment. Thanks for bringing it up (especially 
the test program). Fixed in v3.

  -Arun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime
  2011-12-21 10:43   ` Andrey Vagin
  2011-12-22  0:18     ` Arun Sharma
@ 2012-01-19 22:30     ` Arun Sharma
  1 sibling, 0 replies; 9+ messages in thread
From: Arun Sharma @ 2012-01-19 22:30 UTC (permalink / raw)
  To: avagin
  Cc: linux-kernel, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Frederic Weisbecker, Ingo Molnar

On 12/21/11 2:43 AM, Andrey Vagin wrote:
>
>>
>> +#ifdef CREATE_TRACE_POINTS
>> +static inline u64 trace_get_sleeptime(struct task_struct *tsk)
>> +{
>> +#ifdef CONFIG_SCHEDSTATS
>> + u64 block, sleep;
>> +
>> + block = tsk->se.statistics.block_start;
>> + sleep = tsk->se.statistics.sleep_start;
> Arun, probably you have missed one of my comments.
> block_start and sleep_start should be zeroized here.
>
> tsk->se.statistics.block_start = 0;
> tsk->se.statistics.sleep_start = 0;

This still doesn't solve one minor problem: the first sample we get 
might be bad. Here's the sequence that could trigger it:

t1: task goes to sleep. sleep_start=t1
t2: task gets woken up. sleep_start is still t1
t3: context switch. trace_get_sleeptime() is not active. sleep_start is 
still t1
t4: trace_get_sleeptime() is activated
t5: task gets context switched out involuntarily.
t6: task gets context switched in. first sample from the task.

We compute sleeptime=t6-t1. The correct answer should be sleeptime=0.

We should set {sleep,block}_start to 0 regardless of whether the 
tracepoint is active or not.

I'll post a patch shortly.

  -Arun

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-01-19 22:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-20 18:09 [PATCH 0/2] Sleep Profiling v2 Arun Sharma
2011-12-20 18:09 ` [PATCH 1/2] sched: Retain sleep/block information Arun Sharma
2011-12-20 18:09 ` [PATCH 2/2] tracing, sched: Add a new tracepoint for sleeptime Arun Sharma
2011-12-20 18:16   ` Steven Rostedt
2011-12-20 18:22     ` Arun Sharma
2011-12-20 19:19       ` Steven Rostedt
2011-12-21 10:43   ` Andrey Vagin
2011-12-22  0:18     ` Arun Sharma
2012-01-19 22:30     ` Arun Sharma

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).