* [PATCH 0/2] Sleep profiling
@ 2011-12-19 23:23 Arun Sharma
2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma
2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma
0 siblings, 2 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-19 23:23 UTC (permalink / raw)
To: linux-kernel; +Cc: Arun Sharma
When debugging latency, users want to know where their code
is executing on the CPU (cycle profiling) as well as where
it's waiting sleeping or waiting for IO. The following patch set
tries to address the latter.
Normal users can use this tracepoint without root privileges
and on a system with lots of context switches, load can be
reduced by filtering out uninteresting switches.
Sample command lines:
perf record -gPe sched:sched_switch --filter "delay > 10000" -- ./test
perf report --stdio -g graph -G
Arun Sharma (2):
tracing, sched: move the sched_switch tracepoint
tracing, sched: Add delay info to sched_switch
include/trace/events/sched.h | 21 +++++++++++++++++----
kernel/sched.c | 2 +-
kernel/sched_fair.c | 6 ++++--
kernel/trace/trace_sched_switch.c | 2 +-
4 files changed, 23 insertions(+), 8 deletions(-)
--
1.7.4
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] tracing, sched: move the sched_switch tracepoint
2011-12-19 23:23 [PATCH 0/2] Sleep profiling Arun Sharma
@ 2011-12-19 23:23 ` Arun Sharma
2011-12-19 23:29 ` Peter Zijlstra
` (2 more replies)
2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma
1 sibling, 3 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-19 23:23 UTC (permalink / raw)
To: linux-kernel
Cc: Arun Sharma, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
Move the tracepoint from prepare_task_switch() to
finish_task_switch().
Without this, the event gets attributed to prev rather
than next. For sleep profiling purposes, we want to collect
the stacktrace of "next".
This also makes the event usable in per-process mode, without
root privileges.
I can't think of this breaking the semantics of the existing
uses of this tracepoint. If there are any concerns, we should
be able to define a new tracepoint usable for sleep profiling.
Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d6b149c..b06b3c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3157,7 +3157,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
- trace_sched_switch(prev, next);
}
/**
@@ -3205,6 +3204,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
finish_lock_switch(rq, prev);
+ trace_sched_switch(prev, current, rq->clock);
fire_sched_in_preempt_notifiers(current);
if (mm)
mmdrop(mm);
--
1.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] tracing, sched: Add delay info to sched_switch
2011-12-19 23:23 [PATCH 0/2] Sleep profiling Arun Sharma
2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma
@ 2011-12-19 23:23 ` Arun Sharma
2011-12-19 23:28 ` Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: Arun Sharma @ 2011-12-19 23:23 UTC (permalink / raw)
To: linux-kernel
Cc: Arun Sharma, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
If CONFIG_SCHEDSTATS is defined, the kernel maintains
information about how long the task was sleeping or
in the case of iowait, blocking in the kernel before
getting woken up.
Note: this information is only provided for sched_fair.
Other scheduling classes may choose to provide this in
the future.
Note: the delay includes the time spent on the runqueue
as well.
Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org
---
include/trace/events/sched.h | 21 +++++++++++++++++----
kernel/sched_fair.c | 6 ++++--
kernel/trace/trace_sched_switch.c | 2 +-
3 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 959ff18..830784b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -113,9 +113,10 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
TRACE_EVENT(sched_switch,
TP_PROTO(struct task_struct *prev,
- struct task_struct *next),
+ struct task_struct *next,
+ unsigned long now),
- TP_ARGS(prev, next),
+ TP_ARGS(prev, next, now),
TP_STRUCT__entry(
__array( char, prev_comm, TASK_COMM_LEN )
@@ -125,6 +126,7 @@ TRACE_EVENT(sched_switch,
__array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
__field( int, next_prio )
+ __field( long, delay )
),
TP_fast_assign(
@@ -135,9 +137,20 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
+#ifdef CONFIG_SCHEDSTATS
+ __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start
+ : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0;
+ __entry->delay = __entry->delay ? now - __entry->delay : 0;
+#else
+ __entry->delay = 0;
+#endif
+ )
+
+ TP_perf_assign(
+ __perf_count(__entry->delay);
),
- TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+ TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d delay=%ld",
__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
__entry->prev_state & (TASK_STATE_MAX-1) ?
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
@@ -145,7 +158,7 @@ TRACE_EVENT(sched_switch,
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
{ 128, "W" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
- __entry->next_comm, __entry->next_pid, __entry->next_prio)
+ __entry->next_comm, __entry->next_pid, __entry->next_prio, __entry->delay)
);
/*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8a39fa3..b51a613 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -893,7 +893,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (unlikely(delta > se->statistics.sleep_max))
se->statistics.sleep_max = delta;
- se->statistics.sleep_start = 0;
se->statistics.sum_sleep_runtime += delta;
if (tsk) {
@@ -910,7 +909,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (unlikely(delta > se->statistics.block_max))
se->statistics.block_max = delta;
- se->statistics.block_start = 0;
se->statistics.sum_sleep_runtime += delta;
if (tsk) {
@@ -1083,8 +1081,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (tsk->state & TASK_INTERRUPTIBLE)
se->statistics.sleep_start = rq_of(cfs_rq)->clock;
+ else
+ se->statistics.sleep_start = 0;
if (tsk->state & TASK_UNINTERRUPTIBLE)
se->statistics.block_start = rq_of(cfs_rq)->clock;
+ else
+ se->statistics.block_start = 0;
}
#endif
}
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 7e62c0a..b8b98d4 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -50,7 +50,7 @@ tracing_sched_switch_trace(struct trace_array *tr,
}
static void
-probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *next)
+probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *next, unsigned long now)
{
struct trace_array_cpu *data;
unsigned long flags;
--
1.7.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] tracing, sched: Add delay info to sched_switch
2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma
@ 2011-12-19 23:28 ` Peter Zijlstra
2011-12-19 23:34 ` Steven Rostedt
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2011-12-19 23:28 UTC (permalink / raw)
To: Arun Sharma
Cc: linux-kernel, Steven Rostedt, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote:
> +#ifdef CONFIG_SCHEDSTATS
> + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start
> + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0;
> + __entry->delay = __entry->delay ? now - __entry->delay : 0;
> +#else
> + __entry->delay = 0;
> +#endif
I don't really like this, this should be one of those sched_*_stat
thingies, not sched_switch().
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tracing, sched: move the sched_switch tracepoint
2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma
@ 2011-12-19 23:29 ` Peter Zijlstra
2011-12-19 23:36 ` Peter Zijlstra
2011-12-19 23:40 ` Mathieu Desnoyers
2 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2011-12-19 23:29 UTC (permalink / raw)
To: Arun Sharma
Cc: linux-kernel, Steven Rostedt, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote:
> - trace_sched_switch(prev, next);
> + trace_sched_switch(prev, current, rq->clock);
You stuck an extra argument in (from the next patch) and will break the
build.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] tracing, sched: Add delay info to sched_switch
2011-12-19 23:28 ` Peter Zijlstra
@ 2011-12-19 23:34 ` Steven Rostedt
2011-12-19 23:40 ` Arun Sharma
0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2011-12-19 23:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Arun Sharma, linux-kernel, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
On Tue, 2011-12-20 at 00:28 +0100, Peter Zijlstra wrote:
> On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote:
> > +#ifdef CONFIG_SCHEDSTATS
> > + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start
> > + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0;
> > + __entry->delay = __entry->delay ? now - __entry->delay : 0;
> > +#else
> > + __entry->delay = 0;
> > +#endif
>
> I don't really like this, this should be one of those sched_*_stat
> thingies, not sched_switch().
Agreed.
-- Steve
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tracing, sched: move the sched_switch tracepoint
2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma
2011-12-19 23:29 ` Peter Zijlstra
@ 2011-12-19 23:36 ` Peter Zijlstra
2011-12-19 23:40 ` Mathieu Desnoyers
2 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2011-12-19 23:36 UTC (permalink / raw)
To: Arun Sharma
Cc: linux-kernel, Steven Rostedt, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote:
> +++ b/kernel/sched.c
Also, you're patching a non-existent file :-)
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] tracing, sched: Add delay info to sched_switch
2011-12-19 23:34 ` Steven Rostedt
@ 2011-12-19 23:40 ` Arun Sharma
0 siblings, 0 replies; 9+ messages in thread
From: Arun Sharma @ 2011-12-19 23:40 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, linux-kernel, Mathieu Desnoyers,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
On 12/19/11 3:34 PM, Steven Rostedt wrote:
> On Tue, 2011-12-20 at 00:28 +0100, Peter Zijlstra wrote:
>> On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote:
>>> +#ifdef CONFIG_SCHEDSTATS
>>> + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start
>>> + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0;
>>> + __entry->delay = __entry->delay ? now - __entry->delay : 0;
>>> +#else
>>> + __entry->delay = 0;
>>> +#endif
>>
>> I don't really like this, this should be one of those sched_*_stat
>> thingies, not sched_switch().
>
> Agreed.
How does:
sched_stat_sleeptime(current, now);
at the same point sound? I'll leave sched_switch() as is.
-Arun
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tracing, sched: move the sched_switch tracepoint
2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma
2011-12-19 23:29 ` Peter Zijlstra
2011-12-19 23:36 ` Peter Zijlstra
@ 2011-12-19 23:40 ` Mathieu Desnoyers
2 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2011-12-19 23:40 UTC (permalink / raw)
To: Arun Sharma
Cc: linux-kernel, Peter Zijlstra, Steven Rostedt,
Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker,
Ingo Molnar
* Arun Sharma (asharma@fb.com) wrote:
> Move the tracepoint from prepare_task_switch() to
> finish_task_switch().
>
> Without this, the event gets attributed to prev rather
> than next.
I would have thought that the semantic of "being scheduled"
(sched_switch) would belong to the context of the thread being scheduled
out ?
> For sleep profiling purposes, we want to collect
> the stacktrace of "next". This also makes the event usable in
> per-process mode, without root privileges.
>
> I can't think of this breaking the semantics of the existing
> uses of this tracepoint.
See comments inlined below,
> If there are any concerns, we should
> be able to define a new tracepoint usable for sleep profiling.
If it can be achieved by adding analysis to user-level tools, without
extracting more data, I think a less intrusive option should be
preferred. But for your per-process use-case, you might need a second
trace point.
>
> Signed-off-by: Arun Sharma <asharma@fb.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
> Cc: Andrew Vagin <avagin@openvz.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: linux-kernel@vger.kernel.org
> ---
> kernel/sched.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index d6b149c..b06b3c6 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3157,7 +3157,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> prepare_arch_switch(next);
> - trace_sched_switch(prev, next);
> }
>
> /**
> @@ -3205,6 +3204,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> finish_lock_switch(rq, prev);
>
> + trace_sched_switch(prev, current, rq->clock);
Hrm, this is moving the event outside of the lock_switch, which has
reenabled interrupts and released the lock. This would therefore
introduce new "interesting" sequences of events where an IRQ could
appear to nest over the wrong PID. Also, I'm not certain it is still
safe to access "prev->" at this point. So the patch definitely needs to
be reviewed as to what event order it will change and to make sure it
does not introduce races.
I'd also be tempted to keep the trace_sched_switch in
prepare_task_switch, and create a new event for finish_task_switch.
Best regards,
Mathieu
> fire_sched_in_preempt_notifiers(current);
> if (mm)
> mmdrop(mm);
> --
> 1.7.4
>
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2011-12-19 23:41 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-19 23:23 [PATCH 0/2] Sleep profiling Arun Sharma
2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma
2011-12-19 23:29 ` Peter Zijlstra
2011-12-19 23:36 ` Peter Zijlstra
2011-12-19 23:40 ` Mathieu Desnoyers
2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma
2011-12-19 23:28 ` Peter Zijlstra
2011-12-19 23:34 ` Steven Rostedt
2011-12-19 23:40 ` Arun Sharma
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).