* [PATCH 0/2] Sleep profiling @ 2011-12-19 23:23 Arun Sharma 2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma 2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma 0 siblings, 2 replies; 9+ messages in thread From: Arun Sharma @ 2011-12-19 23:23 UTC (permalink / raw) To: linux-kernel; +Cc: Arun Sharma When debugging latency, users want to know where their code is executing on the CPU (cycle profiling) as well as where it's waiting sleeping or waiting for IO. The following patch set tries to address the latter. Normal users can use this tracepoint without root privileges and on a system with lots of context switches, load can be reduced by filtering out uninteresting switches. Sample command lines: perf record -gPe sched:sched_switch --filter "delay > 10000" -- ./test perf report --stdio -g graph -G Arun Sharma (2): tracing, sched: move the sched_switch tracepoint tracing, sched: Add delay info to sched_switch include/trace/events/sched.h | 21 +++++++++++++++++---- kernel/sched.c | 2 +- kernel/sched_fair.c | 6 ++++-- kernel/trace/trace_sched_switch.c | 2 +- 4 files changed, 23 insertions(+), 8 deletions(-) -- 1.7.4 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] tracing, sched: move the sched_switch tracepoint 2011-12-19 23:23 [PATCH 0/2] Sleep profiling Arun Sharma @ 2011-12-19 23:23 ` Arun Sharma 2011-12-19 23:29 ` Peter Zijlstra ` (2 more replies) 2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma 1 sibling, 3 replies; 9+ messages in thread From: Arun Sharma @ 2011-12-19 23:23 UTC (permalink / raw) To: linux-kernel Cc: Arun Sharma, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar Move the tracepoint from prepare_task_switch() to finish_task_switch(). Without this, the event gets attributed to prev rather than next. For sleep profiling purposes, we want to collect the stacktrace of "next". This also makes the event usable in per-process mode, without root privileges. I can't think of this breaking the semantics of the existing uses of this tracepoint. If there are any concerns, we should be able to define a new tracepoint usable for sleep profiling. Signed-off-by: Arun Sharma <asharma@fb.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: linux-kernel@vger.kernel.org --- kernel/sched.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index d6b149c..b06b3c6 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3157,7 +3157,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, fire_sched_out_preempt_notifiers(prev, next); prepare_lock_switch(rq, next); prepare_arch_switch(next); - trace_sched_switch(prev, next); } /** @@ -3205,6 +3204,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev) #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ finish_lock_switch(rq, prev); + trace_sched_switch(prev, current, rq->clock); fire_sched_in_preempt_notifiers(current); if (mm) mmdrop(mm); -- 1.7.4 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tracing, sched: move the sched_switch tracepoint 2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma @ 2011-12-19 23:29 ` Peter Zijlstra 2011-12-19 23:36 ` Peter Zijlstra 2011-12-19 23:40 ` Mathieu Desnoyers 2 siblings, 0 replies; 9+ messages in thread From: Peter Zijlstra @ 2011-12-19 23:29 UTC (permalink / raw) To: Arun Sharma Cc: linux-kernel, Steven Rostedt, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote: > - trace_sched_switch(prev, next); > + trace_sched_switch(prev, current, rq->clock); You stuck an extra argument in (from the next patch) and will break the build. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tracing, sched: move the sched_switch tracepoint 2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma 2011-12-19 23:29 ` Peter Zijlstra @ 2011-12-19 23:36 ` Peter Zijlstra 2011-12-19 23:40 ` Mathieu Desnoyers 2 siblings, 0 replies; 9+ messages in thread From: Peter Zijlstra @ 2011-12-19 23:36 UTC (permalink / raw) To: Arun Sharma Cc: linux-kernel, Steven Rostedt, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote: > +++ b/kernel/sched.c Also, you're patching a non-existent file :-) ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tracing, sched: move the sched_switch tracepoint 2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma 2011-12-19 23:29 ` Peter Zijlstra 2011-12-19 23:36 ` Peter Zijlstra @ 2011-12-19 23:40 ` Mathieu Desnoyers 2 siblings, 0 replies; 9+ messages in thread From: Mathieu Desnoyers @ 2011-12-19 23:40 UTC (permalink / raw) To: Arun Sharma Cc: linux-kernel, Peter Zijlstra, Steven Rostedt, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar * Arun Sharma (asharma@fb.com) wrote: > Move the tracepoint from prepare_task_switch() to > finish_task_switch(). > > Without this, the event gets attributed to prev rather > than next. I would have thought that the semantic of "being scheduled" (sched_switch) would belong to the context of the thread being scheduled out ? > For sleep profiling purposes, we want to collect > the stacktrace of "next". This also makes the event usable in > per-process mode, without root privileges. > > I can't think of this breaking the semantics of the existing > uses of this tracepoint. See comments inlined below, > If there are any concerns, we should > be able to define a new tracepoint usable for sleep profiling. If it can be achieved by adding analysis to user-level tools, without extracting more data, I think a less intrusive option should be preferred. But for your per-process use-case, you might need a second trace point. > > Signed-off-by: Arun Sharma <asharma@fb.com> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Arnaldo Carvalho de Melo <acme@infradead.org> > Cc: Andrew Vagin <avagin@openvz.org> > Cc: Frederic Weisbecker <fweisbec@gmail.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: linux-kernel@vger.kernel.org > --- > kernel/sched.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/kernel/sched.c b/kernel/sched.c > index d6b149c..b06b3c6 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -3157,7 +3157,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, > fire_sched_out_preempt_notifiers(prev, next); > prepare_lock_switch(rq, next); > prepare_arch_switch(next); > - trace_sched_switch(prev, next); > } > > /** > @@ -3205,6 +3204,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev) > #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ > finish_lock_switch(rq, prev); > > + trace_sched_switch(prev, current, rq->clock); Hrm, this is moving the event outside of the lock_switch, which has reenabled interrupts and released the lock. This would therefore introduce new "interesting" sequences of events where an IRQ could appear to nest over the wrong PID. Also, I'm not certain it is still safe to access "prev->" at this point. So the patch definitely needs to be reviewed as to what event order it will change and to make sure it does not introduce races. I'd also be tempted to keep the trace_sched_switch in prepare_task_switch, and create a new event for finish_task_switch. Best regards, Mathieu > fire_sched_in_preempt_notifiers(current); > if (mm) > mmdrop(mm); > -- > 1.7.4 > -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] tracing, sched: Add delay info to sched_switch 2011-12-19 23:23 [PATCH 0/2] Sleep profiling Arun Sharma 2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma @ 2011-12-19 23:23 ` Arun Sharma 2011-12-19 23:28 ` Peter Zijlstra 1 sibling, 1 reply; 9+ messages in thread From: Arun Sharma @ 2011-12-19 23:23 UTC (permalink / raw) To: linux-kernel Cc: Arun Sharma, Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar If CONFIG_SCHEDSTATS is defined, the kernel maintains information about how long the task was sleeping or in the case of iowait, blocking in the kernel before getting woken up. Note: this information is only provided for sched_fair. Other scheduling classes may choose to provide this in the future. Note: the delay includes the time spent on the runqueue as well. Signed-off-by: Arun Sharma <asharma@fb.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: linux-kernel@vger.kernel.org --- include/trace/events/sched.h | 21 +++++++++++++++++---- kernel/sched_fair.c | 6 ++++-- kernel/trace/trace_sched_switch.c | 2 +- 3 files changed, 22 insertions(+), 7 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 959ff18..830784b 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -113,9 +113,10 @@ static inline long __trace_sched_switch_state(struct task_struct *p) TRACE_EVENT(sched_switch, TP_PROTO(struct task_struct *prev, - struct task_struct *next), + struct task_struct *next, + unsigned long now), - TP_ARGS(prev, next), + TP_ARGS(prev, next, now), TP_STRUCT__entry( __array( char, prev_comm, TASK_COMM_LEN ) @@ -125,6 +126,7 @@ TRACE_EVENT(sched_switch, __array( char, next_comm, TASK_COMM_LEN ) __field( pid_t, next_pid ) __field( int, next_prio ) + __field( long, delay ) ), TP_fast_assign( @@ -135,9 +137,20 @@ TRACE_EVENT(sched_switch, memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN); __entry->next_pid = next->pid; __entry->next_prio = next->prio; +#ifdef CONFIG_SCHEDSTATS + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0; + __entry->delay = __entry->delay ? now - __entry->delay : 0; +#else + __entry->delay = 0; +#endif + ) + + TP_perf_assign( + __perf_count(__entry->delay); ), - TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", + TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d delay=%ld", __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state & (TASK_STATE_MAX-1) ? __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|", @@ -145,7 +158,7 @@ TRACE_EVENT(sched_switch, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R", __entry->prev_state & TASK_STATE_MAX ? "+" : "", - __entry->next_comm, __entry->next_pid, __entry->next_prio) + __entry->next_comm, __entry->next_pid, __entry->next_prio, __entry->delay) ); /* diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 8a39fa3..b51a613 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -893,7 +893,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se) if (unlikely(delta > se->statistics.sleep_max)) se->statistics.sleep_max = delta; - se->statistics.sleep_start = 0; se->statistics.sum_sleep_runtime += delta; if (tsk) { @@ -910,7 +909,6 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se) if (unlikely(delta > se->statistics.block_max)) se->statistics.block_max = delta; - se->statistics.block_start = 0; se->statistics.sum_sleep_runtime += delta; if (tsk) { @@ -1083,8 +1081,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) if (tsk->state & TASK_INTERRUPTIBLE) se->statistics.sleep_start = rq_of(cfs_rq)->clock; + else + se->statistics.sleep_start = 0; if (tsk->state & TASK_UNINTERRUPTIBLE) se->statistics.block_start = rq_of(cfs_rq)->clock; + else + se->statistics.block_start = 0; } #endif } diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c index 7e62c0a..b8b98d4 100644 --- a/kernel/trace/trace_sched_switch.c +++ b/kernel/trace/trace_sched_switch.c @@ -50,7 +50,7 @@ tracing_sched_switch_trace(struct trace_array *tr, } static void -probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *next) +probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *next, unsigned long now) { struct trace_array_cpu *data; unsigned long flags; -- 1.7.4 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] tracing, sched: Add delay info to sched_switch 2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma @ 2011-12-19 23:28 ` Peter Zijlstra 2011-12-19 23:34 ` Steven Rostedt 0 siblings, 1 reply; 9+ messages in thread From: Peter Zijlstra @ 2011-12-19 23:28 UTC (permalink / raw) To: Arun Sharma Cc: linux-kernel, Steven Rostedt, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote: > +#ifdef CONFIG_SCHEDSTATS > + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start > + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0; > + __entry->delay = __entry->delay ? now - __entry->delay : 0; > +#else > + __entry->delay = 0; > +#endif I don't really like this, this should be one of those sched_*_stat thingies, not sched_switch(). ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] tracing, sched: Add delay info to sched_switch 2011-12-19 23:28 ` Peter Zijlstra @ 2011-12-19 23:34 ` Steven Rostedt 2011-12-19 23:40 ` Arun Sharma 0 siblings, 1 reply; 9+ messages in thread From: Steven Rostedt @ 2011-12-19 23:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Arun Sharma, linux-kernel, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar On Tue, 2011-12-20 at 00:28 +0100, Peter Zijlstra wrote: > On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote: > > +#ifdef CONFIG_SCHEDSTATS > > + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start > > + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0; > > + __entry->delay = __entry->delay ? now - __entry->delay : 0; > > +#else > > + __entry->delay = 0; > > +#endif > > I don't really like this, this should be one of those sched_*_stat > thingies, not sched_switch(). Agreed. -- Steve ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] tracing, sched: Add delay info to sched_switch 2011-12-19 23:34 ` Steven Rostedt @ 2011-12-19 23:40 ` Arun Sharma 0 siblings, 0 replies; 9+ messages in thread From: Arun Sharma @ 2011-12-19 23:40 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, linux-kernel, Mathieu Desnoyers, Arnaldo Carvalho de Melo, Andrew Vagin, Frederic Weisbecker, Ingo Molnar On 12/19/11 3:34 PM, Steven Rostedt wrote: > On Tue, 2011-12-20 at 00:28 +0100, Peter Zijlstra wrote: >> On Mon, 2011-12-19 at 15:23 -0800, Arun Sharma wrote: >>> +#ifdef CONFIG_SCHEDSTATS >>> + __entry->delay = next->se.statistics.block_start ? next->se.statistics.block_start >>> + : next->se.statistics.sleep_start ? next->se.statistics.sleep_start : 0; >>> + __entry->delay = __entry->delay ? now - __entry->delay : 0; >>> +#else >>> + __entry->delay = 0; >>> +#endif >> >> I don't really like this, this should be one of those sched_*_stat >> thingies, not sched_switch(). > > Agreed. How does: sched_stat_sleeptime(current, now); at the same point sound? I'll leave sched_switch() as is. -Arun ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2011-12-19 23:41 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-12-19 23:23 [PATCH 0/2] Sleep profiling Arun Sharma 2011-12-19 23:23 ` [PATCH 1/2] tracing, sched: move the sched_switch tracepoint Arun Sharma 2011-12-19 23:29 ` Peter Zijlstra 2011-12-19 23:36 ` Peter Zijlstra 2011-12-19 23:40 ` Mathieu Desnoyers 2011-12-19 23:23 ` [PATCH 2/2] tracing, sched: Add delay info to sched_switch Arun Sharma 2011-12-19 23:28 ` Peter Zijlstra 2011-12-19 23:34 ` Steven Rostedt 2011-12-19 23:40 ` Arun Sharma
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).