* [RFC PATCH] sched: Fix sched_wakeup tracepoint
@ 2015-06-05 11:41 Mathieu Desnoyers
  2015-06-05 12:09 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-05 11:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Thomas Gleixner, Ingo Molnar,
      Steven Rostedt, Francis Giraldeau

Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
remote wakeups. This commit appeared upstream in Linux v3.0.

Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
tracepoint. Analyzing wakeup latencies depends on getting the wakeup
chain right: which process is the waker, which is the wakee. Moving this
instrumentation outside of the waker context prevents trace analysis
tools from getting the waker pid, either through "current" in the
tracepoint probe, or by deducing it using other scheduler events based
on the CPU executing the tracepoint.

Another side-effect of moving this instrumentation to the scheduler IPI
is that the delay during which the wakeup is sitting in the pending
queue is not accounted for when calculating wakeup latency.

Therefore, move the sched_wakeup instrumentation back to the waker
context to fix those two shortcomings.

This patch is build-tested only, submitted for feedback.

Francis, can you try it out with your critical path analysis?
Fixes: 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
Reported-by: Francis Giraldeau <francis.giraldeau@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Francis Giraldeau <francis.giraldeau@gmail.com>
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d5f6f6..0ed2021 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1457,7 +1457,6 @@ static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	check_preempt_curr(rq, p, wake_flags);
-	trace_sched_wakeup(p, true);

 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
@@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 	if (task_on_rq_queued(p)) {
 		/* check_preempt_curr() may use rq clock */
 		update_rq_clock(rq);
+		trace_sched_wakeup(p, true);
 		ttwu_do_wakeup(rq, p, wake_flags);
 		ret = 1;
 	}
@@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);

+	trace_sched_wakeup(p, true);
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
@@ -1734,6 +1735,7 @@ static void try_to_wake_up_local(struct task_struct *p)
 	if (!task_on_rq_queued(p))
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);

+	trace_sched_wakeup(p, true);
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
 out:
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 11:41 [RFC PATCH] sched: Fix sched_wakeup tracepoint Mathieu Desnoyers
@ 2015-06-05 12:09 ` Peter Zijlstra
  2015-06-05 12:32   ` Mathieu Desnoyers
  2015-06-05 12:32   ` Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-05 12:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Steven Rostedt,
      Francis Giraldeau

On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
> remote wakeups. This commit appeared upstream in Linux v3.0.
>
> Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
> tracepoint. Analyzing wakeup latencies depends on getting the wakeup
> chain right: which process is the waker, which is the wakee. Moving this
> instrumentation outside of the waker context prevents trace analysis
> tools from getting the waker pid, either through "current" in the
> tracepoint probe, or by deducing it using other scheduler events based
> on the CPU executing the tracepoint.
>
> Another side-effect of moving this instrumentation to the scheduler IPI
> is that the delay during which the wakeup is sitting in the pending
> queue is not accounted for when calculating wakeup latency.
>
> Therefore, move the sched_wakeup instrumentation back to the waker
> context to fix those two shortcomings.

What do you consider wakeup-latency? I don't see how moving the
tracepoint into the caller will magically account the queue time.
> +++ b/kernel/sched/core.c
> @@ -1457,7 +1457,6 @@ static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
>  	check_preempt_curr(rq, p, wake_flags);
> -	trace_sched_wakeup(p, true);
>
>  	p->state = TASK_RUNNING;
>  #ifdef CONFIG_SMP
> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
>  	if (task_on_rq_queued(p)) {
>  		/* check_preempt_curr() may use rq clock */
>  		update_rq_clock(rq);
> +		trace_sched_wakeup(p, true);
>  		ttwu_do_wakeup(rq, p, wake_flags);
>  		ret = 1;
>  	}
> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>
> +	trace_sched_wakeup(p, true);
>  #if defined(CONFIG_SMP)
>  	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>  		sched_clock_cpu(cpu); /* sync clocks x-cpu */

You only need one site in try_to_wake_up(), put it right after
success = 1.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 12:09 ` Peter Zijlstra
@ 2015-06-05 12:32   ` Mathieu Desnoyers
  2015-06-05 12:51     ` Peter Zijlstra
  2015-06-05 12:32   ` Thomas Gleixner
  1 sibling, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-05 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
>> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
>> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
>> remote wakeups. This commit appeared upstream in Linux v3.0.
>>
>> [...]
>>
>> Therefore, move the sched_wakeup instrumentation back to the waker
>> context to fix those two shortcomings.
>
> What do you consider wakeup-latency? I don't see how moving the
> tracepoint into the caller will magically account the queue time.

That would be the delay between timestamp@wakeup (in the waker) to
timestamp@sched_switch_in (in the wakee).

ttwu_pending is going through a lockless list populated before sending
the IPI, and iterated on by the IPI handler.
Therefore, moving the tracepoint before the enqueue will account for the
time spent in the queue in the wakeup latency.

>
>> +++ b/kernel/sched/core.c
>> @@ -1457,7 +1457,6 @@ static void
>>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>  {
>>  	check_preempt_curr(rq, p, wake_flags);
>> -	trace_sched_wakeup(p, true);
>>
>>  	p->state = TASK_RUNNING;
>>  #ifdef CONFIG_SMP
>> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
>>  	if (task_on_rq_queued(p)) {
>>  		/* check_preempt_curr() may use rq clock */
>>  		update_rq_clock(rq);
>> +		trace_sched_wakeup(p, true);
>>  		ttwu_do_wakeup(rq, p, wake_flags);
>>  		ret = 1;
>>  	}
>> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>>  {
>>  	struct rq *rq = cpu_rq(cpu);
>>
>> +	trace_sched_wakeup(p, true);
>>  #if defined(CONFIG_SMP)
>>  	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>>  		sched_clock_cpu(cpu); /* sync clocks x-cpu */
>
> You only need one site in try_to_wake_up(), put it right after
> success = 1.

There are code paths further down the chain that check whether the task
is not fully dequeued from the runqueue (e.g. after grabbing the rq lock
in ttwu_remote()) and skip the sched_wakeup event in those cases. Do we
care about not tracing an event in those cases, or would tracing it (as
per your suggestion) be preferable?

Also, moving the tracepoint to try_to_wake_up() would not be enough. We
would also need to instrument try_to_wake_up_local().

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
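[Editor's note: the waker-to-switch-in delay Mathieu describes can be sketched as a small analysis pass. This is not code from the thread; the event names (`sched_wakeup`, `sched_switch_in`) and the `(timestamp, event, pid)` record layout are assumptions standing in for whatever the trace format actually provides.]

```python
# Sketch: compute the delay from sched_wakeup (emitted in the waker
# context) to the wakee being switched in, per pid. Record layout
# (timestamp, event_name, pid) is hypothetical.

def wakeup_to_switch_latency(events):
    """Return {pid: latency} for each matched wakeup/switch-in pair."""
    pending = {}    # pid -> timestamp of the (earliest) wakeup
    latencies = {}
    for ts, name, pid in events:
        if name == "sched_wakeup":
            pending.setdefault(pid, ts)   # keep the first wakeup seen
        elif name == "sched_switch_in" and pid in pending:
            latencies[pid] = ts - pending.pop(pid)
    return latencies

trace = [
    (100, "sched_wakeup", 42),      # waker context: pid 42 is woken
    (130, "sched_switch_in", 42),   # wakee starts running 30 units later
]
print(wakeup_to_switch_latency(trace))  # {42: 30}
```

Because the wakeup timestamp is taken before the enqueue on the pending list, the IPI queueing delay is included in the computed latency, which is exactly the point being argued here.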
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 12:32 ` Mathieu Desnoyers
@ 2015-06-05 12:51   ` Peter Zijlstra
  2015-06-05 13:23     ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-05 12:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

On Fri, 2015-06-05 at 12:32 +0000, Mathieu Desnoyers wrote:
> ----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
> >> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
> >> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
> >> remote wakeups. This commit appeared upstream in Linux v3.0.
> >>
> >> [...]
> >
> > What do you consider wakeup-latency? I don't see how moving the
> > tracepoint into the caller will magically account the queue time.
>
> That would be the delay between timestamp@wakeup (in the waker) to
> timestamp@sched_switch_in (in the wakee).

That's scheduling latency. It's how long it takes a runnable task to
become running.
Wakeup latency would be how long it takes from 'event' to the task being
runnable.

The unfortunate situation here is that placing that tracepoint in the
wakee -- which makes sense -- makes measuring either of the above
latencies impossible because it does not capture the becoming RUNNABLE
part.

> ttwu_pending is going through a lockless list populated before sending
> the IPI, and iterated on by the IPI handler. Therefore, moving the
> tracepoint before the enqueue will account for the time spent in the
> queue in the wakeup latency.

Which is wrong from the point of view of both latencies.

> >> [...]
> >
> > You only need one site in try_to_wake_up(), put it right after
> > success = 1.
>
> There are code paths further down the chain that check whether the task
> is not fully dequeued from the runqueue (e.g. after grabbing the rq
> lock in ttwu_remote()) and skip the sched_wakeup event in those cases.
> Do we care about not tracing an event in those cases, or would tracing
> it (as per your suggestion) be preferable?

It's an actual wakeup, we change task->state, therefore one should trace
it.
> Also, moving the tracepoint to try_to_wake_up() would not be enough. We
> would also need to instrument try_to_wake_up_local().

Sure, I just objected to having 2 tracepoints in the 'normal' wakeup
path.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 12:51 ` Peter Zijlstra
@ 2015-06-05 13:23   ` Mathieu Desnoyers
  2015-06-06 12:02     ` Peter Zijlstra
  2015-06-08  6:55     ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-05 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

----- On Jun 5, 2015, at 2:51 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, 2015-06-05 at 12:32 +0000, Mathieu Desnoyers wrote:
>> ----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@infradead.org wrote:
>>
>> > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
>> >> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
>> >> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
>> >> remote wakeups. This commit appeared upstream in Linux v3.0.
>> >>
>> >> [...]
>> >
>> > What do you consider wakeup-latency? I don't see how moving the
>> > tracepoint into the caller will magically account the queue time.
>>
>> That would be the delay between timestamp@wakeup (in the waker) to
>> timestamp@sched_switch_in (in the wakee).
>
> That's scheduling latency. It's how long it takes a runnable task to
> become running. Wakeup latency would be how long it takes from 'event'
> to the task being runnable.
>
> The unfortunate situation here is that placing that tracepoint in the
> wakee -- which makes sense -- makes measuring either of the above
> latencies impossible because it does not capture the becoming RUNNABLE
> part.
>
>> ttwu_pending is going through a lockless list populated before sending
>> the IPI, and iterated on by the IPI handler. Therefore, moving the
>> tracepoint before the enqueue will account for the time spent in the
>> queue in the wakeup latency.
>
> Which is wrong from the point of view of both latencies.

OK, so considering the definition naming feedback you provided, we may
need 3 tracepoints if we want to calculate both wakeup latency and
scheduling latency (naming ofc open to discussion):

sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
sched_activate_task: when the wakee is marked runnable.
sched_switch: when scheduling actually happens.

We can then calculate wakeup latency as

  time@sched_activate - time@sched_wakeup

And scheduling latency as

  time@sched_switch - time@sched_activate

In the case of critical path analysis, we don't care about this level of
granularity. What we care about is the sum of the two, which we can
express as:

  time@sched_switch - time@sched_wakeup

Is there an officially blessed name for this?
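[Editor's note: with the three events proposed here, the two latencies and their sum reduce to timestamp deltas. A minimal sketch, not from the thread; the function and the numeric timestamps are illustrative only.]

```python
# Sketch: given one wakeup chain's three timestamps --
#   t_wakeup   (try_to_wake_up called in the waker),
#   t_activate (wakee marked runnable),
#   t_switch   (wakee actually scheduled in) --
# split out wakeup latency, scheduling latency, and their sum.

def split_latencies(t_wakeup, t_activate, t_switch):
    wakeup_latency = t_activate - t_wakeup   # time sitting in the pending queue
    sched_latency = t_switch - t_activate    # runnable -> running
    total = t_switch - t_wakeup              # what critical-path analysis wants
    return wakeup_latency, sched_latency, total

print(split_latencies(100, 115, 140))  # (15, 25, 40)
```

The point of the third event is precisely that the first two deltas cannot be separated if only the endpoints are instrumented.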
>> > [...]
>> >
>> > You only need one site in try_to_wake_up(), put it right after
>> > success = 1.
>>
>> There are code paths further down the chain that check whether the
>> task is not fully dequeued from the runqueue (e.g. after grabbing the
>> rq lock in ttwu_remote()) and skip the sched_wakeup event in those
>> cases. Do we care about not tracing an event in those cases, or would
>> tracing it (as per your suggestion) be preferable?
>
> It's an actual wakeup, we change task->state, therefore one should
> trace it.

Works for me!

>> Also, moving the tracepoint to try_to_wake_up() would not be enough.
>> We would also need to instrument try_to_wake_up_local().
>
> Sure, I just objected to having 2 tracepoints in the 'normal' wakeup
> path.

Agreed. We may need to see if we want to add a "sched_activate_task"
tracepoint though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 13:23 ` Mathieu Desnoyers
@ 2015-06-06 12:02   ` Peter Zijlstra
  2015-06-07 10:20     ` Mathieu Desnoyers
  2015-06-08  6:55   ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-06 12:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote:
> OK, so considering the definition naming feedback you provided, we may
> need 3 tracepoints if we want to calculate both wakeup latency and
> scheduling latency (naming ofc open to discussion):
>
> sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
> sched_activate_task: when the wakee is marked runnable.
> sched_switch: when scheduling actually happens.

I would propose:

sched_waking: upon calling try_to_wake_up() as soon as we know we need
to change state; guaranteed to be called from the context doing the
wakeup.

sched_woken: the wakeup is complete (task is runnable, any delay between
this and actually getting on a cpu is down to the scheduler).

sched_switch: when switching from task @prev to @next.

This means abandoning trace_sched_wakeup(); which might be a problem,
which is why I bloody hate tracepoints :-(

> We can then calculate wakeup latency as
>
>   time@sched_activate - time@sched_wakeup
>
> And scheduling latency as
>
>   time@sched_switch - time@sched_activate
>
> In the case of critical path analysis, we don't care about this level
> of granularity. What we care about is the sum of the two, which we can
> express as:
>
>   time@sched_switch - time@sched_wakeup
>
> Is there an officially blessed name for this?

No idea.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-06 12:02 ` Peter Zijlstra
@ 2015-06-07 10:20   ` Mathieu Desnoyers
  2015-06-08 17:27     ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-07 10:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau, lttng-dev

----- On Jun 6, 2015, at 2:02 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote:
>> OK, so considering the definition naming feedback you provided, we may
>> need 3 tracepoints if we want to calculate both wakeup latency and
>> scheduling latency (naming ofc open to discussion):
>>
>> sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
>> sched_activate_task: when the wakee is marked runnable.
>> sched_switch: when scheduling actually happens.
>
> I would propose:
>
> sched_waking: upon calling try_to_wake_up() as soon as we know we need
> to change state; guaranteed to be called from the context doing the
> wakeup.
>
> sched_woken: the wakeup is complete (task is runnable, any delay
> between this and actually getting on a cpu is down to the scheduler).
>
> sched_switch: when switching from task @prev to @next.

Agreed.

> This means abandoning trace_sched_wakeup(); which might be a problem,
> which is why I bloody hate tracepoints :-(

OK. I guess it's about time we dive into that question. Should
tracepoint semantics be kept cast in stone forever? Not in my opinion,
and here is why.

Most of the Linux kernel ABI exposed to userspace serves as support to
runtime (system calls, virtual file systems, etc). For all that, it
makes tons of sense to keep it stable, following the
Documentation/ABI/README guidelines. Even there, we have provisions for
obsolescence and removal of an ABI if need be, which provides userspace
some time to adapt to changes.

How are tracepoints different?
Well, those are not meant to be used in runtime support, but rather for
analyzing systems, which means that userspace tools using the tracepoint
content do not need it to _run_, but rather use it as an information
source to perform analyses.

Even though I dislike analogies, I think we need one here. Let's
consider CAN bus ports for car debugging. Even though the transport is
covered by standards, it does not mandate the semantics of the data per
se. I would not expect a debugging device made in 2005 to work for the
newest generations of cars. However, I would expect that new debug
devices are compatible with older cars, and that those debug devices
have means to query which type of car they are debugging. Otherwise, the
debugging device is simply crap, because it cannot adapt to change. What
should a debug device created in 2005 do if connected to a new car?
Ideally, it should gracefully decline to interact with this car, and
require a software upgrade.

OK, now back to kernel tracepoints. My opinion is that it is a
fundamental requirement that trace analysis tools should be able to
detect that they are unable to understand tracepoint data they care
about. It seems perfectly fine to me to require that analysis tool
upgrades are needed to interact with a new kernel. However, a tool
should be able to handle a range of older kernel versions too.

This can be done by many means, including making sure preexisting event
names and field semantics are immutable, or by versioning of tracepoints
on a per-event basis.

Here, in the case of sched_wakeup: we end up noticing that it
accidentally changed location in the kernel across versions, which makes
it useless for many analyses unless they use kernel version information
to get the right semantics associated with this event.

So here, for introducing sched_waking/sched_woken, we have a few ways
forward:

1) Keep sched_wakeup as it is, and add those two new events.
   Analyses can then continue using the old event for a while, and if
   they see that sched_waking/sched_woken are there, they can use those
   more precise events instead. This could allow us to do a gradual
   deprecation phase for the sched_wakeup tracepoint.

2) Remove the sched_wakeup event, replacing it by
   sched_waking/sched_woken. Require immediate analysis tool upgrade to
   deal with this new information. Old tools should gracefully fail and
   ask users to upgrade. If they don't, fix them so they can handle
   change.

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
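[Editor's note: option 1's gradual deprecation implies that a tool probes which events the running kernel exposes and falls back accordingly. A sketch of that selection logic, not from the thread; the `sched:` event names and the idea of reading them from `available_events` are assumptions about how such a tool might be written.]

```python
# Sketch: an analysis tool picking the most precise wakeup event set the
# kernel offers, falling back to the legacy sched_wakeup, and failing
# gracefully (option 2's requirement) when nothing usable exists.
# `available` would typically be built from the names listed in
# /sys/kernel/debug/tracing/available_events.

def choose_wakeup_events(available):
    if {"sched:sched_waking", "sched:sched_woken"} <= available:
        return ["sched:sched_waking", "sched:sched_woken"]  # precise pair
    if "sched:sched_wakeup" in available:
        return ["sched:sched_wakeup"]   # legacy, waker context not guaranteed
    raise RuntimeError("no usable wakeup tracepoint; please upgrade this tool")

old_kernel = {"sched:sched_wakeup", "sched:sched_switch"}
print(choose_wakeup_events(old_kernel))  # ['sched:sched_wakeup']
```

This is the "debug device queries which car it is plugged into" behavior from the analogy above, expressed as feature detection rather than kernel-version sniffing.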
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-07 10:20 ` Mathieu Desnoyers
@ 2015-06-08 17:27   ` Steven Rostedt
  2015-06-09  9:13     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2015-06-08 17:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Ingo Molnar,
      Francis Giraldeau, lttng-dev, Linus Torvalds

[ Keeping entire email because I added Linus ]

On Sun, 7 Jun 2015 10:20:14 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Jun 6, 2015, at 2:02 PM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote:
> >> OK, so considering the definition naming feedback you provided, we
> >> may need 3 tracepoints if we want to calculate both wakeup latency
> >> and scheduling latency (naming ofc open to discussion):
> >>
> >> sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
> >> sched_activate_task: when the wakee is marked runnable.
> >> sched_switch: when scheduling actually happens.
> >
> > I would propose:
> >
> > sched_waking: upon calling try_to_wake_up() as soon as we know we
> > need to change state; guaranteed to be called from the context doing
> > the wakeup.
> >
> > sched_woken: the wakeup is complete (task is runnable, any delay
> > between this and actually getting on a cpu is down to the scheduler).
> >
> > sched_switch: when switching from task @prev to @next.
>
> Agreed.
>
> > This means abandoning trace_sched_wakeup(); which might be a problem,
> > which is why I bloody hate tracepoints :-(
>
> OK. I guess it's about time we dive into that question. Should
> tracepoint semantics be kept cast in stone forever? Not in my opinion,
> and here is why.
>
> Most of the Linux kernel ABI exposed to userspace serves as support to
> runtime (system calls, virtual file systems, etc).
> For all that, it makes tons of sense to keep it stable, following the
> Documentation/ABI/README guidelines. Even there, we have provisions
> for obsolescence and removal of an ABI if need be, which provides
> userspace some time to adapt to changes.
>
> How are tracepoints different? Well, those are not meant to be used in
> runtime support, but rather for analyzing systems, which means that
> userspace tools using the tracepoint content do not need it to _run_,
> but rather use it as an information source to perform analyses.
>
> Even though I dislike analogies, I think we need one here. Let's
> consider CAN bus ports for car debugging. Even though the transport is
> covered by standards, it does not mandate the semantics of the data
> per se. I would not expect a debugging device made in 2005 to work for
> the newest generations of cars. However, I would expect that new debug
> devices are compatible with older cars, and that those debug devices
> have means to query which type of car they are debugging. Otherwise,
> the debugging device is simply crap, because it cannot adapt to
> change. What should a debug device created in 2005 do if connected to
> a new car? Ideally, it should gracefully decline to interact with this
> car, and require a software upgrade.
>
> OK, now back to kernel tracepoints. My opinion is that it is a
> fundamental requirement that trace analysis tools should be able to
> detect that they are unable to understand tracepoint data they care
> about. It seems perfectly fine to me to require that analysis tool
> upgrades are needed to interact with a new kernel. However, a tool
> should be able to handle a range of older kernel versions too.
>
> This can be done by many means, including making sure preexisting
> event names and field semantics are immutable, or by versioning of
> tracepoints on a per-event basis.
> Here, in the case of sched_wakeup: we end up noticing that it
> accidentally changed location in the kernel across versions, which
> makes it useless for many analyses unless they use kernel version
> information to get the right semantics associated with this event.
>
> So here, for introducing sched_waking/sched_woken, we have a few ways
> forward:
>
> 1) Keep sched_wakeup as it is, and add those two new events. Analyses
>    can then continue using the old event for a while, and if they see
>    that sched_waking/sched_woken are there, they can use those more
>    precise events instead. This could allow us to do a gradual
>    deprecation phase for the sched_wakeup tracepoint.
>
> 2) Remove the sched_wakeup event, replacing it by
>    sched_waking/sched_woken. Require immediate analysis tool upgrade
>    to deal with this new information. Old tools should gracefully fail
>    and ask users to upgrade. If they don't, fix them so they can
>    handle change.

2 will break tools, and even if they fail "gracefully" that probably
still isn't acceptable, as the sched_wakeup tracepoint is a popular one.

3) Add the two tracepoints and remove the sched_wakeup() one, but then
   add a manual tracepoint for perf and ftrace that can simulate the
   sched_wakeup() from the other two tracepoints.

This should keep tools working and we can have better wake up
tracepoints implemented, without the overhead of the third "obsolete"
tracepoint in the scheduling code.

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread
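[Editor's note: Steven's option 3 amounts to synthesizing the legacy event from the new ones so old consumers keep working. A sketch of that translation as a tooling-layer pass, not from the thread; the event names and `(timestamp, event, pid)` record shape are assumptions, and the in-kernel variant for perf/ftrace would of course look different.]

```python
# Sketch: rewrite a stream of new-style records so a tool that only
# understands sched_wakeup keeps working -- emit a synthetic
# sched_wakeup for every sched_woken record, pass everything else
# through unchanged.

def simulate_legacy(events):
    out = []
    for ts, name, pid in events:
        if name == "sched_woken":
            out.append((ts, "sched_wakeup", pid))  # synthetic legacy event
        else:
            out.append((ts, name, pid))
    return out

new_style = [(10, "sched_waking", 7), (12, "sched_woken", 7)]
print(simulate_legacy(new_style))
# [(10, 'sched_waking', 7), (12, 'sched_wakeup', 7)]
```

Mapping the synthetic event to sched_woken (rather than sched_waking) preserves the old semantics, since the original sched_wakeup fired when the task reached TASK_RUNNING.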
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-08 17:27 ` Steven Rostedt
@ 2015-06-09  9:13   ` Peter Zijlstra
  2015-06-09 18:48     ` Mathieu Desnoyers
    ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-09 9:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Thomas Gleixner, Ingo Molnar,
      Francis Giraldeau, lttng-dev, Linus Torvalds

So how about we introduce the 'waking' tracepoint and leave the existing
wakeup one in place and preserve its woken semantics.

Steven, can we do aliases? Where one tracepoint is known to userspace
under multiple names? In that case we could rename the thing to woken
and have an alias wakeup which we can phase out over time.

The patch also takes away the success parameter to the tracepoint, but
does not quite go as far as actually removing it from the tracepoint
itself. We can do that in a follow-up patch which we can quickly revert
if it turns out people are actually still using that for something.
---
 include/trace/events/sched.h | 28 ++++++++++++++++++++--------
 kernel/sched/core.c          |  9 ++++++---
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index d57a575fe31f..b1608d5679e2 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -55,7 +55,7 @@ TRACE_EVENT(sched_kthread_stop_ret,
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,

-	TP_PROTO(struct task_struct *p, int success),
+	TP_PROTO(struct task_struct *p),

-	TP_ARGS(__perf_task(p), success),
+	TP_ARGS(__perf_task(p)),

@@ -71,25 +71,37 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio;
-		__entry->success	= success;
+		__entry->success	= 1; /* rudiment, kill when possible */
 		__entry->target_cpu	= task_cpu(p);
 	),

-	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
+	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
 		  __entry->comm, __entry->pid, __entry->prio,
-		  __entry->success, __entry->target_cpu)
+		  __entry->target_cpu)
 );

+/*
+ * Tracepoint called when waking a task; this tracepoint is guaranteed to be
+ * called from the waking context.
+ */
+DEFINE_EVENT(sched_wakeup_template, sched_waking,
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
+
+/*
+ * Tracepoint called when the task is actually woken; p->state == TASK_RUNNING.
+ * It is not always called from the waking context.
+ */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); /* * Tracepoint for waking up a new task: */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); #ifdef CREATE_TRACE_POINTS static inline long __trace_sched_switch_state(struct task_struct *p) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d5078c0f20e6..354e667620a9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1487,9 +1487,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) { check_preempt_curr(rq, p, wake_flags); - trace_sched_wakeup(p, true); - p->state = TASK_RUNNING; + trace_sched_wakeup(p); + #ifdef CONFIG_SMP if (p->sched_class->task_woken) p->sched_class->task_woken(rq, p); @@ -1695,6 +1695,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) goto out; success = 1; /* we're going to change ->state */ + trace_sched_waking(p); cpu = task_cpu(p); if (p->on_rq && ttwu_remote(p, wake_flags)) @@ -1761,6 +1762,8 @@ static void try_to_wake_up_local(struct task_struct *p) if (!(p->state & TASK_NORMAL)) goto out; + trace_sched_waking(p); + if (!task_on_rq_queued(p)) ttwu_activate(rq, p, ENQUEUE_WAKEUP); @@ -2119,7 +2122,7 @@ void wake_up_new_task(struct task_struct *p) rq = __task_rq_lock(p); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; - trace_sched_wakeup_new(p, true); + trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); #ifdef CONFIG_SMP if (p->sched_class->task_woken) ^ permalink raw reply related [flat|nested] 19+ messages in thread
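[Editorial note] The practical payoff of the proposed sched_waking event is that it fires in the waker's context, so the task current on the emitting CPU identifies the waker. A minimal sketch of how a trace-analysis tool could exploit this to recover waker→wakee edges — the pre-parsed record layout below is hypothetical, not an actual trace format or kernel API:

```python
def waker_edges(events):
    """Map each wakee pid to the pid of its waker.

    events: hypothetical pre-parsed records with keys ts, cpu, current
    (pid of the task running on the emitting CPU), name, and pid (the
    wakee named in the event).
    """
    edges = {}
    for ev in events:
        if ev["name"] == "sched_waking":
            # sched_waking is guaranteed to fire in the waker's context,
            # so "current" on the emitting CPU is the waker.
            edges[ev["pid"]] = ev["current"]
    return edges

events = [
    {"ts": 100.0, "cpu": 0, "current": 42, "name": "sched_waking", "pid": 77},
    {"ts": 100.2, "cpu": 1, "current": 0,  "name": "sched_wakeup", "pid": 77},
]
print(waker_edges(events))  # {77: 42}
```

Note that the same pairing is impossible with sched_wakeup alone when the wakeup is queued remotely: the event then fires in the scheduler IPI, where "current" is whatever happened to run on the target CPU.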
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-09 9:13 ` Peter Zijlstra @ 2015-06-09 18:48 ` Mathieu Desnoyers 2015-06-17 18:23 ` Cong Wang 2015-08-03 17:06 ` [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint tip-bot for Peter Zijlstra 2 siblings, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2015-06-09 18:48 UTC (permalink / raw) To: Peter Zijlstra Cc: rostedt, linux-kernel, Thomas Gleixner, Ingo Molnar, Francis Giraldeau, lttng-dev, Linus Torvalds ----- On Jun 9, 2015, at 11:13 AM, Peter Zijlstra peterz@infradead.org wrote: > So how about we introduce the 'waking' tracepoint and leave the existing > wakeup one in place and preserve its woken semantics. That would work for me, but leaves me wondering how you would move to the new 'woken' name. > > Steven, can we do aliases? Where one tracepoint is known to userspace > under multiple names? In that case we could rename the thing to woken > and have an alias wakeup which we can phase out over time. In LTTng, I have an LTTNG_TRACEPOINT_EVENT_MAP() macro, which allows me to map a kernel event to my own event naming scheme. It allows me to correct namespacing of some odd kernel tracepoints from within the tracer. A similar implementation could be used to implement a tracepoint alias within the kernel: one parameter is the name as exposed to userspace, and the other is the name of the tracepoint it hooks to. You could then rename the tracepoint from 'wakeup' to 'woken', and create an alias exposing the old 'wakeup' event. > > The patch also takes away the success parameter to the tracepoint, but > does not quite go as far as actually removing it from the tracepoint > itself. Sounds like a good intermediate step. If you ensure that the alias declaration allows you to show a different set of fields for the alias, you could even remove the success parameter from 'woken', and only show it for the 'wakeup' alias. 
> > We can do that in a follow up patch which we can quickly revert if it > turns out people are actually still using that for something. Or just pull the plug on the old 'wakeup' event after a deprecation phase (to be defined). Thanks, Mathieu > > --- > include/trace/events/sched.h | 28 ++++++++++++++++++++-------- > kernel/sched/core.c | 9 ++++++--- > 2 files changed, 26 insertions(+), 11 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index d57a575fe31f..b1608d5679e2 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -55,7 +55,7 @@ TRACE_EVENT(sched_kthread_stop_ret, > */ > DECLARE_EVENT_CLASS(sched_wakeup_template, > > - TP_PROTO(struct task_struct *p, int success), > + TP_PROTO(struct task_struct *p), > > TP_ARGS(__perf_task(p), success), > > @@ -71,25 +71,37 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, > memcpy(__entry->comm, p->comm, TASK_COMM_LEN); > __entry->pid = p->pid; > __entry->prio = p->prio; > - __entry->success = success; > + __entry->success = 1; /* rudiment, kill when possible */ > __entry->target_cpu = task_cpu(p); > ), > > - TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d", > + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d", > __entry->comm, __entry->pid, __entry->prio, > - __entry->success, __entry->target_cpu) > + __entry->target_cpu) > ); > > +/* > + * Tracepoint called when waking a task; this tracepoint is guaranteed to be > + * called from the waking context. > + */ > +DEFINE_EVENT(sched_wakeup_template, sched_waking, > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > + > +/* > + * Tracepoint called when the task is actually woken; p->state == TASK_RUNNNG. > + * It it not always called from the waking context. 
> + */ > DEFINE_EVENT(sched_wakeup_template, sched_wakeup, > - TP_PROTO(struct task_struct *p, int success), > - TP_ARGS(p, success)); > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > > /* > * Tracepoint for waking up a new task: > */ > DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, > - TP_PROTO(struct task_struct *p, int success), > - TP_ARGS(p, success)); > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > > #ifdef CREATE_TRACE_POINTS > static inline long __trace_sched_switch_state(struct task_struct *p) > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index d5078c0f20e6..354e667620a9 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1487,9 +1487,9 @@ static void > ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) > { > check_preempt_curr(rq, p, wake_flags); > - trace_sched_wakeup(p, true); > - > p->state = TASK_RUNNING; > + trace_sched_wakeup(p); > + > #ifdef CONFIG_SMP > if (p->sched_class->task_woken) > p->sched_class->task_woken(rq, p); > @@ -1695,6 +1695,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, > int wake_flags) > goto out; > > success = 1; /* we're going to change ->state */ > + trace_sched_waking(p); > cpu = task_cpu(p); > > if (p->on_rq && ttwu_remote(p, wake_flags)) > @@ -1761,6 +1762,8 @@ static void try_to_wake_up_local(struct task_struct *p) > if (!(p->state & TASK_NORMAL)) > goto out; > > + trace_sched_waking(p); > + > if (!task_on_rq_queued(p)) > ttwu_activate(rq, p, ENQUEUE_WAKEUP); > > @@ -2119,7 +2122,7 @@ void wake_up_new_task(struct task_struct *p) > rq = __task_rq_lock(p); > activate_task(rq, p, 0); > p->on_rq = TASK_ON_RQ_QUEUED; > - trace_sched_wakeup_new(p, true); > + trace_sched_wakeup_new(p); > check_preempt_curr(rq, p, WF_FORK); > #ifdef CONFIG_SMP > if (p->sched_class->task_woken) -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-09 9:13 ` Peter Zijlstra 2015-06-09 18:48 ` Mathieu Desnoyers @ 2015-06-17 18:23 ` Cong Wang 2015-06-17 18:47 ` Steven Rostedt 2015-08-03 17:06 ` [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint tip-bot for Peter Zijlstra 2 siblings, 1 reply; 19+ messages in thread From: Cong Wang @ 2015-06-17 18:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Mathieu Desnoyers, LKML, Thomas Gleixner, Ingo Molnar, Francis Giraldeau, lttng-dev, Linus Torvalds On Tue, Jun 9, 2015 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > So how about we introduce the 'waking' tracepoint and leave the existing > wakeup one in place and preserve its woken semantics. > > Steven, can we do aliases? Where one tracepoint is known to userspace > under multiple names? In that case we could rename the thing to woken > and have an alias wakeup which we can phase out over time. > > The patch also takes away the success parameter to the tracepoint, but > does not quite go as far as actually removing it from the tracepoint > itself. > > We can do that in a follow up patch which we can quickly revert if it > turns out people are actually still using that for something. +1 to this patch. How is it going? Here at Twitter, we are analyzing scheduling latencies too, with our own tool using existing tracepoints, it would be nice to have more granularity on the scheduling latency. And, you probably want to change perf sched to respect this new 'waking' event too. ;) Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-17 18:23 ` Cong Wang @ 2015-06-17 18:47 ` Steven Rostedt 0 siblings, 0 replies; 19+ messages in thread From: Steven Rostedt @ 2015-06-17 18:47 UTC (permalink / raw) To: Cong Wang Cc: Peter Zijlstra, Mathieu Desnoyers, LKML, Thomas Gleixner, Ingo Molnar, Francis Giraldeau, lttng-dev, Linus Torvalds On Wed, 17 Jun 2015 11:23:00 -0700 Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Tue, Jun 9, 2015 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > So how about we introduce the 'waking' tracepoint and leave the existing > > wakeup one in place and preserve its woken semantics. > > > > Steven, can we do aliases? Where one tracepoint is known to userspace > > under multiple names? In that case we could rename the thing to woken > > and have an alias wakeup which we can phase out over time. > > > > The patch also takes away the success parameter to the tracepoint, but > > does not quite go as far as actually removing it from the tracepoint > > itself. > > > > We can do that in a follow up patch which we can quickly revert if it > > turns out people are actually still using that for something. > > +1 to this patch. How is it going? It's not a top priority. But it shouldn't be too hard to implement. This could be something I do after the 4.2 merge window closes. -- Steve > > Here at Twitter, we are analyzing scheduling latencies too, with our > own tool using existing tracepoints, it would be nice to have more > granularity on the scheduling latency. > > And, you probably want to change perf sched to respect this > new 'waking' event too. ;) > > Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint 2015-06-09 9:13 ` Peter Zijlstra 2015-06-09 18:48 ` Mathieu Desnoyers 2015-06-17 18:23 ` Cong Wang @ 2015-08-03 17:06 ` tip-bot for Peter Zijlstra 2 siblings, 0 replies; 19+ messages in thread From: tip-bot for Peter Zijlstra @ 2015-08-03 17:06 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, mingo, torvalds, tglx, mathieu.desnoyers, francis.giraldeau, linux-kernel, rostedt, efault, hpa Commit-ID: fbd705a0c6184580d0e2fbcbd47a37b6e5822511 Gitweb: http://git.kernel.org/tip/fbd705a0c6184580d0e2fbcbd47a37b6e5822511 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Tue, 9 Jun 2015 11:13:36 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 3 Aug 2015 12:21:22 +0200 sched: Introduce the 'trace_sched_waking' tracepoint Mathieu reported that since 317f394160e9 ("sched: Move the second half of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of context of the waker. This is a problem when you want to analyse wakeup paths because it is now very hard to correlate the wakeup event to whoever issued the wakeup. OTOH trace_sched_wakeup() is issued at the point where we set p->state = TASK_RUNNING, which is right where we hand the task off to the scheduler, so this is an important point when looking at scheduling behaviour: up to here it's been the wakeup path; everything hereafter is due to scheduler policy. To bridge this gap, introduce a second tracepoint: trace_sched_waking. It is guaranteed to be called in the waker context. 
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Francis Giraldeau <francis.giraldeau@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150609091336.GQ3644@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- include/trace/events/sched.h | 30 +++++++++++++++++++++--------- kernel/sched/core.c | 10 +++++++--- kernel/trace/trace_sched_switch.c | 2 +- kernel/trace/trace_sched_wakeup.c | 2 +- 4 files changed, 30 insertions(+), 14 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index d57a575..539d6bc 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -55,9 +55,9 @@ TRACE_EVENT(sched_kthread_stop_ret, */ DECLARE_EVENT_CLASS(sched_wakeup_template, - TP_PROTO(struct task_struct *p, int success), + TP_PROTO(struct task_struct *p), - TP_ARGS(__perf_task(p), success), + TP_ARGS(__perf_task(p)), TP_STRUCT__entry( __array( char, comm, TASK_COMM_LEN ) @@ -71,25 +71,37 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, memcpy(__entry->comm, p->comm, TASK_COMM_LEN); __entry->pid = p->pid; __entry->prio = p->prio; - __entry->success = success; + __entry->success = 1; /* rudiment, kill when possible */ __entry->target_cpu = task_cpu(p); ), - TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d", + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d", __entry->comm, __entry->pid, __entry->prio, - __entry->success, __entry->target_cpu) + __entry->target_cpu) ); +/* + * Tracepoint called when waking a task; this tracepoint is guaranteed to be + * called from the waking context. 
+ */ +DEFINE_EVENT(sched_wakeup_template, sched_waking, + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); + +/* + * Tracepoint called when the task is actually woken; p->state == TASK_RUNNNG. + * It it not always called from the waking context. + */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); /* * Tracepoint for waking up a new task: */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); #ifdef CREATE_TRACE_POINTS static inline long __trace_sched_switch_state(struct task_struct *p) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48be7dc..fa5826c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1654,9 +1654,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) { check_preempt_curr(rq, p, wake_flags); - trace_sched_wakeup(p, true); - p->state = TASK_RUNNING; + trace_sched_wakeup(p); + #ifdef CONFIG_SMP if (p->sched_class->task_woken) { /* @@ -1874,6 +1874,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) if (!(p->state & state)) goto out; + trace_sched_waking(p); + success = 1; /* we're going to change ->state */ cpu = task_cpu(p); @@ -1949,6 +1951,8 @@ static void try_to_wake_up_local(struct task_struct *p) if (!(p->state & TASK_NORMAL)) goto out; + trace_sched_waking(p); + if (!task_on_rq_queued(p)) ttwu_activate(rq, p, ENQUEUE_WAKEUP); @@ -2307,7 +2311,7 @@ void wake_up_new_task(struct task_struct *p) rq = __task_rq_lock(p); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; - trace_sched_wakeup_new(p, true); + trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); #ifdef CONFIG_SMP if (p->sched_class->task_woken) diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c index 419ca37..f270088 
100644 --- a/kernel/trace/trace_sched_switch.c +++ b/kernel/trace/trace_sched_switch.c @@ -26,7 +26,7 @@ probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *n } static void -probe_sched_wakeup(void *ignore, struct task_struct *wakee, int success) +probe_sched_wakeup(void *ignore, struct task_struct *wakee) { if (unlikely(!sched_ref)) return; diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 9b33dd1..12cbe77 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -514,7 +514,7 @@ static void wakeup_reset(struct trace_array *tr) } static void -probe_wakeup(void *ignore, struct task_struct *p, int success) +probe_wakeup(void *ignore, struct task_struct *p) { struct trace_array_cpu *data; int cpu = smp_processor_id(); ^ permalink raw reply related [flat|nested] 19+ messages in thread
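[Editorial note] One user-visible consequence of the merged patch is that the sched_wakeup text output loses its success= field (the TP_printk format no longer prints it, even though the struct field remains for now). Tools that parse the human-readable trace output can stay compatible across kernel versions by treating the field as optional; a sketch, with illustrative sample lines:

```python
import re

# Parse sched_wakeup text output from either format: the "success=" key
# is present before commit fbd705a0c618 and absent after it.
PAT = re.compile(
    r"comm=(?P<comm>.+) pid=(?P<pid>\d+) prio=(?P<prio>\d+)"
    r"(?: success=(?P<success>\d+))? target_cpu=(?P<cpu>\d+)"
)

def parse_wakeup(line):
    m = PAT.search(line)
    if not m:
        return None
    d = m.groupdict()
    # Drop "success": it was hard-coded to 1 before removal anyway.
    return {"comm": d["comm"], "pid": int(d["pid"]), "prio": int(d["prio"]),
            "target_cpu": int(d["cpu"])}

old = "comm=bash pid=123 prio=120 success=1 target_cpu=002"
new = "comm=bash pid=123 prio=120 target_cpu=002"
print(parse_wakeup(new))  # {'comm': 'bash', 'pid': 123, 'prio': 120, 'target_cpu': 2}
```

Consumers that attach C probes to the tracepoint itself (as trace_sched_switch.c and trace_sched_wakeup.c above) do have to follow the prototype change, since the `int success` argument is gone from TP_PROTO.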
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 13:23 ` Mathieu Desnoyers 2015-06-06 12:02 ` Peter Zijlstra @ 2015-06-08 6:55 ` Peter Zijlstra 2015-06-09 5:53 ` Mathieu Desnoyers 1 sibling, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2015-06-08 6:55 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt, Francis Giraldeau On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote: > sched_wakeup: when try_to_wake_up{,_local} is called in the waker. > sched_activate_task: when the wakee is marked runnable. > sched_switch: when scheduling actually happens. > > We can then calculate wakeup latency as > > time@sched_activate - time@sched_wakeup One more thing, I think I would disagree with this. I would suggest never to use the 'wakeup' (or 'waking' in my proposal) for timing. I would suggest to use your interrupt tracepoint (or whatever else causes wakeup to be called for this). The wakeup times should be measured in tasktime -- of course, if interrupts/preemption are disabled then tasktime == walltime. The scheduling bit OTOH always needs to be measured in walltime, and is most affected by the presence of other tasks on the system. This too is why I'm not sure it makes sense to combine the two into a single measurement. They should be measured in different time domains. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-08 6:55 ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra @ 2015-06-09 5:53 ` Mathieu Desnoyers 0 siblings, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2015-06-09 5:53 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt, Francis Giraldeau ----- On Jun 8, 2015, at 8:55 AM, Peter Zijlstra peterz@infradead.org wrote: > On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote: >> sched_wakeup: when try_to_wake_up{,_local} is called in the waker. >> sched_activate_task: when the wakee is marked runnable. >> sched_switch: when scheduling actually happens. >> >> We can then calculate wakeup latency as >> >> time@sched_activate - time@sched_wakeup > > One more thing, I think I would disagree with this. I would suggest > never to use the 'wakeup' (or 'waking' in my proposal) for timing. I > would suggest to use your interrupt tracepoint (or whatever else causes > wakeup to be called for this). The nice thing about the 'waking' tracepoint is that there is only one to trace if we care about wakeup latency, and it only executes when there is an actual wake up performed. If we do care about more timing precision of the wakeup latency, we might indeed want to trace both 'waking' and whatever calls it (interrupt, set of system calls, etc.) An instrumentation of 'waking' however seems like a good approximation of the moment where the wakeup is requested by the waker. Especially if we compute critical path of a computation: then we already account for process runtime within the waker before the 'woken' event. So we don't really care about the extra precision that we would get by tracing the exact syscall entry point. In all cases we need the wakee PID awakened by 'waking', which is unavailable unless we add the 'waking' event. 
> > The wakeup times should be measured in tasktime -- of course, if > interrupts/preemption are disabled then tasktime == walltime. > > The scheduling bit OTOH always needs to be measured in walltime, and is > most affected by the presence of other tasks on the system. > > This too is why I'm not sure it makes sense to combine the two into a > single measurement. They should be measured in different time domains. That's a very good point! Agreed. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
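[Editorial note] With both events available, the two measurements Peter and Mathieu agree to keep separate can be computed independently: the waking→wakeup delta captures the queueing/IPI delay, and the wakeup→switch-in delta captures the scheduling latency proper. An illustrative sketch over fabricated timestamps in nanoseconds ('sched_switch_in' is shorthand for the sched_switch event where the task switches in, not a real event name):

```python
def latencies(events, pid):
    """Return (waking->wakeup delay, wakeup->running delay) for one pid."""
    t = {}
    for ts, name, p in events:
        if p == pid:
            t.setdefault(name, ts)  # keep the first occurrence of each event
    queue_delay = t["sched_wakeup"] - t["sched_waking"]        # pending-queue/IPI time
    sched_latency = t["sched_switch_in"] - t["sched_wakeup"]   # runnable -> running
    return queue_delay, sched_latency

trace = [
    (100_000, "sched_waking", 77),     # waker requests the wakeup
    (100_300, "sched_wakeup", 77),     # wakee marked TASK_RUNNING
    (101_000, "sched_switch_in", 77),  # wakee actually gets the CPU
]
print(latencies(trace, 77))  # (300, 700)
```

As the discussion notes, the two numbers live in different time domains — the first is attributable to the wakeup path, the second to scheduler policy and competing load — so a tool should report them separately rather than as one sum.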
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:09 ` Peter Zijlstra 2015-06-05 12:32 ` Mathieu Desnoyers @ 2015-06-05 12:32 ` Thomas Gleixner 2015-06-05 12:36 ` Mathieu Desnoyers 2015-06-05 12:46 ` Peter Zijlstra 1 sibling, 2 replies; 19+ messages in thread From: Thomas Gleixner @ 2015-06-05 12:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Steven Rostedt, Francis Giraldeau On Fri, 5 Jun 2015, Peter Zijlstra wrote: > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote: > > Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu" > > moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for > > remote wakeups. This commit appeared upstream in Linux v3.0. > > > > Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup" > > tracepoint. Analyzing wakup latencies depends on getting the wakeup > > chain right: which process is the waker, which is the wakee. Moving this > > instrumention outside of the waker context prevents trace analysis tools > > from getting the waker pid, either through "current" in the tracepoint > > probe, or by deducing it using other scheduler events based on the CPU > > executing the tracepoint. > > > > Another side-effect of moving this instrumentation to the scheduler ipi > > is that the delay during which the wakeup is sitting in the pending > > queue is not accounted for when calculating wakeup latency. > > > > Therefore, move the sched_wakeup instrumentation back to the waker > > context to fix those two shortcomings. > > What do you consider wakeup-latency? I don't see how moving the > tracepoint into the caller will magically account the queue time. Well, the point of wakeup is when the wakee calls wakeup. If the trace point is in the IPI then you account the time between the wakeup and the actuall handling in the IPI to the wakee instead of accounting it to the time between wakeup and sched switch. 
Thanks, tglx
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:32 ` Thomas Gleixner @ 2015-06-05 12:36 ` Mathieu Desnoyers 2015-06-05 12:46 ` Peter Zijlstra 1 sibling, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2015-06-05 12:36 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, rostedt, Giraldeau, Francis ----- On Jun 5, 2015, at 2:32 PM, Thomas Gleixner tglx@linutronix.de wrote: > On Fri, 5 Jun 2015, Peter Zijlstra wrote: >> On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote: >> > Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu" >> > moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for >> > remote wakeups. This commit appeared upstream in Linux v3.0. >> > >> > Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup" >> > tracepoint. Analyzing wakup latencies depends on getting the wakeup >> > chain right: which process is the waker, which is the wakee. Moving this >> > instrumention outside of the waker context prevents trace analysis tools >> > from getting the waker pid, either through "current" in the tracepoint >> > probe, or by deducing it using other scheduler events based on the CPU >> > executing the tracepoint. >> > >> > Another side-effect of moving this instrumentation to the scheduler ipi >> > is that the delay during which the wakeup is sitting in the pending >> > queue is not accounted for when calculating wakeup latency. >> > >> > Therefore, move the sched_wakeup instrumentation back to the waker >> > context to fix those two shortcomings. >> >> What do you consider wakeup-latency? I don't see how moving the >> tracepoint into the caller will magically account the queue time. > > Well, the point of wakeup is when the wakee calls wakeup. If the trace ^ I think you actually mean "when the waker calls wakeup". 
> point is in the IPI then you account the time between the wakeup and > the actuall handling in the IPI to the wakee instead of accounting it > to the time between wakeup and sched switch. > Thanks, Mathieu > Thanks, > > tglx -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:32 ` Thomas Gleixner 2015-06-05 12:36 ` Mathieu Desnoyers @ 2015-06-05 12:46 ` Peter Zijlstra 2015-06-08 16:54 ` Steven Rostedt 1 sibling, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2015-06-05 12:46 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Steven Rostedt, Francis Giraldeau On Fri, 2015-06-05 at 14:32 +0200, Thomas Gleixner wrote: > On Fri, 5 Jun 2015, Peter Zijlstra wrote: > > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote: > > > Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu" > > > moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for > > > remote wakeups. This commit appeared upstream in Linux v3.0. > > > > > > Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup" > > > tracepoint. Analyzing wakup latencies depends on getting the wakeup > > > chain right: which process is the waker, which is the wakee. Moving this > > > instrumention outside of the waker context prevents trace analysis tools > > > from getting the waker pid, either through "current" in the tracepoint > > > probe, or by deducing it using other scheduler events based on the CPU > > > executing the tracepoint. > > > > > > Another side-effect of moving this instrumentation to the scheduler ipi > > > is that the delay during which the wakeup is sitting in the pending > > > queue is not accounted for when calculating wakeup latency. > > > > > > Therefore, move the sched_wakeup instrumentation back to the waker > > > context to fix those two shortcomings. > > > > What do you consider wakeup-latency? I don't see how moving the > > tracepoint into the caller will magically account the queue time. > > Well, the point of wakeup is when the wakee calls wakeup. 
If the trace > point is in the IPI then you account the time between the wakeup and > the actuall handling in the IPI to the wakee instead of accounting it > to the time between wakeup and sched switch. My point exactly, wake->schedule is what we call the scheduling latency, not the wake latency, which would be from 'event' to the task being runnable. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:46 ` Peter Zijlstra @ 2015-06-08 16:54 ` Steven Rostedt 0 siblings, 0 replies; 19+ messages in thread From: Steven Rostedt @ 2015-06-08 16:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Thomas Gleixner, Mathieu Desnoyers, linux-kernel, Ingo Molnar, Francis Giraldeau On Fri, 05 Jun 2015 14:46:46 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > to the time between wakeup and sched switch. > > My point exactly, wake->schedule is what we call the scheduling latency, > not the wake latency, which would be from 'event' to the task being > runnable. Right, which means the tracepoint should be as close to the "wakeup_process()" or whatever calls try_to_wake_up(). Which would be the "sched_waking" part of your proposal. -- Steve ^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages
2015-06-05 11:41 [RFC PATCH] sched: Fix sched_wakeup tracepoint Mathieu Desnoyers
2015-06-05 12:09 ` Peter Zijlstra
2015-06-05 12:32   ` Mathieu Desnoyers
2015-06-05 12:51     ` Peter Zijlstra
2015-06-05 13:23       ` Mathieu Desnoyers
2015-06-06 12:02         ` Peter Zijlstra
2015-06-07 10:20           ` Mathieu Desnoyers
2015-06-08 17:27             ` Steven Rostedt
2015-06-09  9:13               ` Peter Zijlstra
2015-06-09 18:48                 ` Mathieu Desnoyers
2015-06-17 18:23                 ` Cong Wang
2015-06-17 18:47                   ` Steven Rostedt
2015-08-03 17:06                 ` [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint tip-bot for Peter Zijlstra
2015-06-08  6:55         ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra
2015-06-09  5:53           ` Mathieu Desnoyers
2015-06-05 12:32   ` Thomas Gleixner
2015-06-05 12:36     ` Mathieu Desnoyers
2015-06-05 12:46     ` Peter Zijlstra
2015-06-08 16:54       ` Steven Rostedt