* [RFC PATCH] sched: Fix sched_wakeup tracepoint
@ 2015-06-05 11:41 Mathieu Desnoyers
  2015-06-05 12:09 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-05 11:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Thomas Gleixner, Ingo Molnar,
      Steven Rostedt, Francis Giraldeau

Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
remote wakeups. This commit appeared upstream in Linux v3.0.

Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
tracepoint. Analyzing wakeup latencies depends on getting the wakeup
chain right: which process is the waker, which is the wakee. Moving this
instrumentation outside of the waker context prevents trace analysis
tools from getting the waker pid, either through "current" in the
tracepoint probe, or by deducing it using other scheduler events based
on the CPU executing the tracepoint.

Another side-effect of moving this instrumentation to the scheduler IPI
is that the delay during which the wakeup is sitting in the pending
queue is not accounted for when calculating wakeup latency.

Therefore, move the sched_wakeup instrumentation back to the waker
context to fix those two shortcomings.

This patch is build-tested only, submitted for feedback.

Francis, can you try it out with your critical path analysis?
Fixes: 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
Reported-by: Francis Giraldeau <francis.giraldeau@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Francis Giraldeau <francis.giraldeau@gmail.com>
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d5f6f6..0ed2021 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1457,7 +1457,6 @@ static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	check_preempt_curr(rq, p, wake_flags);
-	trace_sched_wakeup(p, true);

 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
@@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 	if (task_on_rq_queued(p)) {
 		/* check_preempt_curr() may use rq clock */
 		update_rq_clock(rq);
+		trace_sched_wakeup(p, true);
 		ttwu_do_wakeup(rq, p, wake_flags);
 		ret = 1;
 	}
@@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);

+	trace_sched_wakeup(p, true);
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
@@ -1734,6 +1735,7 @@ static void try_to_wake_up_local(struct task_struct *p)
 	if (!task_on_rq_queued(p))
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);

+	trace_sched_wakeup(p, true);
 	ttwu_do_wakeup(rq, p, 0);
 	ttwu_stat(p, smp_processor_id(), 0);
 out:
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 11:41 [RFC PATCH] sched: Fix sched_wakeup tracepoint Mathieu Desnoyers
@ 2015-06-05 12:09 ` Peter Zijlstra
  2015-06-05 12:32   ` Mathieu Desnoyers
  2015-06-05 12:32   ` Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-05 12:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Steven Rostedt,
      Francis Giraldeau

On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
> remote wakeups. This commit appeared upstream in Linux v3.0.
>
> Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
> tracepoint. Analyzing wakeup latencies depends on getting the wakeup
> chain right: which process is the waker, which is the wakee. Moving this
> instrumentation outside of the waker context prevents trace analysis
> tools from getting the waker pid, either through "current" in the
> tracepoint probe, or by deducing it using other scheduler events based
> on the CPU executing the tracepoint.
>
> Another side-effect of moving this instrumentation to the scheduler IPI
> is that the delay during which the wakeup is sitting in the pending
> queue is not accounted for when calculating wakeup latency.
>
> Therefore, move the sched_wakeup instrumentation back to the waker
> context to fix those two shortcomings.

What do you consider wakeup-latency? I don't see how moving the
tracepoint into the caller will magically account the queue time.
> +++ b/kernel/sched/core.c
> @@ -1457,7 +1457,6 @@ static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
>  	check_preempt_curr(rq, p, wake_flags);
> -	trace_sched_wakeup(p, true);
>
>  	p->state = TASK_RUNNING;
>  #ifdef CONFIG_SMP
> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
>  	if (task_on_rq_queued(p)) {
>  		/* check_preempt_curr() may use rq clock */
>  		update_rq_clock(rq);
> +		trace_sched_wakeup(p, true);
>  		ttwu_do_wakeup(rq, p, wake_flags);
>  		ret = 1;
>  	}
> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>
> +	trace_sched_wakeup(p, true);
>  #if defined(CONFIG_SMP)
>  	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>  		sched_clock_cpu(cpu); /* sync clocks x-cpu */

You only need one site in try_to_wake_up(), put it right after
success = 1.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 12:09 ` Peter Zijlstra
@ 2015-06-05 12:32   ` Mathieu Desnoyers
  2015-06-05 12:51     ` Peter Zijlstra
  2015-06-05 12:32   ` Thomas Gleixner
  1 sibling, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-05 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
>> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
>> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
>> remote wakeups. This commit appeared upstream in Linux v3.0.
>>
>> [...]
>>
>> Therefore, move the sched_wakeup instrumentation back to the waker
>> context to fix those two shortcomings.
>
> What do you consider wakeup-latency? I don't see how moving the
> tracepoint into the caller will magically account the queue time.

That would be the delay between timestamp@wakeup (in the waker) to
timestamp@sched_switch_in (in the wakee).

ttwu_pending is going through a lockless list populated before sending
the IPI, and iterated on by the IPI handler.
Therefore, moving the tracepoint before the enqueue will account for the
time spent in the queue in the wakeup latency.

>
>> +++ b/kernel/sched/core.c
>> @@ -1457,7 +1457,6 @@ static void
>>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>  {
>>  	check_preempt_curr(rq, p, wake_flags);
>> -	trace_sched_wakeup(p, true);
>>
>>  	p->state = TASK_RUNNING;
>>  #ifdef CONFIG_SMP
>> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
>>  	if (task_on_rq_queued(p)) {
>>  		/* check_preempt_curr() may use rq clock */
>>  		update_rq_clock(rq);
>> +		trace_sched_wakeup(p, true);
>>  		ttwu_do_wakeup(rq, p, wake_flags);
>>  		ret = 1;
>>  	}
>> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>>  {
>>  	struct rq *rq = cpu_rq(cpu);
>>
>> +	trace_sched_wakeup(p, true);
>>  #if defined(CONFIG_SMP)
>>  	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>>  		sched_clock_cpu(cpu); /* sync clocks x-cpu */
>
> You only need one site in try_to_wake_up(), put it right after
> success = 1.

There are code paths further down the chain that check whether the task
is not fully dequeued from the runqueue (e.g. after grabbing the rq lock
in ttwu_remote()) and skip the sched_wakeup event in those cases. Do we
care about not tracing an event in those cases, or would tracing it (as
per your suggestion) be preferable?

Also, moving the tracepoint to try_to_wake_up() would not be enough. We
would also need to instrument try_to_wake_up_local().

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
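[Editor's note: the waker-to-switch-in delay Mathieu describes can be sketched as a small analysis pass. This is not code from the thread; the event names (`sched_wakeup`, `sched_switch_in`) and the `(timestamp, event, pid)` record layout are assumptions standing in for whatever the trace format actually provides.]

```python
# Sketch: compute the delay from sched_wakeup (emitted in the waker
# context) to the wakee being switched in, per pid. Record layout
# (timestamp, event_name, pid) is hypothetical.

def wakeup_to_switch_latency(events):
    """Return {pid: latency} for each matched wakeup/switch-in pair."""
    pending = {}    # pid -> timestamp of the (earliest) wakeup
    latencies = {}
    for ts, name, pid in events:
        if name == "sched_wakeup":
            pending.setdefault(pid, ts)   # keep the first wakeup seen
        elif name == "sched_switch_in" and pid in pending:
            latencies[pid] = ts - pending.pop(pid)
    return latencies

trace = [
    (100, "sched_wakeup", 42),      # waker context: pid 42 is woken
    (130, "sched_switch_in", 42),   # wakee starts running 30 units later
]
print(wakeup_to_switch_latency(trace))  # {42: 30}
```

Because the wakeup timestamp is taken before the enqueue on the pending list, the IPI queueing delay is included in the computed latency, which is exactly the point being argued here.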
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 12:32 ` Mathieu Desnoyers
@ 2015-06-05 12:51   ` Peter Zijlstra
  2015-06-05 13:23     ` Mathieu Desnoyers
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-05 12:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

On Fri, 2015-06-05 at 12:32 +0000, Mathieu Desnoyers wrote:
> ----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
> >> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
> >> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
> >> remote wakeups. This commit appeared upstream in Linux v3.0.
> >>
> >> [...]
> >
> > What do you consider wakeup-latency? I don't see how moving the
> > tracepoint into the caller will magically account the queue time.
>
> That would be the delay between timestamp@wakeup (in the waker) to
> timestamp@sched_switch_in (in the wakee).

That's scheduling latency. It's how long it takes a runnable task to
become running.
Wakeup latency would be how long it takes from 'event' to the task being
runnable.

The unfortunate situation here is that placing that tracepoint in the
wakee -- which makes sense -- makes measuring either of the above
latencies impossible because it does not capture the becoming RUNNABLE
part.

> ttwu_pending is going through a lockless list populated before sending
> the IPI, and iterated on by the IPI handler. Therefore, moving the
> tracepoint before the enqueue will account for the time spent in the
> queue in the wakeup latency.

Which is wrong from the point of view of both latencies.

> >> [...]
> >
> > You only need one site in try_to_wake_up(), put it right after
> > success = 1.
>
> There are code paths further down the chain that check whether the task
> is not fully dequeued from the runqueue (e.g. after grabbing the rq
> lock in ttwu_remote()) and skip the sched_wakeup event in those cases.
> Do we care about not tracing an event in those cases, or would tracing
> it (as per your suggestion) be preferable?

It's an actual wakeup, we change task->state, therefore one should trace
it.
> Also, moving the tracepoint to try_to_wake_up() would not be enough. We
> would also need to instrument try_to_wake_up_local().

Sure, I just objected to having 2 tracepoints in the 'normal' wakeup
path.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 12:51 ` Peter Zijlstra
@ 2015-06-05 13:23   ` Mathieu Desnoyers
  2015-06-06 12:02     ` Peter Zijlstra
  2015-06-08  6:55     ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-05 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

----- On Jun 5, 2015, at 2:51 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, 2015-06-05 at 12:32 +0000, Mathieu Desnoyers wrote:
>> ----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@infradead.org wrote:
>>
>> > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
>> >> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
>> >> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
>> >> remote wakeups. This commit appeared upstream in Linux v3.0.
>> >>
>> >> [...]
>> >
>> > What do you consider wakeup-latency? I don't see how moving the
>> > tracepoint into the caller will magically account the queue time.
>>
>> That would be the delay between timestamp@wakeup (in the waker) to
>> timestamp@sched_switch_in (in the wakee).
>
> That's scheduling latency. It's how long it takes a runnable task to
> become running. Wakeup latency would be how long it takes from 'event'
> to the task being runnable.
>
> The unfortunate situation here is that placing that tracepoint in the
> wakee -- which makes sense -- makes measuring either of the above
> latencies impossible because it does not capture the becoming RUNNABLE
> part.
>
>> ttwu_pending is going through a lockless list populated before sending
>> the IPI, and iterated on by the IPI handler. Therefore, moving the
>> tracepoint before the enqueue will account for the time spent in the
>> queue in the wakeup latency.
>
> Which is wrong from the point of view of both latencies.

OK, so considering the definition naming feedback you provided, we may
need 3 tracepoints if we want to calculate both wakeup latency and
scheduling latency (naming ofc open to discussion):

sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
sched_activate_task: when the wakee is marked runnable.
sched_switch: when scheduling actually happens.

We can then calculate wakeup latency as

  time@sched_activate - time@sched_wakeup

And scheduling latency as

  time@sched_switch - time@sched_activate

In the case of critical path analysis, we don't care about this level of
granularity. What we care about is the sum of the two, which we can
express as:

  time@sched_switch - time@sched_wakeup

Is there an officially blessed name for this?
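[Editor's note: with the three events proposed here, the two latencies and their sum reduce to timestamp deltas. A minimal sketch, not from the thread; the function and the numeric timestamps are illustrative only.]

```python
# Sketch: given one wakeup chain's three timestamps --
#   t_wakeup   (try_to_wake_up called in the waker),
#   t_activate (wakee marked runnable),
#   t_switch   (wakee actually scheduled in) --
# split out wakeup latency, scheduling latency, and their sum.

def split_latencies(t_wakeup, t_activate, t_switch):
    wakeup_latency = t_activate - t_wakeup   # time sitting in the pending queue
    sched_latency = t_switch - t_activate    # runnable -> running
    total = t_switch - t_wakeup              # what critical-path analysis wants
    return wakeup_latency, sched_latency, total

print(split_latencies(100, 115, 140))  # (15, 25, 40)
```

The point of the third event is precisely that the first two deltas cannot be separated if only the endpoints are instrumented.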
>> > [...]
>> >
>> > You only need one site in try_to_wake_up(), put it right after
>> > success = 1.
>>
>> There are code paths further down the chain that check whether the
>> task is not fully dequeued from the runqueue (e.g. after grabbing the
>> rq lock in ttwu_remote()) and skip the sched_wakeup event in those
>> cases. Do we care about not tracing an event in those cases, or would
>> tracing it (as per your suggestion) be preferable?
>
> It's an actual wakeup, we change task->state, therefore one should
> trace it.

Works for me!

>> Also, moving the tracepoint to try_to_wake_up() would not be enough.
>> We would also need to instrument try_to_wake_up_local().
>
> Sure, I just objected to having 2 tracepoints in the 'normal' wakeup
> path.

Agreed. We may need to see if we want to add a "sched_activate_task"
tracepoint though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-05 13:23 ` Mathieu Desnoyers
@ 2015-06-06 12:02   ` Peter Zijlstra
  2015-06-07 10:20     ` Mathieu Desnoyers
  2015-06-08  6:55   ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-06 12:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau

On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote:
> OK, so considering the definition naming feedback you provided, we may
> need 3 tracepoints if we want to calculate both wakeup latency and
> scheduling latency (naming ofc open to discussion):
>
> sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
> sched_activate_task: when the wakee is marked runnable.
> sched_switch: when scheduling actually happens.

I would propose:

sched_waking: upon calling try_to_wake_up() as soon as we know we need
to change state; guaranteed to be called from the context doing the
wakeup.

sched_woken: the wakeup is complete (task is runnable, any delay between
this and actually getting on a cpu is down to the scheduler).

sched_switch: when switching from task @prev to @next.

This means abandoning trace_sched_wakeup(); which might be a problem,
which is why I bloody hate tracepoints :-(

> We can then calculate wakeup latency as
>
>   time@sched_activate - time@sched_wakeup
>
> And scheduling latency as
>
>   time@sched_switch - time@sched_activate
>
> In the case of critical path analysis, we don't care about this level
> of granularity. What we care about is the sum of the two, which we can
> express as:
>
>   time@sched_switch - time@sched_wakeup
>
> Is there an officially blessed name for this?

No idea.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-06 12:02 ` Peter Zijlstra
@ 2015-06-07 10:20   ` Mathieu Desnoyers
  2015-06-08 17:27     ` Steven Rostedt
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2015-06-07 10:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt,
      Francis Giraldeau, lttng-dev

----- On Jun 6, 2015, at 2:02 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote:
>> OK, so considering the definition naming feedback you provided, we may
>> need 3 tracepoints if we want to calculate both wakeup latency and
>> scheduling latency (naming ofc open to discussion):
>>
>> sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
>> sched_activate_task: when the wakee is marked runnable.
>> sched_switch: when scheduling actually happens.
>
> I would propose:
>
> sched_waking: upon calling try_to_wake_up() as soon as we know we need
> to change state; guaranteed to be called from the context doing the
> wakeup.
>
> sched_woken: the wakeup is complete (task is runnable, any delay
> between this and actually getting on a cpu is down to the scheduler).
>
> sched_switch: when switching from task @prev to @next.

Agreed.

> This means abandoning trace_sched_wakeup(); which might be a problem,
> which is why I bloody hate tracepoints :-(

OK. I guess it's about time we dive into that question. Should
tracepoint semantics be kept cast in stone forever? Not in my opinion,
and here is why.

Most of the Linux kernel ABI exposed to userspace serves as support to
runtime (system calls, virtual file systems, etc). For all that, it
makes tons of sense to keep it stable, following the
Documentation/ABI/README guidelines. Even there, we have provisions for
obsolescence and removal of an ABI if need be, which provides userspace
some time to adapt to changes.

How are tracepoints different?
Well, those are not meant to be used in runtime support, but rather for
analyzing systems, which means that userspace tools using the tracepoint
content do not need it to _run_, but rather use it as an information
source to perform analyses.

Even though I dislike analogies, I think we need one here. Let's
consider CAN bus ports for car debugging. Even though the transport is
covered by standards, it does not mandate the semantics of the data per
se. I would not expect a debugging device made in 2005 to work for the
newest generations of cars. However, I would expect that new debug
devices are compatible with older cars, and that those debug devices
have means to query which type of car they are debugging. Otherwise, the
debugging device is simply crap, because it cannot adapt to change. What
should a debug device created in 2005 do if connected to a new car?
Ideally, it should gracefully decline to interact with this car, and
require a software upgrade.

OK, now back to kernel tracepoints. My opinion is that it is a
fundamental requirement that trace analysis tools should be able to
detect that they are unable to understand tracepoint data they care
about. It seems perfectly fine to me to require that analysis tool
upgrades are needed to interact with a new kernel. However, a tool
should be able to handle a range of older kernel versions too.

This can be done by many means, including making sure preexisting event
names and field semantics are immutable, or by versioning of tracepoints
on a per-event basis.

Here, in the case of sched_wakeup: we end up noticing that it
accidentally changed location in the kernel across versions, which makes
it useless for many analyses unless they use kernel version information
to get the right semantics associated with this event.

So here, for introducing sched_waking/sched_woken, we have a few ways
forward:

1) Keep sched_wakeup as it is, and add those two new events.
   Analyses can then continue using the old event for a while, and if
   they see that sched_waking/sched_woken are there, they can use those
   more precise events instead. This could allow us to do a gradual
   deprecation phase for the sched_wakeup tracepoint.

2) Remove the sched_wakeup event, replacing it by
   sched_waking/sched_woken. Require immediate analysis tool upgrade to
   deal with this new information. Old tools should gracefully fail and
   ask users to upgrade. If they don't, fix them so they can handle
   change.

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
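[Editor's note: option 1's gradual deprecation implies that a tool probes which events the running kernel exposes and falls back accordingly. A sketch of that selection logic, not from the thread; the `sched:` event names and the idea of reading them from `available_events` are assumptions about how such a tool might be written.]

```python
# Sketch: an analysis tool picking the most precise wakeup event set the
# kernel offers, falling back to the legacy sched_wakeup, and failing
# gracefully (option 2's requirement) when nothing usable exists.
# `available` would typically be built from the names listed in
# /sys/kernel/debug/tracing/available_events.

def choose_wakeup_events(available):
    if {"sched:sched_waking", "sched:sched_woken"} <= available:
        return ["sched:sched_waking", "sched:sched_woken"]  # precise pair
    if "sched:sched_wakeup" in available:
        return ["sched:sched_wakeup"]   # legacy, waker context not guaranteed
    raise RuntimeError("no usable wakeup tracepoint; please upgrade this tool")

old_kernel = {"sched:sched_wakeup", "sched:sched_switch"}
print(choose_wakeup_events(old_kernel))  # ['sched:sched_wakeup']
```

This is the "debug device queries which car it is plugged into" behavior from the analogy above, expressed as feature detection rather than kernel-version sniffing.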
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-07 10:20 ` Mathieu Desnoyers
@ 2015-06-08 17:27   ` Steven Rostedt
  2015-06-09  9:13     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2015-06-08 17:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Ingo Molnar,
      Francis Giraldeau, lttng-dev, Linus Torvalds

[ Keeping entire email because I added Linus ]

On Sun, 7 Jun 2015 10:20:14 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Jun 6, 2015, at 2:02 PM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote:
> >> OK, so considering the definition naming feedback you provided, we
> >> may need 3 tracepoints if we want to calculate both wakeup latency
> >> and scheduling latency (naming ofc open to discussion):
> >>
> >> sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
> >> sched_activate_task: when the wakee is marked runnable.
> >> sched_switch: when scheduling actually happens.
> >
> > I would propose:
> >
> > sched_waking: upon calling try_to_wake_up() as soon as we know we
> > need to change state; guaranteed to be called from the context doing
> > the wakeup.
> >
> > sched_woken: the wakeup is complete (task is runnable, any delay
> > between this and actually getting on a cpu is down to the scheduler).
> >
> > sched_switch: when switching from task @prev to @next.
>
> Agreed.
>
> > This means abandoning trace_sched_wakeup(); which might be a problem,
> > which is why I bloody hate tracepoints :-(
>
> OK. I guess it's about time we dive into that question. Should
> tracepoint semantics be kept cast in stone forever? Not in my opinion,
> and here is why.
>
> Most of the Linux kernel ABI exposed to userspace serves as support to
> runtime (system calls, virtual file systems, etc).
> For all that, it makes tons of sense to keep it stable, following the
> Documentation/ABI/README guidelines. Even there, we have provisions
> for obsolescence and removal of an ABI if need be, which provides
> userspace some time to adapt to changes.
>
> How are tracepoints different? Well, those are not meant to be used in
> runtime support, but rather for analyzing systems, which means that
> userspace tools using the tracepoint content do not need it to _run_,
> but rather use it as an information source to perform analyses.
>
> Even though I dislike analogies, I think we need one here. Let's
> consider CAN bus ports for car debugging. Even though the transport is
> covered by standards, it does not mandate the semantics of the data
> per se. I would not expect a debugging device made in 2005 to work for
> the newest generations of cars. However, I would expect that new debug
> devices are compatible with older cars, and that those debug devices
> have means to query which type of car they are debugging. Otherwise,
> the debugging device is simply crap, because it cannot adapt to
> change. What should a debug device created in 2005 do if connected to
> a new car? Ideally, it should gracefully decline to interact with this
> car, and require a software upgrade.
>
> OK, now back to kernel tracepoints. My opinion is that it is a
> fundamental requirement that trace analysis tools should be able to
> detect that they are unable to understand tracepoint data they care
> about. It seems perfectly fine to me to require that analysis tool
> upgrades are needed to interact with a new kernel. However, a tool
> should be able to handle a range of older kernel versions too.
>
> This can be done by many means, including making sure preexisting
> event names and field semantics are immutable, or by versioning of
> tracepoints on a per-event basis.
> Here, in the case of sched_wakeup: we end up noticing that it
> accidentally changed location in the kernel across versions, which
> makes it useless for many analyses unless they use kernel version
> information to get the right semantics associated with this event.
>
> So here, for introducing sched_waking/sched_woken, we have a few ways
> forward:
>
> 1) Keep sched_wakeup as it is, and add those two new events. Analyses
>    can then continue using the old event for a while, and if they see
>    that sched_waking/sched_woken are there, they can use those more
>    precise events instead. This could allow us to do a gradual
>    deprecation phase for the sched_wakeup tracepoint.
>
> 2) Remove the sched_wakeup event, replacing it by
>    sched_waking/sched_woken. Require immediate analysis tool upgrade
>    to deal with this new information. Old tools should gracefully fail
>    and ask users to upgrade. If they don't, fix them so they can
>    handle change.

2 will break tools, and even if they fail "gracefully" that probably
still isn't acceptable, as the sched_wakeup tracepoint is a popular one.

3) Add the two tracepoints and remove the sched_wakeup() one, but then
   add a manual tracepoint for perf and ftrace that can simulate the
   sched_wakeup() from the other two tracepoints.

This should keep tools working and we can have better wake up
tracepoints implemented, without the overhead of the third "obsolete"
tracepoint in the scheduling code.

-- Steve

^ permalink raw reply	[flat|nested] 19+ messages in thread
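[Editor's note: Steven's option 3 amounts to synthesizing the legacy event from the new ones so old consumers keep working. A sketch of that translation as a tooling-layer pass, not from the thread; the event names and `(timestamp, event, pid)` record shape are assumptions, and the in-kernel variant for perf/ftrace would of course look different.]

```python
# Sketch: rewrite a stream of new-style records so a tool that only
# understands sched_wakeup keeps working -- emit a synthetic
# sched_wakeup for every sched_woken record, pass everything else
# through unchanged.

def simulate_legacy(events):
    out = []
    for ts, name, pid in events:
        if name == "sched_woken":
            out.append((ts, "sched_wakeup", pid))  # synthetic legacy event
        else:
            out.append((ts, name, pid))
    return out

new_style = [(10, "sched_waking", 7), (12, "sched_woken", 7)]
print(simulate_legacy(new_style))
# [(10, 'sched_waking', 7), (12, 'sched_wakeup', 7)]
```

Mapping the synthetic event to sched_woken (rather than sched_waking) preserves the old semantics, since the original sched_wakeup fired when the task reached TASK_RUNNING.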
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint
  2015-06-08 17:27 ` Steven Rostedt
@ 2015-06-09  9:13   ` Peter Zijlstra
  2015-06-09 18:48     ` Mathieu Desnoyers
    ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2015-06-09 9:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Thomas Gleixner, Ingo Molnar,
      Francis Giraldeau, lttng-dev, Linus Torvalds

So how about we introduce the 'waking' tracepoint and leave the existing
wakeup one in place and preserve its woken semantics.

Steven, can we do aliases? Where one tracepoint is known to userspace
under multiple names? In that case we could rename the thing to woken
and have an alias wakeup which we can phase out over time.

The patch also takes away the success parameter to the tracepoint, but
does not quite go as far as actually removing it from the tracepoint
itself. We can do that in a follow-up patch which we can quickly revert
if it turns out people are actually still using that for something.
---
 include/trace/events/sched.h | 28 ++++++++++++++++++++--------
 kernel/sched/core.c          |  9 ++++++---
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index d57a575fe31f..b1608d5679e2 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -55,7 +55,7 @@ TRACE_EVENT(sched_kthread_stop_ret,
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,

-	TP_PROTO(struct task_struct *p, int success),
+	TP_PROTO(struct task_struct *p),

-	TP_ARGS(__perf_task(p), success),
+	TP_ARGS(__perf_task(p)),

@@ -71,25 +71,37 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio;
-		__entry->success	= success;
+		__entry->success	= 1; /* rudiment, kill when possible */
 		__entry->target_cpu	= task_cpu(p);
 	),

-	TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
+	TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
 		  __entry->comm, __entry->pid, __entry->prio,
-		  __entry->success, __entry->target_cpu)
+		  __entry->target_cpu)
 );

+/*
+ * Tracepoint called when waking a task; this tracepoint is guaranteed to be
+ * called from the waking context.
+ */
+DEFINE_EVENT(sched_wakeup_template, sched_waking,
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
+
+/*
+ * Tracepoint called when the task is actually woken; p->state == TASK_RUNNING.
+ * It is not always called from the waking context.
+ */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); /* * Tracepoint for waking up a new task: */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); #ifdef CREATE_TRACE_POINTS static inline long __trace_sched_switch_state(struct task_struct *p) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d5078c0f20e6..354e667620a9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1487,9 +1487,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) { check_preempt_curr(rq, p, wake_flags); - trace_sched_wakeup(p, true); - p->state = TASK_RUNNING; + trace_sched_wakeup(p); + #ifdef CONFIG_SMP if (p->sched_class->task_woken) p->sched_class->task_woken(rq, p); @@ -1695,6 +1695,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) goto out; success = 1; /* we're going to change ->state */ + trace_sched_waking(p); cpu = task_cpu(p); if (p->on_rq && ttwu_remote(p, wake_flags)) @@ -1761,6 +1762,8 @@ static void try_to_wake_up_local(struct task_struct *p) if (!(p->state & TASK_NORMAL)) goto out; + trace_sched_waking(p); + if (!task_on_rq_queued(p)) ttwu_activate(rq, p, ENQUEUE_WAKEUP); @@ -2119,7 +2122,7 @@ void wake_up_new_task(struct task_struct *p) rq = __task_rq_lock(p); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; - trace_sched_wakeup_new(p, true); + trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); #ifdef CONFIG_SMP if (p->sched_class->task_woken) ^ permalink raw reply related [flat|nested] 19+ messages in thread
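[Editorial note] The practical payoff of the proposed sched_waking event is that it fires in the waker's context, so the task current on the emitting CPU identifies the waker. A minimal sketch of how a trace-analysis tool could exploit this to recover waker→wakee edges — the pre-parsed record layout below is hypothetical, not an actual trace format or kernel API:

```python
def waker_edges(events):
    """Map each wakee pid to the pid of its waker.

    events: hypothetical pre-parsed records with keys ts, cpu, current
    (pid of the task running on the emitting CPU), name, and pid (the
    wakee named in the event).
    """
    edges = {}
    for ev in events:
        if ev["name"] == "sched_waking":
            # sched_waking is guaranteed to fire in the waker's context,
            # so "current" on the emitting CPU is the waker.
            edges[ev["pid"]] = ev["current"]
    return edges

events = [
    {"ts": 100.0, "cpu": 0, "current": 42, "name": "sched_waking", "pid": 77},
    {"ts": 100.2, "cpu": 1, "current": 0,  "name": "sched_wakeup", "pid": 77},
]
print(waker_edges(events))  # {77: 42}
```

Note that the same pairing is impossible with sched_wakeup alone when the wakeup is queued remotely: the event then fires in the scheduler IPI, where "current" is whatever happened to run on the target CPU.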
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-09 9:13 ` Peter Zijlstra @ 2015-06-09 18:48 ` Mathieu Desnoyers 2015-06-17 18:23 ` Cong Wang 2015-08-03 17:06 ` [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint tip-bot for Peter Zijlstra 2 siblings, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2015-06-09 18:48 UTC (permalink / raw) To: Peter Zijlstra Cc: rostedt, linux-kernel, Thomas Gleixner, Ingo Molnar, Francis Giraldeau, lttng-dev, Linus Torvalds ----- On Jun 9, 2015, at 11:13 AM, Peter Zijlstra peterz@infradead.org wrote: > So how about we introduce the 'waking' tracepoint and leave the existing > wakeup one in place and preserve its woken semantics. That would work for me, but leaves me wondering how you would move to the new 'woken' name. > > Steven, can we do aliases? Where one tracepoint is known to userspace > under multiple names? In that case we could rename the thing to woken > and have an alias wakeup which we can phase out over time. In LTTng, I have an LTTNG_TRACEPOINT_EVENT_MAP() macro, which allows me to map a kernel event to my own event naming scheme. It allows me to correct namespacing of some odd kernel tracepoints from within the tracer. A similar implementation could be used to implement a tracepoint alias within the kernel: one parameter is the name as exposed to userspace, and the other is the name of the tracepoint it hooks to. You could then rename the tracepoint from 'wakeup' to 'woken', and create an alias exposing the old 'wakeup' event. > > The patch also takes away the success parameter to the tracepoint, but > does not quite go as far as actually removing it from the tracepoint > itself. Sounds like a good intermediate step. If you ensure that the alias declaration allows you to show a different set of fields for the alias, you could even remove the success parameter from 'woken', and only show it for the 'wakeup' alias. 
> > We can do that in a follow up patch which we can quickly revert if it > turns out people are actually still using that for something. Or just pull the plug on the old 'wakeup' event after a deprecation phase (to be defined). Thanks, Mathieu > > --- > include/trace/events/sched.h | 28 ++++++++++++++++++++-------- > kernel/sched/core.c | 9 ++++++--- > 2 files changed, 26 insertions(+), 11 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index d57a575fe31f..b1608d5679e2 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -55,7 +55,7 @@ TRACE_EVENT(sched_kthread_stop_ret, > */ > DECLARE_EVENT_CLASS(sched_wakeup_template, > > - TP_PROTO(struct task_struct *p, int success), > + TP_PROTO(struct task_struct *p), > > TP_ARGS(__perf_task(p), success), > > @@ -71,25 +71,37 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, > memcpy(__entry->comm, p->comm, TASK_COMM_LEN); > __entry->pid = p->pid; > __entry->prio = p->prio; > - __entry->success = success; > + __entry->success = 1; /* rudiment, kill when possible */ > __entry->target_cpu = task_cpu(p); > ), > > - TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d", > + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d", > __entry->comm, __entry->pid, __entry->prio, > - __entry->success, __entry->target_cpu) > + __entry->target_cpu) > ); > > +/* > + * Tracepoint called when waking a task; this tracepoint is guaranteed to be > + * called from the waking context. > + */ > +DEFINE_EVENT(sched_wakeup_template, sched_waking, > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > + > +/* > + * Tracepoint called when the task is actually woken; p->state == TASK_RUNNNG. > + * It it not always called from the waking context. 
> + */ > DEFINE_EVENT(sched_wakeup_template, sched_wakeup, > - TP_PROTO(struct task_struct *p, int success), > - TP_ARGS(p, success)); > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > > /* > * Tracepoint for waking up a new task: > */ > DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, > - TP_PROTO(struct task_struct *p, int success), > - TP_ARGS(p, success)); > + TP_PROTO(struct task_struct *p), > + TP_ARGS(p)); > > #ifdef CREATE_TRACE_POINTS > static inline long __trace_sched_switch_state(struct task_struct *p) > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index d5078c0f20e6..354e667620a9 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1487,9 +1487,9 @@ static void > ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) > { > check_preempt_curr(rq, p, wake_flags); > - trace_sched_wakeup(p, true); > - > p->state = TASK_RUNNING; > + trace_sched_wakeup(p); > + > #ifdef CONFIG_SMP > if (p->sched_class->task_woken) > p->sched_class->task_woken(rq, p); > @@ -1695,6 +1695,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, > int wake_flags) > goto out; > > success = 1; /* we're going to change ->state */ > + trace_sched_waking(p); > cpu = task_cpu(p); > > if (p->on_rq && ttwu_remote(p, wake_flags)) > @@ -1761,6 +1762,8 @@ static void try_to_wake_up_local(struct task_struct *p) > if (!(p->state & TASK_NORMAL)) > goto out; > > + trace_sched_waking(p); > + > if (!task_on_rq_queued(p)) > ttwu_activate(rq, p, ENQUEUE_WAKEUP); > > @@ -2119,7 +2122,7 @@ void wake_up_new_task(struct task_struct *p) > rq = __task_rq_lock(p); > activate_task(rq, p, 0); > p->on_rq = TASK_ON_RQ_QUEUED; > - trace_sched_wakeup_new(p, true); > + trace_sched_wakeup_new(p); > check_preempt_curr(rq, p, WF_FORK); > #ifdef CONFIG_SMP > if (p->sched_class->task_woken) -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-09 9:13 ` Peter Zijlstra 2015-06-09 18:48 ` Mathieu Desnoyers @ 2015-06-17 18:23 ` Cong Wang 2015-06-17 18:47 ` Steven Rostedt 2015-08-03 17:06 ` [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint tip-bot for Peter Zijlstra 2 siblings, 1 reply; 19+ messages in thread From: Cong Wang @ 2015-06-17 18:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Mathieu Desnoyers, LKML, Thomas Gleixner, Ingo Molnar, Francis Giraldeau, lttng-dev, Linus Torvalds On Tue, Jun 9, 2015 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > So how about we introduce the 'waking' tracepoint and leave the existing > wakeup one in place and preserve its woken semantics. > > Steven, can we do aliases? Where one tracepoint is known to userspace > under multiple names? In that case we could rename the thing to woken > and have an alias wakeup which we can phase out over time. > > The patch also takes away the success parameter to the tracepoint, but > does not quite go as far as actually removing it from the tracepoint > itself. > > We can do that in a follow up patch which we can quickly revert if it > turns out people are actually still using that for something. +1 to this patch. How is it going? Here at Twitter, we are analyzing scheduling latencies too, with our own tool using existing tracepoints, it would be nice to have more granularity on the scheduling latency. And, you probably want to change perf sched to respect this new 'waking' event too. ;) Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-17 18:23 ` Cong Wang @ 2015-06-17 18:47 ` Steven Rostedt 0 siblings, 0 replies; 19+ messages in thread From: Steven Rostedt @ 2015-06-17 18:47 UTC (permalink / raw) To: Cong Wang Cc: Peter Zijlstra, Mathieu Desnoyers, LKML, Thomas Gleixner, Ingo Molnar, Francis Giraldeau, lttng-dev, Linus Torvalds On Wed, 17 Jun 2015 11:23:00 -0700 Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Tue, Jun 9, 2015 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > So how about we introduce the 'waking' tracepoint and leave the existing > > wakeup one in place and preserve its woken semantics. > > > > Steven, can we do aliases? Where one tracepoint is known to userspace > > under multiple names? In that case we could rename the thing to woken > > and have an alias wakeup which we can phase out over time. > > > > The patch also takes away the success parameter to the tracepoint, but > > does not quite go as far as actually removing it from the tracepoint > > itself. > > > > We can do that in a follow up patch which we can quickly revert if it > > turns out people are actually still using that for something. > > +1 to this patch. How is it going? It's not a top priority. But it shouldn't be too hard to implement. This could be something I do after the 4.2 merge window closes. -- Steve > > Here at Twitter, we are analyzing scheduling latencies too, with our > own tool using existing tracepoints, it would be nice to have more > granularity on the scheduling latency. > > And, you probably want to change perf sched to respect this > new 'waking' event too. ;) > > Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint 2015-06-09 9:13 ` Peter Zijlstra 2015-06-09 18:48 ` Mathieu Desnoyers 2015-06-17 18:23 ` Cong Wang @ 2015-08-03 17:06 ` tip-bot for Peter Zijlstra 2 siblings, 0 replies; 19+ messages in thread From: tip-bot for Peter Zijlstra @ 2015-08-03 17:06 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, mingo, torvalds, tglx, mathieu.desnoyers, francis.giraldeau, linux-kernel, rostedt, efault, hpa Commit-ID: fbd705a0c6184580d0e2fbcbd47a37b6e5822511 Gitweb: http://git.kernel.org/tip/fbd705a0c6184580d0e2fbcbd47a37b6e5822511 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Tue, 9 Jun 2015 11:13:36 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 3 Aug 2015 12:21:22 +0200 sched: Introduce the 'trace_sched_waking' tracepoint Mathieu reported that since 317f394160e9 ("sched: Move the second half of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of context of the waker. This is a problem when you want to analyse wakeup paths because it is now very hard to correlate the wakeup event to whoever issued the wakeup. OTOH trace_sched_wakeup() is issued at the point where we set p->state = TASK_RUNNING, which is right where we hand the task off to the scheduler, so this is an important point when looking at scheduling behaviour: up to here it's been the wakeup path; everything hereafter is due to scheduler policy. To bridge this gap, introduce a second tracepoint: trace_sched_waking. It is guaranteed to be called in the waker context. 
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Francis Giraldeau <francis.giraldeau@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150609091336.GQ3644@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- include/trace/events/sched.h | 30 +++++++++++++++++++++--------- kernel/sched/core.c | 10 +++++++--- kernel/trace/trace_sched_switch.c | 2 +- kernel/trace/trace_sched_wakeup.c | 2 +- 4 files changed, 30 insertions(+), 14 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index d57a575..539d6bc 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -55,9 +55,9 @@ TRACE_EVENT(sched_kthread_stop_ret, */ DECLARE_EVENT_CLASS(sched_wakeup_template, - TP_PROTO(struct task_struct *p, int success), + TP_PROTO(struct task_struct *p), - TP_ARGS(__perf_task(p), success), + TP_ARGS(__perf_task(p)), TP_STRUCT__entry( __array( char, comm, TASK_COMM_LEN ) @@ -71,25 +71,37 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, memcpy(__entry->comm, p->comm, TASK_COMM_LEN); __entry->pid = p->pid; __entry->prio = p->prio; - __entry->success = success; + __entry->success = 1; /* rudiment, kill when possible */ __entry->target_cpu = task_cpu(p); ), - TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d", + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d", __entry->comm, __entry->pid, __entry->prio, - __entry->success, __entry->target_cpu) + __entry->target_cpu) ); +/* + * Tracepoint called when waking a task; this tracepoint is guaranteed to be + * called from the waking context. 
+ */ +DEFINE_EVENT(sched_wakeup_template, sched_waking, + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); + +/* + * Tracepoint called when the task is actually woken; p->state == TASK_RUNNNG. + * It it not always called from the waking context. + */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); /* * Tracepoint for waking up a new task: */ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new, - TP_PROTO(struct task_struct *p, int success), - TP_ARGS(p, success)); + TP_PROTO(struct task_struct *p), + TP_ARGS(p)); #ifdef CREATE_TRACE_POINTS static inline long __trace_sched_switch_state(struct task_struct *p) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48be7dc..fa5826c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1654,9 +1654,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) { check_preempt_curr(rq, p, wake_flags); - trace_sched_wakeup(p, true); - p->state = TASK_RUNNING; + trace_sched_wakeup(p); + #ifdef CONFIG_SMP if (p->sched_class->task_woken) { /* @@ -1874,6 +1874,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) if (!(p->state & state)) goto out; + trace_sched_waking(p); + success = 1; /* we're going to change ->state */ cpu = task_cpu(p); @@ -1949,6 +1951,8 @@ static void try_to_wake_up_local(struct task_struct *p) if (!(p->state & TASK_NORMAL)) goto out; + trace_sched_waking(p); + if (!task_on_rq_queued(p)) ttwu_activate(rq, p, ENQUEUE_WAKEUP); @@ -2307,7 +2311,7 @@ void wake_up_new_task(struct task_struct *p) rq = __task_rq_lock(p); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; - trace_sched_wakeup_new(p, true); + trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); #ifdef CONFIG_SMP if (p->sched_class->task_woken) diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c index 419ca37..f270088 
100644 --- a/kernel/trace/trace_sched_switch.c +++ b/kernel/trace/trace_sched_switch.c @@ -26,7 +26,7 @@ probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *n } static void -probe_sched_wakeup(void *ignore, struct task_struct *wakee, int success) +probe_sched_wakeup(void *ignore, struct task_struct *wakee) { if (unlikely(!sched_ref)) return; diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 9b33dd1..12cbe77 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -514,7 +514,7 @@ static void wakeup_reset(struct trace_array *tr) } static void -probe_wakeup(void *ignore, struct task_struct *p, int success) +probe_wakeup(void *ignore, struct task_struct *p) { struct trace_array_cpu *data; int cpu = smp_processor_id(); ^ permalink raw reply related [flat|nested] 19+ messages in thread
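[Editorial note] One user-visible consequence of the merged patch is that the sched_wakeup text output loses its success= field (the TP_printk format no longer prints it, even though the struct field remains for now). Tools that parse the human-readable trace output can stay compatible across kernel versions by treating the field as optional; a sketch, with illustrative sample lines:

```python
import re

# Parse sched_wakeup text output from either format: the "success=" key
# is present before commit fbd705a0c618 and absent after it.
PAT = re.compile(
    r"comm=(?P<comm>.+) pid=(?P<pid>\d+) prio=(?P<prio>\d+)"
    r"(?: success=(?P<success>\d+))? target_cpu=(?P<cpu>\d+)"
)

def parse_wakeup(line):
    m = PAT.search(line)
    if not m:
        return None
    d = m.groupdict()
    # Drop "success": it was hard-coded to 1 before removal anyway.
    return {"comm": d["comm"], "pid": int(d["pid"]), "prio": int(d["prio"]),
            "target_cpu": int(d["cpu"])}

old = "comm=bash pid=123 prio=120 success=1 target_cpu=002"
new = "comm=bash pid=123 prio=120 target_cpu=002"
print(parse_wakeup(new))  # {'comm': 'bash', 'pid': 123, 'prio': 120, 'target_cpu': 2}
```

Consumers that attach C probes to the tracepoint itself (as trace_sched_switch.c and trace_sched_wakeup.c above) do have to follow the prototype change, since the `int success` argument is gone from TP_PROTO.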
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 13:23 ` Mathieu Desnoyers 2015-06-06 12:02 ` Peter Zijlstra @ 2015-06-08 6:55 ` Peter Zijlstra 2015-06-09 5:53 ` Mathieu Desnoyers 1 sibling, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2015-06-08 6:55 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt, Francis Giraldeau On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote: > sched_wakeup: when try_to_wake_up{,_local} is called in the waker. > sched_activate_task: when the wakee is marked runnable. > sched_switch: when scheduling actually happens. > > We can then calculate wakeup latency as > > time@sched_activate - time@sched_wakeup One more thing, I think I would disagree with this. I would suggest never to use the 'wakeup' (or 'waking' in my proposal) for timing. I would suggest to use your interrupt tracepoint (or whatever else causes wakeup to be called for this). The wakeup times should be measured in tasktime -- of course, if interrupts/preemption are disabled then tasktime == walltime. The scheduling bit OTOH always needs to be measured in walltime, and is most affected by the presence of other tasks on the system. This too is why I'm not sure it makes sense to combine the two into a single measurement. They should be measured in different time domains. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-08 6:55 ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra @ 2015-06-09 5:53 ` Mathieu Desnoyers 0 siblings, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2015-06-09 5:53 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, rostedt, Francis Giraldeau ----- On Jun 8, 2015, at 8:55 AM, Peter Zijlstra peterz@infradead.org wrote: > On Fri, 2015-06-05 at 13:23 +0000, Mathieu Desnoyers wrote: >> sched_wakeup: when try_to_wake_up{,_local} is called in the waker. >> sched_activate_task: when the wakee is marked runnable. >> sched_switch: when scheduling actually happens. >> >> We can then calculate wakeup latency as >> >> time@sched_activate - time@sched_wakeup > > One more thing, I think I would disagree with this. I would suggest > never to use the 'wakeup' (or 'waking' in my proposal) for timing. I > would suggest to use your interrupt tracepoint (or whatever else causes > wakeup to be called for this). The nice thing about the 'waking' tracepoint is that there is only one to trace if we care about wakeup latency, and it only executes when there is an actual wake up performed. If we do care about more timing precision of the wakeup latency, we might indeed want to trace both 'waking' and whatever calls it (interrupt, set of system calls, etc.) An instrumentation of 'waking' however seems like a good approximation of the moment where the wakeup is requested by the waker. Especially if we compute critical path of a computation: then we already account for process runtime within the waker before the 'woken' event. So we don't really care about the extra precision that we would get by tracing the exact syscall entry point. In all cases we need the wakee PID awakened by 'waking', which is unavailable unless we add the 'waking' event. 
> > The wakeup times should be measured in tasktime -- of course, if > interrupts/preemption are disabled then tasktime == walltime. > > The scheduling bit OTOH always needs to be measured in walltime, and is > most affected by the presence of other tasks on the system. > > This too is why I'm not sure it makes sense to combine the two into a > single measurement. They should be measured in different time domains. That's a very good point! Agreed. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
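[Editorial note] With both events available, the two measurements Peter and Mathieu agree to keep separate can be computed independently: the waking→wakeup delta captures the queueing/IPI delay, and the wakeup→switch-in delta captures the scheduling latency proper. An illustrative sketch over fabricated timestamps in nanoseconds ('sched_switch_in' is shorthand for the sched_switch event where the task switches in, not a real event name):

```python
def latencies(events, pid):
    """Return (waking->wakeup delay, wakeup->running delay) for one pid."""
    t = {}
    for ts, name, p in events:
        if p == pid:
            t.setdefault(name, ts)  # keep the first occurrence of each event
    queue_delay = t["sched_wakeup"] - t["sched_waking"]        # pending-queue/IPI time
    sched_latency = t["sched_switch_in"] - t["sched_wakeup"]   # runnable -> running
    return queue_delay, sched_latency

trace = [
    (100_000, "sched_waking", 77),     # waker requests the wakeup
    (100_300, "sched_wakeup", 77),     # wakee marked TASK_RUNNING
    (101_000, "sched_switch_in", 77),  # wakee actually gets the CPU
]
print(latencies(trace, 77))  # (300, 700)
```

As the discussion notes, the two numbers live in different time domains — the first is attributable to the wakeup path, the second to scheduler policy and competing load — so a tool should report them separately rather than as one sum.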
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:09 ` Peter Zijlstra 2015-06-05 12:32 ` Mathieu Desnoyers @ 2015-06-05 12:32 ` Thomas Gleixner 2015-06-05 12:36 ` Mathieu Desnoyers 2015-06-05 12:46 ` Peter Zijlstra 1 sibling, 2 replies; 19+ messages in thread From: Thomas Gleixner @ 2015-06-05 12:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Steven Rostedt, Francis Giraldeau On Fri, 5 Jun 2015, Peter Zijlstra wrote: > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote: > > Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu" > > moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for > > remote wakeups. This commit appeared upstream in Linux v3.0. > > > > Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup" > > tracepoint. Analyzing wakup latencies depends on getting the wakeup > > chain right: which process is the waker, which is the wakee. Moving this > > instrumention outside of the waker context prevents trace analysis tools > > from getting the waker pid, either through "current" in the tracepoint > > probe, or by deducing it using other scheduler events based on the CPU > > executing the tracepoint. > > > > Another side-effect of moving this instrumentation to the scheduler ipi > > is that the delay during which the wakeup is sitting in the pending > > queue is not accounted for when calculating wakeup latency. > > > > Therefore, move the sched_wakeup instrumentation back to the waker > > context to fix those two shortcomings. > > What do you consider wakeup-latency? I don't see how moving the > tracepoint into the caller will magically account the queue time. Well, the point of wakeup is when the wakee calls wakeup. If the trace point is in the IPI then you account the time between the wakeup and the actuall handling in the IPI to the wakee instead of accounting it to the time between wakeup and sched switch. 
Thanks, tglx
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:32 ` Thomas Gleixner @ 2015-06-05 12:36 ` Mathieu Desnoyers 2015-06-05 12:46 ` Peter Zijlstra 1 sibling, 0 replies; 19+ messages in thread From: Mathieu Desnoyers @ 2015-06-05 12:36 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, rostedt, Giraldeau, Francis ----- On Jun 5, 2015, at 2:32 PM, Thomas Gleixner tglx@linutronix.de wrote: > On Fri, 5 Jun 2015, Peter Zijlstra wrote: >> On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote: >> > Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu" >> > moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for >> > remote wakeups. This commit appeared upstream in Linux v3.0. >> > >> > Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup" >> > tracepoint. Analyzing wakup latencies depends on getting the wakeup >> > chain right: which process is the waker, which is the wakee. Moving this >> > instrumention outside of the waker context prevents trace analysis tools >> > from getting the waker pid, either through "current" in the tracepoint >> > probe, or by deducing it using other scheduler events based on the CPU >> > executing the tracepoint. >> > >> > Another side-effect of moving this instrumentation to the scheduler ipi >> > is that the delay during which the wakeup is sitting in the pending >> > queue is not accounted for when calculating wakeup latency. >> > >> > Therefore, move the sched_wakeup instrumentation back to the waker >> > context to fix those two shortcomings. >> >> What do you consider wakeup-latency? I don't see how moving the >> tracepoint into the caller will magically account the queue time. > > Well, the point of wakeup is when the wakee calls wakeup. If the trace ^ I think you actually mean "when the waker calls wakeup". 
> point is in the IPI then you account the time between the wakeup and > the actuall handling in the IPI to the wakee instead of accounting it > to the time between wakeup and sched switch. > Thanks, Mathieu > Thanks, > > tglx -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:32 ` Thomas Gleixner 2015-06-05 12:36 ` Mathieu Desnoyers @ 2015-06-05 12:46 ` Peter Zijlstra 2015-06-08 16:54 ` Steven Rostedt 1 sibling, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2015-06-05 12:46 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Steven Rostedt, Francis Giraldeau On Fri, 2015-06-05 at 14:32 +0200, Thomas Gleixner wrote: > On Fri, 5 Jun 2015, Peter Zijlstra wrote: > > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote: > > > Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu" > > > moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for > > > remote wakeups. This commit appeared upstream in Linux v3.0. > > > > > > Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup" > > > tracepoint. Analyzing wakup latencies depends on getting the wakeup > > > chain right: which process is the waker, which is the wakee. Moving this > > > instrumention outside of the waker context prevents trace analysis tools > > > from getting the waker pid, either through "current" in the tracepoint > > > probe, or by deducing it using other scheduler events based on the CPU > > > executing the tracepoint. > > > > > > Another side-effect of moving this instrumentation to the scheduler ipi > > > is that the delay during which the wakeup is sitting in the pending > > > queue is not accounted for when calculating wakeup latency. > > > > > > Therefore, move the sched_wakeup instrumentation back to the waker > > > context to fix those two shortcomings. > > > > What do you consider wakeup-latency? I don't see how moving the > > tracepoint into the caller will magically account the queue time. > > Well, the point of wakeup is when the wakee calls wakeup. 
If the trace > point is in the IPI then you account the time between the wakeup and > the actuall handling in the IPI to the wakee instead of accounting it > to the time between wakeup and sched switch. My point exactly, wake->schedule is what we call the scheduling latency, not the wake latency, which would be from 'event' to the task being runnable. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint 2015-06-05 12:46 ` Peter Zijlstra @ 2015-06-08 16:54 ` Steven Rostedt 0 siblings, 0 replies; 19+ messages in thread From: Steven Rostedt @ 2015-06-08 16:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Thomas Gleixner, Mathieu Desnoyers, linux-kernel, Ingo Molnar, Francis Giraldeau On Fri, 05 Jun 2015 14:46:46 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > to the time between wakeup and sched switch. > > My point exactly, wake->schedule is what we call the scheduling latency, > not the wake latency, which would be from 'event' to the task being > runnable. Right, which means the tracepoint should be as close to the "wakeup_process()" or whatever calls try_to_wake_up(). Which would be the "sched_waking" part of your proposal. -- Steve ^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages
2015-06-05 11:41 [RFC PATCH] sched: Fix sched_wakeup tracepoint Mathieu Desnoyers
2015-06-05 12:09 ` Peter Zijlstra
2015-06-05 12:32   ` Mathieu Desnoyers
2015-06-05 12:51     ` Peter Zijlstra
2015-06-05 13:23       ` Mathieu Desnoyers
2015-06-06 12:02         ` Peter Zijlstra
2015-06-07 10:20           ` Mathieu Desnoyers
2015-06-08 17:27             ` Steven Rostedt
2015-06-09  9:13               ` Peter Zijlstra
2015-06-09 18:48                 ` Mathieu Desnoyers
2015-06-17 18:23                 ` Cong Wang
2015-06-17 18:47                   ` Steven Rostedt
2015-08-03 17:06                 ` [tip:sched/core] sched: Introduce the 'trace_sched_waking' tracepoint tip-bot for Peter Zijlstra
2015-06-08  6:55         ` [RFC PATCH] sched: Fix sched_wakeup tracepoint Peter Zijlstra
2015-06-09  5:53           ` Mathieu Desnoyers
2015-06-05 12:32   ` Thomas Gleixner
2015-06-05 12:36     ` Mathieu Desnoyers
2015-06-05 12:46     ` Peter Zijlstra
2015-06-08 16:54       ` Steven Rostedt