* [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
@ 2014-12-10 22:44 Steven Rostedt
2014-12-11 4:58 ` Steven Rostedt
2014-12-11 6:38 ` Ingo Molnar
0 siblings, 2 replies; 7+ messages in thread
From: Steven Rostedt @ 2014-12-10 22:44 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Andrew Morton, Thomas Gleixner
When recording the state of a task for the sched_switch tracepoint a check of
task_preempt_count() is performed to see if PREEMPT_ACTIVE is set. This is
because, technically, a task being preempted is really in the TASK_RUNNING
state, and that is what should be recorded when tracing a sched_switch,
even if the task put itself into another state (it hasn't scheduled out
in that state yet).
But with the change to use per_cpu preempt counts, the
task_thread_info(p)->preempt_count is no longer used, and instead
task_preempt_count(p) is used.
The problem is that this does not use the current preempt count but a
stale one from a previous sched_switch. The task_preempt_count(p) uses
saved_preempt_count and not preempt_count(). But for tracing
sched_switch, if p is current, we really want preempt_count().
I hit this bug when I was tracing sleep and the call from do_nanosleep()
scheduled out in the "RUNNING" state.
sleep-4290 [000] 537272.259992: sched_switch: sleep:4290 [120] R ==> swapper/0:0 [120]
sleep-4290 [000] 537272.260015: kernel_stack: <stack trace>
=> __schedule (ffffffff8150864a)
=> schedule (ffffffff815089f8)
=> do_nanosleep (ffffffff8150b76c)
=> hrtimer_nanosleep (ffffffff8108d66b)
=> SyS_nanosleep (ffffffff8108d750)
=> return_to_handler (ffffffff8150e8e5)
=> tracesys_phase2 (ffffffff8150c844)
After a bit of hair pulling, I found that the state was really
TASK_INTERRUPTIBLE, but the saved_preempt_count had an old PREEMPT_ACTIVE
set and caused the sched_switch tracepoint to show it as RUNNING.
Cc: stable@vger.kernel.org # 3.13+
Fixes: 01028747559a "sched: Create more preempt_count accessors"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/trace/events/sched.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5ae584e..13fbadcc172b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -97,10 +97,14 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
long state = p->state;
#ifdef CONFIG_PREEMPT
+ unsigned long pc;
+
+ pc = (p == current) ? preempt_count() : task_preempt_count(p);
+
/*
* For all intents and purposes a preempted task is a running task.
*/
- if (task_preempt_count(p) & PREEMPT_ACTIVE)
+ if (pc & PREEMPT_ACTIVE)
state = TASK_RUNNING | TASK_STATE_MAX;
#endif
--
1.8.1.4
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
2014-12-10 22:44 [PATCH] tracing/sched: Check preempt_count() for current when reading task->state Steven Rostedt
@ 2014-12-11 4:58 ` Steven Rostedt
2014-12-11 6:38 ` Ingo Molnar
1 sibling, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2014-12-11 4:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Andrew Morton, Thomas Gleixner
Peter,
If you give me your ack, I can pull this through my tree. Otherwise, it
can go through tip. I just kicked off my test suite to test it
overnight.
-- Steve
On Wed, 10 Dec 2014 17:44:28 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
>
> When recording the state of a task for the sched_switch tracepoint a check of
> task_preempt_count() is performed to see if PREEMPT_ACTIVE is set. This is
> because, technically, a task being preempted is really in the TASK_RUNNING
> state, and that is what should be recorded when tracing a sched_switch,
> even if the task put itself into another state (it hasn't scheduled out
> in that state yet).
>
> But with the change to use per_cpu preempt counts, the
> task_thread_info(p)->preempt_count is no longer used, and instead
> task_preempt_count(p) is used.
>
> The problem is that this does not use the current preempt count but a
> stale one from a previous sched_switch. The task_preempt_count(p) uses
> saved_preempt_count and not preempt_count(). But for tracing
> sched_switch, if p is current, we really want preempt_count().
>
> I hit this bug when I was tracing sleep and the call from do_nanosleep()
> scheduled out in the "RUNNING" state.
>
> sleep-4290 [000] 537272.259992: sched_switch: sleep:4290 [120] R ==> swapper/0:0 [120]
> sleep-4290 [000] 537272.260015: kernel_stack: <stack trace>
> => __schedule (ffffffff8150864a)
> => schedule (ffffffff815089f8)
> => do_nanosleep (ffffffff8150b76c)
> => hrtimer_nanosleep (ffffffff8108d66b)
> => SyS_nanosleep (ffffffff8108d750)
> => return_to_handler (ffffffff8150e8e5)
> => tracesys_phase2 (ffffffff8150c844)
>
> After a bit of hair pulling, I found that the state was really
> TASK_INTERRUPTIBLE, but the saved_preempt_count had an old PREEMPT_ACTIVE
> set and caused the sched_switch tracepoint to show it as RUNNING.
>
> Cc: stable@vger.kernel.org # 3.13+
> Fixes: 01028747559a "sched: Create more preempt_count accessors"
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> include/trace/events/sched.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 0a68d5ae584e..13fbadcc172b 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -97,10 +97,14 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
> long state = p->state;
>
> #ifdef CONFIG_PREEMPT
> + unsigned long pc;
> +
> + pc = (p == current) ? preempt_count() : task_preempt_count(p);
> +
> /*
> * For all intents and purposes a preempted task is a running task.
> */
> - if (task_preempt_count(p) & PREEMPT_ACTIVE)
> + if (pc & PREEMPT_ACTIVE)
> state = TASK_RUNNING | TASK_STATE_MAX;
> #endif
>
* Re: [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
2014-12-10 22:44 [PATCH] tracing/sched: Check preempt_count() for current when reading task->state Steven Rostedt
2014-12-11 4:58 ` Steven Rostedt
@ 2014-12-11 6:38 ` Ingo Molnar
2014-12-11 11:37 ` Steven Rostedt
1 sibling, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2014-12-11 6:38 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner
* Steven Rostedt <rostedt@goodmis.org> wrote:
>
> When recording the state of a task for the sched_switch tracepoint a check of
> task_preempt_count() is performed to see if PREEMPT_ACTIVE is set. This is
> because, technically, a task being preempted is really in the TASK_RUNNING
> state, and that is what should be recorded when tracing a sched_switch,
> even if the task put itself into another state (it hasn't scheduled out
> in that state yet).
>
> But with the change to use per_cpu preempt counts, the
> task_thread_info(p)->preempt_count is no longer used, and instead
> task_preempt_count(p) is used.
>
> The problem is that this does not use the current preempt count but a
> stale one from a previous sched_switch. The task_preempt_count(p) uses
> saved_preempt_count and not preempt_count(). But for tracing
> sched_switch, if p is current, we really want preempt_count().
>
> I hit this bug when I was tracing sleep and the call from do_nanosleep()
> scheduled out in the "RUNNING" state.
>
> sleep-4290 [000] 537272.259992: sched_switch: sleep:4290 [120] R ==> swapper/0:0 [120]
> sleep-4290 [000] 537272.260015: kernel_stack: <stack trace>
> => __schedule (ffffffff8150864a)
> => schedule (ffffffff815089f8)
> => do_nanosleep (ffffffff8150b76c)
> => hrtimer_nanosleep (ffffffff8108d66b)
> => SyS_nanosleep (ffffffff8108d750)
> => return_to_handler (ffffffff8150e8e5)
> => tracesys_phase2 (ffffffff8150c844)
>
> After a bit of hair pulling, I found that the state was really
> TASK_INTERRUPTIBLE, but the saved_preempt_count had an old PREEMPT_ACTIVE
> set and caused the sched_switch tracepoint to show it as RUNNING.
>
> Cc: stable@vger.kernel.org # 3.13+
> Fixes: 01028747559a "sched: Create more preempt_count accessors"
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> include/trace/events/sched.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 0a68d5ae584e..13fbadcc172b 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -97,10 +97,14 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
> long state = p->state;
>
> #ifdef CONFIG_PREEMPT
> + unsigned long pc;
> +
> + pc = (p == current) ? preempt_count() : task_preempt_count(p);
> +
> /*
> * For all intents and purposes a preempted task is a running task.
> */
> - if (task_preempt_count(p) & PREEMPT_ACTIVE)
> + if (pc & PREEMPT_ACTIVE)
> state = TASK_RUNNING | TASK_STATE_MAX;
I really don't like the overhead around here.
Thanks,
Ingo
* Re: [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
2014-12-11 6:38 ` Ingo Molnar
@ 2014-12-11 11:37 ` Steven Rostedt
2014-12-11 12:31 ` Ingo Molnar
0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2014-12-11 11:37 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner
On Thu, 11 Dec 2014 07:38:11 +0100
Ingo Molnar <mingo@kernel.org> wrote:
> > Cc: stable@vger.kernel.org # 3.13+
> > Fixes: 01028747559a "sched: Create more preempt_count accessors"
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> > include/trace/events/sched.h | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index 0a68d5ae584e..13fbadcc172b 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -97,10 +97,14 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
> > long state = p->state;
> >
> > #ifdef CONFIG_PREEMPT
> > + unsigned long pc;
> > +
> > + pc = (p == current) ? preempt_count() : task_preempt_count(p);
> > +
> > /*
> > * For all intents and purposes a preempted task is a running task.
> > */
> > - if (task_preempt_count(p) & PREEMPT_ACTIVE)
> > + if (pc & PREEMPT_ACTIVE)
> > state = TASK_RUNNING | TASK_STATE_MAX;
>
> I really don't like the overhead around here.
Hi Ingo!
What overhead are you worried about? Note, this is in the schedule
tracepoint and does not affect the scheduler itself (as long as the
tracepoint is not enabled).
I'm also thinking that as long as "prev" is always guaranteed to be
"current" we can remove the check and just use preempt_count() always.
But I'm worried that we can't guarantee that.
What other ideas do you have? Because wrong data is worse than the
overhead of the above code. If Thomas taught me anything, it's that!
-- Steve
* Re: [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
2014-12-11 11:37 ` Steven Rostedt
@ 2014-12-11 12:31 ` Ingo Molnar
2014-12-11 14:17 ` Steven Rostedt
0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2014-12-11 12:31 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner
* Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 11 Dec 2014 07:38:11 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
>
> > > Cc: stable@vger.kernel.org # 3.13+
> > > Fixes: 01028747559a "sched: Create more preempt_count accessors"
> > > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > > ---
> > > include/trace/events/sched.h | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > > index 0a68d5ae584e..13fbadcc172b 100644
> > > --- a/include/trace/events/sched.h
> > > +++ b/include/trace/events/sched.h
> > > @@ -97,10 +97,14 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
> > > long state = p->state;
> > >
> > > #ifdef CONFIG_PREEMPT
> > > + unsigned long pc;
> > > +
> > > + pc = (p == current) ? preempt_count() : task_preempt_count(p);
> > > +
> > > /*
> > > * For all intents and purposes a preempted task is a running task.
> > > */
> > > - if (task_preempt_count(p) & PREEMPT_ACTIVE)
> > > + if (pc & PREEMPT_ACTIVE)
> > > state = TASK_RUNNING | TASK_STATE_MAX;
> >
> > I really don't like the overhead around here.
>
> Hi Ingo!
>
> What overhead are you worried about? Note, this is in the
> schedule tracepoint and does not affect the scheduler itself
> (as long as the tracepoint is not enabled).
Scheduler tracepoints are pretty popular, so I'm worried about
their complexity when they are activated.
> I'm also thinking that as long as "prev" is always guaranteed
> to be "current" we can remove the check and just use
> preempt_count() always. But I'm worried that we can't
> guarantee that.
You could add a WARN_ON_ONCE() or so to double check that
assumption?
> What other ideas do you have? Because wrong data is worse than
> the overhead of the above code. If Thomas taught me anything,
> it's that!
My idea is to have simpler, yet correct code. And ponies!
Thanks,
Ingo
* Re: [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
2014-12-11 12:31 ` Ingo Molnar
@ 2014-12-11 14:17 ` Steven Rostedt
2014-12-11 16:50 ` Ingo Molnar
0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2014-12-11 14:17 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner
On Thu, 11 Dec 2014 13:31:21 +0100
Ingo Molnar <mingo@kernel.org> wrote:
> > What overhead are you worried about? Note, this is in the
> > schedule tracepoint and does not affect the scheduler itself
> > (as long as the tracepoint is not enabled).
>
> Scheduler tracepoints are pretty popular, so I'm worried about
> their complexity when they are activated.
Understood.
>
> > I'm also thinking that as long as "prev" is always guaranteed
> > to be "current" we can remove the check and just use
> > preempt_count() always. But I'm worried that we can't
> > guarantee that.
>
> You could add a WARN_ON_ONCE() or so to double check that
> assumption?
I actually thought about that, but that gives us the same overhead as
the branch we want to remove.
But if you are going for simpler, then that would make sense.
>
> > What other ideas do you have? Because wrong data is worse than
> > the overhead of the above code. If Thomas taught me anything,
> > it's that!
>
> My idea is to have simpler, yet correct code. And ponies!
So something like this instead?
-- Steve
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5ae584e..782018b135ff 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -97,10 +97,12 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
long state = p->state;
#ifdef CONFIG_PREEMPT
+ WARN_ON_ONCE(p != current);
+
/*
* For all intents and purposes a preempted task is a running task.
*/
- if (task_preempt_count(p) & PREEMPT_ACTIVE)
+ if (preempt_count() & PREEMPT_ACTIVE)
state = TASK_RUNNING | TASK_STATE_MAX;
#endif
* Re: [PATCH] tracing/sched: Check preempt_count() for current when reading task->state
2014-12-11 14:17 ` Steven Rostedt
@ 2014-12-11 16:50 ` Ingo Molnar
0 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2014-12-11 16:50 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, LKML, Andrew Morton, Thomas Gleixner
* Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 11 Dec 2014 13:31:21 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
>
>
> > > What overhead are you worried about? Note, this is in the
> > > schedule tracepoint and does not affect the scheduler itself
> > > (as long as the tracepoint is not enabled).
> >
> > Scheduler tracepoints are pretty popular, so I'm worried about
> > their complexity when they are activated.
>
> Understood.
>
> >
> > > I'm also thinking that as long as "prev" is always guaranteed
> > > to be "current" we can remove the check and just use
> > > preempt_count() always. But I'm worried that we can't
> > > guarantee that.
> >
> > You could add a WARN_ON_ONCE() or so to double check that
> > assumption?
>
> I actually thought about that, but that gives us the same overhead as
> the branch we want to remove.
>
> But if you are going for simpler, then that would make sense.
>
> >
> > > What other ideas do you have? Because wrong data is worse than
> > > the overhead of the above code. If Thomas taught me anything,
> > > it's that!
> >
> > My idea is to have simpler, yet correct code. And ponies!
>
>
> So something like this instead?
>
> -- Steve
>
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 0a68d5ae584e..782018b135ff 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -97,10 +97,12 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
> long state = p->state;
>
> #ifdef CONFIG_PREEMPT
> + WARN_ON_ONCE(p != current);
> +
> /*
> * For all intents and purposes a preempted task is a running task.
> */
> - if (task_preempt_count(p) & PREEMPT_ACTIVE)
> + if (preempt_count() & PREEMPT_ACTIVE)
> state = TASK_RUNNING | TASK_STATE_MAX;
Yeah, that looks a lot better IMHO; 'p' is supposed to be the
current task, at least on a booted-up system with a working
scheduler. Not sure about transient initialization states such as
very early boot and idle thread initialization - but it might
work out for them as well.
If the WARN_ON_ONCE() remains silent on your testbox then I'd
suggest removing the WARN_ON_ONCE(). The change looks good to me:
Acked-by: Ingo Molnar <mingo@kernel.org>
Thanks,
Ingo
end of thread, other threads:[~2014-12-11 16:50 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-10 22:44 [PATCH] tracing/sched: Check preempt_count() for current when reading task->state Steven Rostedt
2014-12-11 4:58 ` Steven Rostedt
2014-12-11 6:38 ` Ingo Molnar
2014-12-11 11:37 ` Steven Rostedt
2014-12-11 12:31 ` Ingo Molnar
2014-12-11 14:17 ` Steven Rostedt
2014-12-11 16:50 ` Ingo Molnar