* [RFC PATCH] tracing: add sched_prio_update
@ 2016-07-04 19:46 Julien Desfossez
2016-07-05 15:19 ` Steven Rostedt
2016-08-11 10:43 ` Peter Zijlstra
0 siblings, 2 replies; 8+ messages in thread
From: Julien Desfossez @ 2016-07-04 19:46 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
This tracepoint allows to keep track of all priority changes of a task.
It outputs the scheduling policy, the nice value, the rt_priority and
the deadline-related attributes (dl_runtime, dl_deadline and dl_period).
It is emitted in the code path of the sched_setscheduler, sched_setattr,
sched_setparam, nice and the fork system calls. For fork, it is emitted
after the sched_process_fork tracepoint for timeline consistency and
because the PID is not yet set when sched_fork() is called.
This allows the analysis of real-time scheduling delays based on the
configured scheduling priorities and policies, which cannot be performed
with the current instrumentation in sched_switch. Also, instead of
exposing the internal kernel prio field, this tracepoint only outputs
the user-visible priority attributes.
The effective priority of running threads can also be temporarily
changed in the PI code, but a dedicated tracepoint is already in place
to cover this case.
Here are a few output examples:
After fork of a normal task:
sched_prio_update: comm=bash pid=2104, policy=SCHED_NORMAL, nice=0,
rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
renice -n 10 of a normal task:
sched_prio_update: comm=sleep pid=2130, policy=SCHED_NORMAL, nice=10,
rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
SCHED_FIFO 60:
sched_prio_update: comm=chrt pid=2105, policy=SCHED_FIFO, nice=0,
rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
SCHED_RR 60:
sched_prio_update: comm=chrt pid=2109, policy=SCHED_RR, nice=0,
rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
SCHED_DEADLINE:
sched_prio_update: comm=b pid=2110, policy=SCHED_DEADLING, nice=0,
rt_priority=0, dl_runtime=10000000, dl_deadline=30000000,
dl_period=30000000
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
include/trace/events/sched.h | 68 ++++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 1 +
kernel/sched/core.c | 3 ++
3 files changed, 72 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..fcb0f29 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -8,6 +8,34 @@
#include <linux/tracepoint.h>
#include <linux/binfmts.h>
+#define SCHEDULING_POLICY \
+ EM( SCHED_NORMAL, "SCHED_NORMAL") \
+ EM( SCHED_FIFO, "SCHED_FIFO") \
+ EM( SCHED_RR, "SCHED_RR") \
+ EM( SCHED_BATCH, "SCHED_BATCH") \
+ EM( SCHED_IDLE, "SCHED_IDLE") \
+ EMe(SCHED_DEADLINE, "SCHED_DEADLINE")
+
+/*
+ * First define the enums in the above macros to be exported to userspace
+ * via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b) TRACE_DEFINE_ENUM(a);
+
+SCHEDULING_POLICY
+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) {a, b},
+#define EMe(a, b) {a, b}
+
/*
* Tracepoint for calling kthread_stop, performed to end a kthread:
*/
@@ -562,6 +590,46 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu)
);
+
+/*
+ * Tracepoint for showing scheduling priority changes.
+ */
+TRACE_EVENT(sched_prio_update,
+
+ TP_PROTO(struct task_struct *tsk),
+
+ TP_ARGS(tsk),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned int, policy )
+ __field( int, nice )
+ __field( unsigned int, rt_priority )
+ __field( u64, dl_runtime )
+ __field( u64, dl_deadline )
+ __field( u64, dl_period )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->policy = tsk->policy;
+ __entry->nice = task_nice(tsk);
+ __entry->rt_priority = tsk->rt_priority;
+ __entry->dl_runtime = tsk->dl.dl_runtime;
+ __entry->dl_deadline = tsk->dl.dl_deadline;
+ __entry->dl_period = tsk->dl.dl_period;
+ ),
+
+ TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
+ "dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
+ __entry->comm, __entry->pid,
+ __print_symbolic(__entry->policy, SCHEDULING_POLICY),
+ __entry->nice, __entry->rt_priority,
+ __entry->dl_runtime, __entry->dl_deadline,
+ __entry->dl_period)
+);
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7926993..ac4294a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
struct pid *pid;
trace_sched_process_fork(current, p);
+ trace_sched_prio_update(p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ce83e39..c729425 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3708,6 +3708,7 @@ void set_user_nice(struct task_struct *p, long nice)
resched_curr(rq);
}
out_unlock:
+ trace_sched_prio_update(p);
task_rq_unlock(rq, p, &rf);
}
EXPORT_SYMBOL(set_user_nice);
@@ -3912,6 +3913,8 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
p->sched_class = &rt_sched_class;
else
p->sched_class = &fair_sched_class;
+
+ trace_sched_prio_update(p);
}
static void
--
1.9.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-04 19:46 [RFC PATCH] tracing: add sched_prio_update Julien Desfossez
@ 2016-07-05 15:19 ` Steven Rostedt
2016-07-05 21:50 ` Mathieu Desnoyers
2016-08-11 10:43 ` Peter Zijlstra
1 sibling, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2016-07-05 15:19 UTC (permalink / raw)
To: Julien Desfossez
Cc: peterz, tglx, mingo, daolivei, mathieu.desnoyers, linux-kernel
On Mon, 4 Jul 2016 15:46:04 -0400
Julien Desfossez <jdesfossez@efficios.com> wrote:
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 9b90c57..fcb0f29 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -8,6 +8,34 @@
> #include <linux/tracepoint.h>
> #include <linux/binfmts.h>
>
> +#define SCHEDULING_POLICY \
> + EM( SCHED_NORMAL, "SCHED_NORMAL") \
> + EM( SCHED_FIFO, "SCHED_FIFO") \
> + EM( SCHED_RR, "SCHED_RR") \
> + EM( SCHED_BATCH, "SCHED_BATCH") \
> + EM( SCHED_IDLE, "SCHED_IDLE") \
> + EMe(SCHED_DEADLINE, "SCHED_DEADLINE")
> +
> +/*
> + * First define the enums in the above macros to be exported to userspace
> + * via TRACE_DEFINE_ENUM().
> + */
> +#undef EM
> +#undef EMe
> +#define EM(a, b) TRACE_DEFINE_ENUM(a);
> +#define EMe(a, b) TRACE_DEFINE_ENUM(a);
> +
> +SCHEDULING_POLICY
> +
> +/*
> + * Now redefine the EM() and EMe() macros to map the enums to the strings
> + * that will be printed in the output.
> + */
> +#undef EM
> +#undef EMe
> +#define EM(a, b) {a, b},
> +#define EMe(a, b) {a, b}
> +
> /*
> * Tracepoint for calling kthread_stop, performed to end a kthread:
> */
> @@ -562,6 +590,46 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
>
> TP_printk("cpu=%d", __entry->cpu)
> );
> +
> +/*
> + * Tracepoint for showing scheduling priority changes.
> + */
> +TRACE_EVENT(sched_prio_update,
I'm fine with the addition of this tracepoint. You'll have to get by
Peter Zijlstra for it.
> +
> + TP_PROTO(struct task_struct *tsk),
> +
> + TP_ARGS(tsk),
> +
> + TP_STRUCT__entry(
> + __array( char, comm, TASK_COMM_LEN )
I could imagine this being a high frequency tracepoint, especially with
a lot of boosting going on. Can we nuke the comm recording and let the
userspace tools just hook to the sched_switch tracepoint for that?
-- Steve
> + __field( pid_t, pid )
> + __field( unsigned int, policy )
> + __field( int, nice )
> + __field( unsigned int, rt_priority )
> + __field( u64, dl_runtime )
> + __field( u64, dl_deadline )
> + __field( u64, dl_period )
> + ),
> +
> + TP_fast_assign(
> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> + __entry->pid = tsk->pid;
> + __entry->policy = tsk->policy;
> + __entry->nice = task_nice(tsk);
> + __entry->rt_priority = tsk->rt_priority;
> + __entry->dl_runtime = tsk->dl.dl_runtime;
> + __entry->dl_deadline = tsk->dl.dl_deadline;
> + __entry->dl_period = tsk->dl.dl_period;
> + ),
> +
> + TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
> + "dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
> + __entry->comm, __entry->pid,
> + __print_symbolic(__entry->policy, SCHEDULING_POLICY),
> + __entry->nice, __entry->rt_priority,
> + __entry->dl_runtime, __entry->dl_deadline,
> + __entry->dl_period)
> +);
> #endif /* _TRACE_SCHED_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 7926993..ac4294a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
> struct pid *pid;
>
> trace_sched_process_fork(current, p);
> + trace_sched_prio_update(p);
>
> pid = get_task_pid(p, PIDTYPE_PID);
> nr = pid_vnr(pid);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ce83e39..c729425 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3708,6 +3708,7 @@ void set_user_nice(struct task_struct *p, long nice)
> resched_curr(rq);
> }
> out_unlock:
> + trace_sched_prio_update(p);
> task_rq_unlock(rq, p, &rf);
> }
> EXPORT_SYMBOL(set_user_nice);
> @@ -3912,6 +3913,8 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
> p->sched_class = &rt_sched_class;
> else
> p->sched_class = &fair_sched_class;
> +
> + trace_sched_prio_update(p);
> }
>
> static void
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-05 15:19 ` Steven Rostedt
@ 2016-07-05 21:50 ` Mathieu Desnoyers
2016-07-06 13:13 ` Steven Rostedt
0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2016-07-05 21:50 UTC (permalink / raw)
To: rostedt
Cc: Julien Desfossez, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
daolivei, linux-kernel
----- On Jul 5, 2016, at 11:19 AM, rostedt rostedt@goodmis.org wrote:
> On Mon, 4 Jul 2016 15:46:04 -0400
> Julien Desfossez <jdesfossez@efficios.com> wrote:
>
>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 9b90c57..fcb0f29 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -8,6 +8,34 @@
>> #include <linux/tracepoint.h>
>> #include <linux/binfmts.h>
>>
>> +#define SCHEDULING_POLICY \
>> + EM( SCHED_NORMAL, "SCHED_NORMAL") \
>> + EM( SCHED_FIFO, "SCHED_FIFO") \
>> + EM( SCHED_RR, "SCHED_RR") \
>> + EM( SCHED_BATCH, "SCHED_BATCH") \
>> + EM( SCHED_IDLE, "SCHED_IDLE") \
>> + EMe(SCHED_DEADLINE, "SCHED_DEADLINE")
>> +
>> +/*
>> + * First define the enums in the above macros to be exported to userspace
>> + * via TRACE_DEFINE_ENUM().
>> + */
>> +#undef EM
>> +#undef EMe
>> +#define EM(a, b) TRACE_DEFINE_ENUM(a);
>> +#define EMe(a, b) TRACE_DEFINE_ENUM(a);
>> +
>> +SCHEDULING_POLICY
>> +
>> +/*
>> + * Now redefine the EM() and EMe() macros to map the enums to the strings
>> + * that will be printed in the output.
>> + */
>> +#undef EM
>> +#undef EMe
>> +#define EM(a, b) {a, b},
>> +#define EMe(a, b) {a, b}
>> +
>> /*
>> * Tracepoint for calling kthread_stop, performed to end a kthread:
>> */
>> @@ -562,6 +590,46 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
>>
>> TP_printk("cpu=%d", __entry->cpu)
>> );
>> +
>> +/*
>> + * Tracepoint for showing scheduling priority changes.
>> + */
>> +TRACE_EVENT(sched_prio_update,
>
> I'm fine with the addition of this tracepoint. You'll have to get by
> Peter Zijlstra for it.
Great!
>
>> +
>> + TP_PROTO(struct task_struct *tsk),
>> +
>> + TP_ARGS(tsk),
>> +
>> + TP_STRUCT__entry(
>> + __array( char, comm, TASK_COMM_LEN )
>
> I could imagine this being a high frequency tracepoint, especially with
> a lot of boosting going on. Can we nuke the comm recording and let the
> userspace tools just hook to the sched_switch tracepoint for that?
We can surely do that.
Just to clarify: currently this tracepoint is *not* hooked on PI boosting,
as described in the changelog. This tracepoint is about the prio attributes
set by user-space. The PI boosting temporarily changes the task struct prio
without updating the associated policy, which seems rather
implementation-specific and odd to expose.
Thoughts ?
Thanks,
Mathieu
>
> -- Steve
>
>
>> + __field( pid_t, pid )
>> + __field( unsigned int, policy )
>> + __field( int, nice )
>> + __field( unsigned int, rt_priority )
>> + __field( u64, dl_runtime )
>> + __field( u64, dl_deadline )
>> + __field( u64, dl_period )
>> + ),
>> +
>> + TP_fast_assign(
>> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> + __entry->pid = tsk->pid;
>> + __entry->policy = tsk->policy;
>> + __entry->nice = task_nice(tsk);
>> + __entry->rt_priority = tsk->rt_priority;
>> + __entry->dl_runtime = tsk->dl.dl_runtime;
>> + __entry->dl_deadline = tsk->dl.dl_deadline;
>> + __entry->dl_period = tsk->dl.dl_period;
>> + ),
>> +
>> + TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
>> + "dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
>> + __entry->comm, __entry->pid,
>> + __print_symbolic(__entry->policy, SCHEDULING_POLICY),
>> + __entry->nice, __entry->rt_priority,
>> + __entry->dl_runtime, __entry->dl_deadline,
>> + __entry->dl_period)
>> +);
>> #endif /* _TRACE_SCHED_H */
>>
>> /* This part must be outside protection */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 7926993..ac4294a 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
>> struct pid *pid;
>>
>> trace_sched_process_fork(current, p);
>> + trace_sched_prio_update(p);
>>
>> pid = get_task_pid(p, PIDTYPE_PID);
>> nr = pid_vnr(pid);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ce83e39..c729425 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3708,6 +3708,7 @@ void set_user_nice(struct task_struct *p, long nice)
>> resched_curr(rq);
>> }
>> out_unlock:
>> + trace_sched_prio_update(p);
>> task_rq_unlock(rq, p, &rf);
>> }
>> EXPORT_SYMBOL(set_user_nice);
>> @@ -3912,6 +3913,8 @@ static void __setscheduler(struct rq *rq, struct
>> task_struct *p,
>> p->sched_class = &rt_sched_class;
>> else
>> p->sched_class = &fair_sched_class;
>> +
>> + trace_sched_prio_update(p);
>> }
>>
> > static void
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-05 21:50 ` Mathieu Desnoyers
@ 2016-07-06 13:13 ` Steven Rostedt
2016-07-06 13:53 ` Julien Desfossez
0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2016-07-06 13:13 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Julien Desfossez, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
daolivei, linux-kernel
On Tue, 5 Jul 2016 21:50:34 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> >> +
> >> + TP_PROTO(struct task_struct *tsk),
> >> +
> >> + TP_ARGS(tsk),
> >> +
> >> + TP_STRUCT__entry(
> >> + __array( char, comm, TASK_COMM_LEN )
> >
> > I could imagine this being a high frequency tracepoint, especially with
> > a lot of boosting going on. Can we nuke the comm recording and let the
> > userspace tools just hook to the sched_switch tracepoint for that?
>
> We can surely do that.
>
> Just to clarify: currently this tracepoint is *not* hooked on PI boosting,
> as described in the changelog. This tracepoint is about the prio attributes
> set by user-space. The PI boosting temporarily changes the task struct prio
> without updating the associated policy, which seems rather
> implementation-specific and odd to expose.
>
> Thoughts ?
Ah, you're right, I was thinking it was at boosting. But still, it's a
rather hefty tracepoint (lots of fields), probably want to keep from
adding comm too.
>
> Thanks,
>
> Mathieu
>
>
> >
> > -- Steve
> >
> >
> >> + __field( pid_t, pid )
> >> + __field( unsigned int, policy )
> >> + __field( int, nice )
> >> + __field( unsigned int, rt_priority )
> >> + __field( u64, dl_runtime )
> >> + __field( u64, dl_deadline )
> >> + __field( u64, dl_period )
> >> + ),
> >> +
> >> + TP_fast_assign(
> >> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> >> + __entry->pid = tsk->pid;
> >> + __entry->policy = tsk->policy;
> >> + __entry->nice = task_nice(tsk);
> >> + __entry->rt_priority = tsk->rt_priority;
> >> + __entry->dl_runtime = tsk->dl.dl_runtime;
> >> + __entry->dl_deadline = tsk->dl.dl_deadline;
> >> + __entry->dl_period = tsk->dl.dl_period;
> >> + ),
> >> +
> >> + TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
> >> + "dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
> >> + __entry->comm, __entry->pid,
> >> + __print_symbolic(__entry->policy, SCHEDULING_POLICY),
> >> + __entry->nice, __entry->rt_priority,
> >> + __entry->dl_runtime, __entry->dl_deadline,
> >> + __entry->dl_period)
> >> +);
> >> #endif /* _TRACE_SCHED_H */
> >>
> >> /* This part must be outside protection */
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 7926993..ac4294a 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
> >> struct pid *pid;
> >>
> >> trace_sched_process_fork(current, p);
> >> + trace_sched_prio_update(p);
>From the change log:
"It is emitted in the code path of the sched_setscheduler,
sched_setattr, sched_setparam, nice and the fork system calls. For fork, it is emitted
after the sched_process_fork tracepoint for timeline consistency and
because the PID is not yet set when sched_fork() is called."
I'm not convinced this should be needed. I hate adding back to back
tracepoints.
-- Steve
> >> pid = get_task_pid(p, PIDTYPE_PID);
> >> nr = pid_vnr(pid);
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index ce83e39..c729425 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -3708,6 +3708,7 @@ void set_user_nice(struct task_struct *p, long nice)
> >> resched_curr(rq);
> >> }
> >> out_unlock:
> >> + trace_sched_prio_update(p);
> >> task_rq_unlock(rq, p, &rf);
> >> }
> >> EXPORT_SYMBOL(set_user_nice);
> >> @@ -3912,6 +3913,8 @@ static void __setscheduler(struct rq *rq, struct
> >> task_struct *p,
> >> p->sched_class = &rt_sched_class;
> >> else
> >> p->sched_class = &fair_sched_class;
> >> +
> >> + trace_sched_prio_update(p);
> >> }
> >>
> > > static void
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-06 13:13 ` Steven Rostedt
@ 2016-07-06 13:53 ` Julien Desfossez
2016-07-06 14:14 ` Steven Rostedt
2016-07-15 17:46 ` Daniel Bristot de Oliveira
0 siblings, 2 replies; 8+ messages in thread
From: Julien Desfossez @ 2016-07-06 13:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
daolivei, linux-kernel
On 06-Jul-2016 09:13:25 AM, Steven Rostedt wrote:
> On Tue, 5 Jul 2016 21:50:34 +0000 (UTC)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > >
> > >> +
> > >> + TP_PROTO(struct task_struct *tsk),
> > >> +
> > >> + TP_ARGS(tsk),
> > >> +
> > >> + TP_STRUCT__entry(
> > >> + __array( char, comm, TASK_COMM_LEN )
> > >
> > > I could imagine this being a high frequency tracepoint, especially with
> > > a lot of boosting going on. Can we nuke the comm recording and let the
> > > userspace tools just hook to the sched_switch tracepoint for that?
> >
> > We can surely do that.
> >
> > Just to clarify: currently this tracepoint is *not* hooked on PI boosting,
> > as described in the changelog. This tracepoint is about the prio attributes
> > set by user-space. The PI boosting temporarily changes the task struct prio
> > without updating the associated policy, which seems rather
> > implementation-specific and odd to expose.
> >
> > Thoughts ?
>
> Ah, you're right, I was thinking it was at boosting. But still, it's a
> rather hefty tracepoint (lots of fields), probably want to keep from
> adding comm too.
Yes, I agree we can remove the comm field, it is easy to get from the
previous sched_switch.
> > >> + __field( pid_t, pid )
> > >> + __field( unsigned int, policy )
> > >> + __field( int, nice )
> > >> + __field( unsigned int, rt_priority )
> > >> + __field( u64, dl_runtime )
> > >> + __field( u64, dl_deadline )
> > >> + __field( u64, dl_period )
> > >> + ),
> > >> +
> > >> + TP_fast_assign(
> > >> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> > >> + __entry->pid = tsk->pid;
> > >> + __entry->policy = tsk->policy;
> > >> + __entry->nice = task_nice(tsk);
> > >> + __entry->rt_priority = tsk->rt_priority;
> > >> + __entry->dl_runtime = tsk->dl.dl_runtime;
> > >> + __entry->dl_deadline = tsk->dl.dl_deadline;
> > >> + __entry->dl_period = tsk->dl.dl_period;
> > >> + ),
> > >> +
> > >> + TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
> > >> + "dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
> > >> + __entry->comm, __entry->pid,
> > >> + __print_symbolic(__entry->policy, SCHEDULING_POLICY),
> > >> + __entry->nice, __entry->rt_priority,
> > >> + __entry->dl_runtime, __entry->dl_deadline,
> > >> + __entry->dl_period)
> > >> +);
> > >> #endif /* _TRACE_SCHED_H */
> > >>
> > >> /* This part must be outside protection */
> > >> diff --git a/kernel/fork.c b/kernel/fork.c
> > >> index 7926993..ac4294a 100644
> > >> --- a/kernel/fork.c
> > >> +++ b/kernel/fork.c
> > >> @@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
> > >> struct pid *pid;
> > >>
> > >> trace_sched_process_fork(current, p);
> > >> + trace_sched_prio_update(p);
>
> From the change log:
>
> "It is emitted in the code path of the sched_setscheduler,
> sched_setattr, sched_setparam, nice and the fork system calls. For fork, it is emitted
> after the sched_process_fork tracepoint for timeline consistency and
> because the PID is not yet set when sched_fork() is called."
>
> I'm not convinced this should be needed. I hate adding back to back
> tracepoints.
Indeed, having two tracepoints back to back is not pretty. We placed it
here to get the priority of the newly created threads. Maybe a more
appropriate way of doing that would be to extend the sched_process_fork
tracepoint to output the same scheduling informations. Would you prefer
that option ?
Thanks,
Julien
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-06 13:53 ` Julien Desfossez
@ 2016-07-06 14:14 ` Steven Rostedt
2016-07-15 17:46 ` Daniel Bristot de Oliveira
1 sibling, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2016-07-06 14:14 UTC (permalink / raw)
To: Julien Desfossez
Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
daolivei, linux-kernel
On Wed, 6 Jul 2016 09:53:24 -0400
Julien Desfossez <jdesfossez@efficios.com> wrote:
> > I'm not convinced this should be needed. I hate adding back to back
> > tracepoints.
>
> Indeed, having two tracepoints back to back is not pretty. We placed it
> here to get the priority of the newly created threads. Maybe a more
> appropriate way of doing that would be to extend the sched_process_fork
> tracepoint to output the same scheduling informations. Would you prefer
> that option ?
That may be a possibility. Let's see what Peter thinks.
-- Steve
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-06 13:53 ` Julien Desfossez
2016-07-06 14:14 ` Steven Rostedt
@ 2016-07-15 17:46 ` Daniel Bristot de Oliveira
1 sibling, 0 replies; 8+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-07-15 17:46 UTC (permalink / raw)
To: Julien Desfossez, Steven Rostedt
Cc: Mathieu Desnoyers, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
daolivei, linux-kernel
On 07/06/2016 10:53 AM, Julien Desfossez wrote:
>> But still, it's a
>> > rather hefty tracepoint (lots of fields), probably want to keep from
>> > adding comm too.
> Yes, I agree we can remove the comm field, it is easy to get from the
> previous sched_switch.
>
Sorry for the delay. I do liked this tracepoint! about comm, I think
that having the comm will make the tracepoint easier to understand, so I
vote for maintain the comm field. But I also see Steven's point and I
would not complain if you more expert guys agree on removing comm :-).
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] tracing: add sched_prio_update
2016-07-04 19:46 [RFC PATCH] tracing: add sched_prio_update Julien Desfossez
2016-07-05 15:19 ` Steven Rostedt
@ 2016-08-11 10:43 ` Peter Zijlstra
1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2016-08-11 10:43 UTC (permalink / raw)
To: Julien Desfossez
Cc: tglx, rostedt, mingo, daolivei, mathieu.desnoyers, linux-kernel
On Mon, Jul 04, 2016 at 03:46:04PM -0400, Julien Desfossez wrote:
> The effective priority of running threads can also be temporarily
> changed in the PI code, but a dedicated tracepoint is already in place
> to cover this case.
So while we have that tracepoint its not really all that useful.
I would suggest removing it and replacing it with a combination of this
tracepoint and a new blocked-on tracepoint that would let us trace the
entire blocking graph (we could even include !PI primitives).
> Here are a few output examples:
> After fork of a normal task:
> sched_prio_update: comm=bash pid=2104, policy=SCHED_NORMAL, nice=0,
> rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
>
> renice -n 10 of a normal task:
> sched_prio_update: comm=sleep pid=2130, policy=SCHED_NORMAL, nice=10,
> rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
>
> SCHED_FIFO 60:
> sched_prio_update: comm=chrt pid=2105, policy=SCHED_FIFO, nice=0,
> rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
>
> SCHED_RR 60:
> sched_prio_update: comm=chrt pid=2109, policy=SCHED_RR, nice=0,
> rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
>
> SCHED_DEADLINE:
> sched_prio_update: comm=b pid=2110, policy=SCHED_DEADLING, nice=0,
> rt_priority=0, dl_runtime=10000000, dl_deadline=30000000,
> dl_period=30000000
Looks OK.
> +++ b/kernel/fork.c
> @@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
> struct pid *pid;
>
> trace_sched_process_fork(current, p);
> + trace_sched_prio_update(p);
>
> pid = get_task_pid(p, PIDTYPE_PID);
> nr = pid_vnr(pid);
I'm a bit torn on this, like Steven I loathe back to back tracepoints.
Also, per a minimalist argument, we don't need this tracepoint, since a
child will inherit the parents attributes, except in the
SCHED_FLAG_RESET_ON_FORK case.
At the same time, this delta approach to state has the problem that at
the start of tracing we know nothing, which makes it hard to interpret
traces.
Adding the tracepoint here cures it for new tasks in the trace, but
doesn't help anything for pre-existing tasks.
I do feel this is something we need to cure; because I often trace bits
during the 'running' phase of a program and would not get anything.
Suggestions?
So while I have no objections, I would really rather like to see a more
complete approach before moving on this.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2016-08-11 10:43 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-04 19:46 [RFC PATCH] tracing: add sched_prio_update Julien Desfossez
2016-07-05 15:19 ` Steven Rostedt
2016-07-05 21:50 ` Mathieu Desnoyers
2016-07-06 13:13 ` Steven Rostedt
2016-07-06 13:53 ` Julien Desfossez
2016-07-06 14:14 ` Steven Rostedt
2016-07-15 17:46 ` Daniel Bristot de Oliveira
2016-08-11 10:43 ` Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox