* [RFC PATCH v2 0/5] Additional scheduling information in tracepoints
@ 2016-09-23 16:49 Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 1/5] sched: get effective policy and rt_prio Julien Desfossez
` (5 more replies)
0 siblings, 6 replies; 9+ messages in thread
From: Julien Desfossez @ 2016-09-23 16:49 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
This patchset is a proposal to expose more accurate scheduling information in
the kernel trace. The existing scheduling tracepoints currently expose the
"prio" field, which is an internal detail of the kernel and is not enough to
understand the behaviour of the scheduler. To get more accurate information,
we need the nice value, the rt_priority, the policy and the deadline
parameters (period, runtime and deadline).
The problem is that adding all these fields to the existing tracepoints will
quickly bloat the traces, especially for users who do not need these fields.
Moreover, removing the "prio" field might break existing tools.
This patchset proposes a way to connect new probes to existing tracepoints
through the introduction of the TRACE_EVENT_MAP macro, so that the
instrumented code does not have to change and we can create alternative
versions of the existing tracepoints.
With this macro, we propose new versions of the sched_switch, sched_waking,
sched_process_fork and sched_pi_setprio tracepoint probes that contain more
scheduling information and get rid of the "prio" field. We also add the PI
information to these tracepoints: if a process is currently boosted, we show
the name and PID of the top waiter. This makes it possible to see the blocking
chain at a glance even if part of the trace history is missing.
In addition, we also propose a new tracepoint (sched_update_prio) that is
called whenever the scheduling configuration of a process is explicitly
changed.
Changes from v1:
- Add a cover letter
- Fix the signed-off-by chain
- Remove a previously proposed fix that had no effect
- Move the effective_policy/rt_prio helpers to sched/core.c
- Reorder the patchset so that the new TP sched_update_prio is the last one
Julien Desfossez (5):
sched: get effective policy and rt_prio
tracing: add TRACE_EVENT_MAP
tracing: extend scheduling tracepoints
tracing: extend sched_pi_setprio
tracing: add sched_update_prio
include/linux/sched.h | 2 +
include/linux/trace_events.h | 14 +-
include/linux/tracepoint.h | 11 +-
include/trace/define_trace.h | 4 +
include/trace/events/sched.h | 386 +++++++++++++++++++++++++++++++++++++++++++
include/trace/perf.h | 7 +
include/trace/trace_events.h | 50 ++++++
kernel/sched/core.c | 39 +++++
kernel/trace/trace_events.c | 15 +-
9 files changed, 522 insertions(+), 6 deletions(-)
--
1.9.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH v2 1/5] sched: get effective policy and rt_prio
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
@ 2016-09-23 16:49 ` Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 2/5] tracing: add TRACE_EVENT_MAP Julien Desfossez
` (4 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Julien Desfossez @ 2016-09-23 16:49 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
Add helper functions to get the effective policy and rt_priority from the
prio and policy values. This is useful in PI situations because these
fields are not updated in the task; only the sched_class is temporarily
modified.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index af39baf..0c03595 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2594,6 +2594,8 @@ static inline bool is_idle_task(const struct task_struct *p)
struct thread_info thread_info;
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
+extern int effective_policy(int policy, int prio);
+extern int effective_rt_prio(int prio);
#ifndef __HAVE_ARCH_KSTACK_END
static inline int kstack_end(void *addr)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5f7b3c..f3817b5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -916,6 +916,42 @@ static int effective_prio(struct task_struct *p)
return p->prio;
}
+/*
+ * Get the effective policy based on the current prio value.
+ */
+int effective_policy(int policy, int prio)
+{
+ if (dl_prio(prio))
+ return SCHED_DEADLINE;
+
+ /* With RT, the default class is SCHED_FIFO. */
+ if (rt_prio(prio)) {
+ if (policy == SCHED_RR)
+ return SCHED_RR;
+ return SCHED_FIFO;
+ }
+
+ /* With fair, the default class is SCHED_NORMAL. */
+ switch (policy) {
+ case SCHED_NORMAL:
+ case SCHED_IDLE:
+ case SCHED_BATCH:
+ return policy;
+ }
+ return SCHED_NORMAL;
+}
+
+/*
+ * Get the effective rt priority based on the current prio value.
+ */
+int effective_rt_prio(int prio)
+{
+ if (!rt_prio(prio))
+ return 0;
+
+ return MAX_RT_PRIO - 1 - prio;
+}
+
/**
* task_curr - is this task currently executing on a CPU?
* @p: the task in question.
--
1.9.1
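For reference, the mapping implemented by these two helpers can be exercised in a standalone userspace sketch. This is not the kernel code verbatim: the SCHED_* and MAX_RT_PRIO values below are copied from the uapi headers, and dl_prio()/rt_prio() are re-implemented under the usual assumption that deadline tasks have a negative prio and RT tasks occupy [0, MAX_RT_PRIO):

```c
#include <assert.h>

/* Values mirroring include/uapi/linux/sched.h and the kernel prio ranges. */
#define SCHED_NORMAL	0
#define SCHED_FIFO	1
#define SCHED_RR	2
#define SCHED_BATCH	3
#define SCHED_IDLE	5
#define SCHED_DEADLINE	6
#define MAX_RT_PRIO	100

/* Userspace stand-ins for the kernel's dl_prio()/rt_prio(): deadline
 * tasks have a negative prio, RT tasks live in [0, MAX_RT_PRIO). */
static int dl_prio(int prio) { return prio < 0; }
static int rt_prio(int prio) { return prio < MAX_RT_PRIO; }

static int effective_policy(int policy, int prio)
{
	if (dl_prio(prio))
		return SCHED_DEADLINE;

	/* With RT, the default policy is SCHED_FIFO. */
	if (rt_prio(prio))
		return policy == SCHED_RR ? SCHED_RR : SCHED_FIFO;

	/* With fair, the default policy is SCHED_NORMAL. */
	switch (policy) {
	case SCHED_NORMAL:
	case SCHED_IDLE:
	case SCHED_BATCH:
		return policy;
	}
	return SCHED_NORMAL;
}

static int effective_rt_prio(int prio)
{
	if (!rt_prio(prio))
		return 0;

	return MAX_RT_PRIO - 1 - prio;
}
```

With these conventions, a SCHED_FIFO task at rt_priority=90 has kernel prio 9, and effective_rt_prio() recovers 90 from it; a boosted SCHED_NORMAL task whose prio has been pushed into the RT range reports SCHED_FIFO, which is the effective-value behaviour the later tracepoint patches rely on.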
* [RFC PATCH v2 2/5] tracing: add TRACE_EVENT_MAP
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 1/5] sched: get effective policy and rt_prio Julien Desfossez
@ 2016-09-23 16:49 ` Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 3/5] tracing: extend scheduling tracepoints Julien Desfossez
` (3 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Julien Desfossez @ 2016-09-23 16:49 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
This macro creates an alias of an existing TRACE_EVENT. A
TRACE_EVENT_MAP connects a new probe to an existing tracepoint, so it
can be used to create another output of the same tracepoint without
changing the instrumented code.
This makes it possible to create alternate versions of existing
tracepoints that output more or different fields only in specific
use-cases, rather than all the time (which could break existing tools
and/or bloat the trace with fields most users do not need).
The usage is the same as the TRACE_EVENT macro, with the addition of the
"map" parameter, which is the name of the alias; the "name" parameter is
the name of the original tracepoint:
TRACE_EVENT_MAP(name, map, proto, args, tstruct, assign, print)
DEFINE_EVENT_MAP(template, name, map, proto, args)
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
include/linux/trace_events.h | 14 ++++++++++++-
include/linux/tracepoint.h | 11 +++++++++-
include/trace/define_trace.h | 4 ++++
include/trace/perf.h | 7 +++++++
include/trace/trace_events.h | 50 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_events.c | 15 +++++++++----
6 files changed, 95 insertions(+), 6 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index be00761..1f7e0ec 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -217,6 +217,7 @@ enum {
TRACE_EVENT_FL_TRACEPOINT_BIT,
TRACE_EVENT_FL_KPROBE_BIT,
TRACE_EVENT_FL_UPROBE_BIT,
+ TRACE_EVENT_FL_MAP_BIT,
};
/*
@@ -231,6 +232,7 @@ enum {
* TRACEPOINT - Event is a tracepoint
* KPROBE - Event is a kprobe
* UPROBE - Event is a uprobe
+ * MAP - Event maps to a tracepoint as an alias
*/
enum {
TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
@@ -241,10 +243,16 @@ enum {
TRACE_EVENT_FL_TRACEPOINT = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
TRACE_EVENT_FL_KPROBE = (1 << TRACE_EVENT_FL_KPROBE_BIT),
TRACE_EVENT_FL_UPROBE = (1 << TRACE_EVENT_FL_UPROBE_BIT),
+ TRACE_EVENT_FL_MAP = (1 << TRACE_EVENT_FL_MAP_BIT),
};
#define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE)
+struct trace_event_map {
+ struct tracepoint *tp;
+ char *name;
+};
+
struct trace_event_call {
struct list_head list;
struct trace_event_class *class;
@@ -252,6 +260,8 @@ struct trace_event_call {
char *name;
/* Set TRACE_EVENT_FL_TRACEPOINT flag when using "tp" */
struct tracepoint *tp;
+ /* Set TRACE_EVENT_FL_MAP flag when using "map" instead */
+ struct trace_event_map *map;
};
struct trace_event event;
char *print_fmt;
@@ -282,7 +292,9 @@ struct trace_event_call {
static inline const char *
trace_event_name(struct trace_event_call *call)
{
- if (call->flags & TRACE_EVENT_FL_TRACEPOINT)
+ if (call->flags & TRACE_EVENT_FL_MAP)
+ return call->map->name;
+ else if (call->flags & TRACE_EVENT_FL_TRACEPOINT)
return call->tp ? call->tp->name : NULL;
else
return call->name;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index be586c6..b8ab12a 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -276,6 +276,7 @@ static inline void tracepoint_synchronize_unregister(void)
#define DEFINE_TRACE_FN(name, reg, unreg)
#define DEFINE_TRACE(name)
+#define DEFINE_TRACE_MAP(name, map)
#define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
#define EXPORT_TRACEPOINT_SYMBOL(name)
@@ -466,6 +467,13 @@ static inline void tracepoint_synchronize_unregister(void)
*
* A set of (un)registration functions can be passed to the variant
* TRACE_EVENT_FN to perform any (un)registration work.
+ *
+ * TRACE_EVENT_MAP can be used to create alternate versions of a
+ * TRACE_EVENT without modifying the instrumented code. It connects
+ * a different probe to an existing tracepoint, so other fields can be
+ * extracted. The "name" field is the name of the original TRACE_EVENT,
+ * the "map" field is the name of the alias. They can be enabled
+ * independently.
*/
#define DECLARE_EVENT_CLASS(name, proto, args, tstruct, assign, print)
@@ -493,9 +501,10 @@ static inline void tracepoint_synchronize_unregister(void)
struct, assign, print) \
DECLARE_TRACE_CONDITION(name, PARAMS(proto), \
PARAMS(args), PARAMS(cond))
-
#define TRACE_EVENT_FLAGS(event, flag)
#define TRACE_EVENT_PERF_PERM(event, expr...)
+#define TRACE_EVENT_MAP(name, map, proto, args, struct, assign, print)
+
#endif /* ifdef TRACE_EVENT (see note above) */
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 6e3945f..fdd8845 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -26,6 +26,9 @@
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DEFINE_TRACE(name)
+#undef TRACE_EVENT_MAP
+#define TRACE_EVENT_MAP(name, map, proto, args, tstruct, assign, print)
+
#undef TRACE_EVENT_CONDITION
#define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
TRACE_EVENT(name, \
@@ -100,6 +103,7 @@
#undef TRACE_EVENT_FN
#undef TRACE_EVENT_FN_COND
#undef TRACE_EVENT_CONDITION
+#undef TRACE_EVENT_MAP
#undef DECLARE_EVENT_CLASS
#undef DEFINE_EVENT
#undef DEFINE_EVENT_FN
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 04fe68bb..563d441 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -86,5 +86,12 @@
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, call, map, proto, args) \
+static inline void perf_test_probe_##map(void) \
+{ \
+ check_trace_callback_type_##call(perf_trace_##template); \
+}
+
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
#endif /* CONFIG_PERF_EVENTS */
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 467e12f..deb5bc6 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -65,6 +65,15 @@
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+#undef TRACE_EVENT_MAP
+#define TRACE_EVENT_MAP(name, map, proto, args, tstruct, assign, print) \
+ DECLARE_EVENT_CLASS(map, \
+ PARAMS(proto), \
+ PARAMS(args), \
+ PARAMS(tstruct), \
+ PARAMS(assign), \
+ PARAMS(print)); \
+ DEFINE_EVENT_MAP(map, name, map, PARAMS(proto), PARAMS(args));
#undef __field
#define __field(type, item) type item;
@@ -108,6 +117,11 @@
static struct trace_event_call __used \
__attribute__((__aligned__(4))) event_##name
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args) \
+ static struct trace_event_call __used \
+ __attribute__((__aligned__(4))) event_##map
+
#undef DEFINE_EVENT_FN
#define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -191,6 +205,9 @@
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -426,6 +443,9 @@
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -506,6 +526,9 @@
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, name, map, proto, args)
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
@@ -700,6 +723,13 @@
check_trace_callback_type_##call(trace_event_raw_event_##template); \
}
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(template, call, map, proto, args) \
+static inline void ftrace_test_probe_##map(void) \
+{ \
+ check_trace_callback_type_##call(trace_event_raw_event_##template); \
+}
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print)
@@ -749,6 +779,26 @@
static struct trace_event_call __used \
__attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
+#undef DEFINE_EVENT_MAP
+#define DEFINE_EVENT_MAP(_template, _call, _map, _proto, _args) \
+ \
+static struct trace_event_map event_map_##_map = { \
+ .tp = &__tracepoint_##_call, \
+ .name = #_map, \
+}; \
+ \
+static struct trace_event_call __used event_##_map = { \
+ .class = &event_class_##_template, \
+ { \
+ .map = &event_map_##_map, \
+ }, \
+ .event.funcs = &trace_event_type_funcs_##_template, \
+ .print_fmt = print_fmt_##_template, \
+ .flags = TRACE_EVENT_FL_TRACEPOINT | TRACE_EVENT_FL_MAP,\
+}; \
+static struct trace_event_call __used \
+__attribute__((section("_ftrace_events"))) *__event_##_map = &event_##_map
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, call, proto, args, print) \
\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 03c0a48..8b46dd8 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -327,26 +327,33 @@ int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
+ struct tracepoint *tp;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
+
+ if (call->flags & TRACE_EVENT_FL_MAP)
+ tp = call->map->tp;
+ else
+ tp = call->tp;
+
switch (type) {
case TRACE_REG_REGISTER:
- return tracepoint_probe_register(call->tp,
+ return tracepoint_probe_register(tp,
call->class->probe,
file);
case TRACE_REG_UNREGISTER:
- tracepoint_probe_unregister(call->tp,
+ tracepoint_probe_unregister(tp,
call->class->probe,
file);
return 0;
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
- return tracepoint_probe_register(call->tp,
+ return tracepoint_probe_register(tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
- tracepoint_probe_unregister(call->tp,
+ tracepoint_probe_unregister(tp,
call->class->perf_probe,
call);
return 0;
--
1.9.1
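To see why trace_event_name() tests TRACE_EVENT_FL_MAP before TRACE_EVENT_FL_TRACEPOINT, note that DEFINE_EVENT_MAP sets both flags on the alias: the TRACEPOINT flag keeps the registration path working, while the MAP flag says the union member in use is "map" rather than "tp". A simplified userspace model of this dispatch (hypothetical struct layouts, not the kernel definitions) could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of the flag/union dispatch, not the kernel structs. */
#define TRACE_EVENT_FL_TRACEPOINT	(1 << 0)
#define TRACE_EVENT_FL_MAP		(1 << 1)

struct tracepoint { const char *name; };
struct trace_event_map {
	struct tracepoint *tp;
	const char *name;
};

struct trace_event_call {
	int flags;
	union {
		const char *name;		/* neither flag set */
		struct tracepoint *tp;		/* FL_TRACEPOINT */
		struct trace_event_map *map;	/* FL_TRACEPOINT | FL_MAP */
	};
};

static const char *trace_event_name(struct trace_event_call *call)
{
	/* FL_MAP must be tested first: a mapped event also carries
	 * FL_TRACEPOINT so registration goes through the tracepoint
	 * path, but its union member in use is "map", not "tp". */
	if (call->flags & TRACE_EVENT_FL_MAP)
		return call->map->name;
	if (call->flags & TRACE_EVENT_FL_TRACEPOINT)
		return call->tp ? call->tp->name : NULL;
	return call->name;
}

/* The tracepoint probes attach to, regardless of alias status. */
static struct tracepoint *trace_event_tp(struct trace_event_call *call)
{
	return (call->flags & TRACE_EVENT_FL_MAP) ? call->map->tp : call->tp;
}
```

trace_event_reg() follows the same rule in the patch: it resolves the tracepoint through call->map->tp for aliases, so the original event and its alias both attach their probes to the same tracepoint while keeping distinct names, and can therefore be enabled independently.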
* [RFC PATCH v2 3/5] tracing: extend scheduling tracepoints
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 1/5] sched: get effective policy and rt_prio Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 2/5] tracing: add TRACE_EVENT_MAP Julien Desfossez
@ 2016-09-23 16:49 ` Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 4/5] tracing: extend sched_pi_setprio Julien Desfossez
` (2 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: Julien Desfossez @ 2016-09-23 16:49 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
Create alternate versions of the sched_switch, sched_waking and
sched_process_fork tracepoint probes to output priority-related fields
and PI top-waiter if any.
This uses the TRACE_EVENT_MAP macro, so the instrumented code and the
already existing tracepoints are untouched.
We only expose the priority-related fields visible from userspace,
leaving out the "prio" value, which should really be considered a
kernel-internal representation of the task priority and must be expected
to be eventually deprecated. The values output are the effective values,
which may differ from the thread's normal values in PI scenarios.
We also output the comm and PID of the process blocked by the task if it
is in a PI situation. These fields make it possible to quickly identify
PI situations without having to keep track of all the
sched_pi_setprio/sched_pi_update_prio events and state.
Here is an example of the output from these new probes:
sched_process_fork_prio: comm=bash, pid=1988, child_comm=bash,
child_pid=2129, child_policy=SCHED_NORMAL, child_nice=0,
child_rt_priority=0, child_dl_runtime=0,
child_dl_deadline=0, child_dl_period=0
No PI:
sched_switch_prio: prev_comm=swapper/6, prev_pid=0, prev_policy=SCHED_NORMAL,
prev_nice=0, prev_rt_priority=0, prev_dl_runtime=0,
prev_dl_deadline=0, prev_dl_period=0, prev_state=R,
prev_top_waiter_comm=, prev_top_waiter_pid=-1 ==>
next_comm=bash, next_pid=3817, next_policy=SCHED_NORMAL,
next_nice=0, next_rt_priority=0, next_dl_runtime=0,
next_dl_deadline=0, next_dl_period=0, next_top_waiter_comm=,
next_top_waiter_pid=-1
sched_waking_prio: comm=migration/6, pid=38, target_cpu=006,
policy=SCHED_FIFO, nice=0, rt_priority=99, dl_runtime=0,
dl_deadline=0, dl_period=0, top_waiter_comm=, top_waiter_pid=-1
PI:
sched_switch_prio: prev_comm=swapper/1, prev_pid=0, prev_policy=SCHED_NORMAL,
prev_nice=0, prev_rt_priority=0, prev_dl_runtime=0,
prev_dl_deadline=0, prev_dl_period=0, prev_state=R,
prev_top_waiter_comm=, prev_top_waiter_pid=-1 ==>
next_comm=lowprio1, next_pid=3818, next_policy=SCHED_FIFO,
next_nice=0, next_rt_priority=90, next_dl_runtime=0,
next_dl_deadline=0, next_dl_period=0,
next_top_waiter_comm=highprio0, next_top_waiter_pid=3820
sched_waking_prio: comm=lowprio1, pid=3818, target_cpu=001, policy=SCHED_FIFO,
nice=0, rt_priority=90, dl_runtime=0, dl_deadline=0,
dl_period=0, top_waiter_comm=highprio0, top_waiter_pid=3820
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
include/trace/events/sched.h | 222 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 222 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..6880682 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -103,6 +103,64 @@
TP_PROTO(struct task_struct *p),
TP_ARGS(p));
+TRACE_EVENT_MAP(sched_waking, sched_waking_prio,
+
+ TP_PROTO(struct task_struct *p),
+
+ TP_ARGS(p),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, target_cpu )
+ __field( unsigned int, policy )
+ __field( int, nice )
+ __field( unsigned int, rt_priority )
+ __field( u64, dl_runtime )
+ __field( u64, dl_deadline )
+ __field( u64, dl_period )
+ __array( char, top_waiter_comm, TASK_COMM_LEN )
+ __field( pid_t, top_waiter_pid )
+ ),
+
+ TP_fast_assign(
+ struct task_struct *top_waiter = rt_mutex_get_top_task(p);
+
+ memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+ __entry->pid = p->pid;
+ __entry->target_cpu = task_cpu(p);
+ __entry->policy = effective_policy(
+ p->policy, p->prio);
+ __entry->nice = task_nice(p);
+ __entry->rt_priority = effective_rt_prio(
+ p->prio);
+ __entry->dl_runtime = dl_prio(p->prio) ?
+ p->dl.dl_runtime : 0;
+ __entry->dl_deadline = dl_prio(p->prio) ?
+ p->dl.dl_deadline : 0;
+ __entry->dl_period = dl_prio(p->prio) ?
+ p->dl.dl_period : 0;
+ if (top_waiter) {
+ memcpy(__entry->top_waiter_comm, top_waiter->comm,
+ TASK_COMM_LEN);
+ __entry->top_waiter_pid = top_waiter->pid;
+ } else {
+ __entry->top_waiter_comm[0] = '\0';
+ __entry->top_waiter_pid = -1;
+ }
+ ),
+
+ TP_printk("comm=%s, pid=%d, target_cpu=%03d, policy=%s, "
+ "nice=%d, rt_priority=%u, dl_runtime=%Lu, "
+ "dl_deadline=%Lu, dl_period=%Lu, "
+ "top_waiter_comm=%s, top_waiter_pid=%d",
+ __entry->comm, __entry->pid, __entry->target_cpu,
+ __print_symbolic(__entry->policy, SCHEDULING_POLICY),
+ __entry->nice, __entry->rt_priority, __entry->dl_runtime,
+ __entry->dl_deadline, __entry->dl_period,
+ __entry->top_waiter_comm, __entry->top_waiter_pid)
+);
+
#ifdef CREATE_TRACE_POINTS
static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
{
@@ -162,6 +220,121 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
);
/*
+ * Tracepoint for task switches, performed by the scheduler:
+ */
+TRACE_EVENT_MAP(sched_switch, sched_switch_prio,
+ TP_PROTO(bool preempt,
+ struct task_struct *prev,
+ struct task_struct *next),
+
+ TP_ARGS(preempt, prev, next),
+
+ TP_STRUCT__entry(
+ __array( char, prev_comm, TASK_COMM_LEN )
+ __field( pid_t, prev_pid )
+ __field( long, prev_state )
+ __field( unsigned int, prev_policy )
+ __field( int, prev_nice )
+ __field( unsigned int, prev_rt_priority )
+ __field( u64, prev_dl_runtime )
+ __field( u64, prev_dl_deadline )
+ __field( u64, prev_dl_period )
+ __array( char, prev_top_waiter_comm, TASK_COMM_LEN )
+ __field( pid_t, prev_top_waiter_pid )
+ __array( char, next_comm, TASK_COMM_LEN )
+ __field( pid_t, next_pid )
+ __field( unsigned int, next_policy )
+ __field( int, next_nice )
+ __field( unsigned int, next_rt_priority )
+ __field( u64, next_dl_runtime )
+ __field( u64, next_dl_deadline )
+ __field( u64, next_dl_period )
+ __array( char, next_top_waiter_comm, TASK_COMM_LEN )
+ __field( pid_t, next_top_waiter_pid )
+ ),
+
+ TP_fast_assign(
+ struct task_struct *prev_top = rt_mutex_get_top_task(prev);
+ struct task_struct *next_top = rt_mutex_get_top_task(next);
+
+ memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+ __entry->prev_pid = prev->pid;
+ __entry->prev_state = __trace_sched_switch_state(
+ preempt, prev);
+ __entry->prev_policy = effective_policy(
+ prev->policy, prev->prio);
+ __entry->prev_nice = task_nice(prev);
+ __entry->prev_rt_priority = effective_rt_prio(
+ prev->prio);
+ __entry->prev_dl_runtime = dl_prio(prev->prio) ?
+ prev->dl.dl_runtime : 0;
+ __entry->prev_dl_deadline = dl_prio(prev->prio) ?
+ prev->dl.dl_deadline : 0;
+ __entry->prev_dl_period = dl_prio(prev->prio) ?
+ prev->dl.dl_period : 0;
+ if (prev_top) {
+ memcpy(__entry->prev_top_waiter_comm, prev_top->comm,
+ TASK_COMM_LEN);
+ __entry->prev_top_waiter_pid = prev_top->pid;
+ } else {
+ __entry->prev_top_waiter_comm[0] = '\0';
+ __entry->prev_top_waiter_pid = -1;
+ }
+
+ memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ __entry->next_pid = next->pid;
+ __entry->next_policy = effective_policy(
+ next->policy, next->prio);
+ __entry->next_nice = task_nice(next);
+ __entry->next_rt_priority = effective_rt_prio(
+ next->prio);
+ __entry->next_dl_runtime = dl_prio(next->prio) ?
+ next->dl.dl_runtime : 0;
+ __entry->next_dl_deadline = dl_prio(next->prio) ?
+ next->dl.dl_deadline : 0;
+ __entry->next_dl_period = dl_prio(next->prio) ?
+ next->dl.dl_period : 0;
+ if (next_top) {
+ memcpy(__entry->next_top_waiter_comm, next_top->comm,
+ TASK_COMM_LEN);
+ __entry->next_top_waiter_pid = next_top->pid;
+ } else {
+ __entry->next_top_waiter_comm[0] = '\0';
+ __entry->next_top_waiter_pid = -1;
+ }
+ ),
+
+ TP_printk("prev_comm=%s, prev_pid=%d, prev_policy=%s, prev_nice=%d, "
+ "prev_rt_priority=%u, prev_dl_runtime=%Lu, "
+ "prev_dl_deadline=%Lu, prev_dl_period=%Lu, "
+ "prev_state=%s%s, prev_top_waiter_comm=%s, "
+ "prev_top_waiter_pid=%d ==> next_comm=%s, next_pid=%d, "
+ "next_policy=%s, next_nice=%d, next_rt_priority=%u, "
+ "next_dl_runtime=%Lu, next_dl_deadline=%Lu, "
+ "next_dl_period=%Lu, next_top_waiter_comm=%s, "
+ "next_top_waiter_pid=%d",
+ __entry->prev_comm, __entry->prev_pid,
+ __print_symbolic(__entry->prev_policy, SCHEDULING_POLICY),
+ __entry->prev_nice, __entry->prev_rt_priority,
+ __entry->prev_dl_runtime, __entry->prev_dl_deadline,
+ __entry->prev_dl_period,
+ __entry->prev_state & (TASK_STATE_MAX-1) ?
+ __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+ { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+ { 16, "Z" }, { 32, "X" }, { 64, "x" },
+ { 128, "K" }, { 256, "W" }, { 512, "P" },
+ { 1024, "N" }) : "R",
+ __entry->prev_state & TASK_STATE_MAX ? "+" : "",
+ __entry->prev_top_waiter_comm, __entry->prev_top_waiter_pid,
+ __entry->next_comm, __entry->next_pid,
+ __print_symbolic(__entry->next_policy, SCHEDULING_POLICY),
+ __entry->next_nice, __entry->next_rt_priority,
+ __entry->next_dl_runtime, __entry->next_dl_deadline,
+ __entry->next_dl_period, __entry->next_top_waiter_comm,
+ __entry->next_top_waiter_pid)
+);
+
+/*
* Tracepoint for a task being migrated:
*/
TRACE_EVENT(sched_migrate_task,
@@ -288,6 +461,55 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
__entry->child_comm, __entry->child_pid)
);
+TRACE_EVENT_MAP(sched_process_fork, sched_process_fork_prio,
+
+ TP_PROTO(struct task_struct *parent, struct task_struct *child),
+
+ TP_ARGS(parent, child),
+
+ TP_STRUCT__entry(
+ __array( char, parent_comm, TASK_COMM_LEN )
+ __field( pid_t, parent_pid )
+ __array( char, child_comm, TASK_COMM_LEN )
+ __field( pid_t, child_pid )
+ __field( unsigned int, child_policy )
+ __field( int, child_nice )
+ __field( unsigned int, child_rt_priority )
+ __field( u64, child_dl_runtime )
+ __field( u64, child_dl_deadline )
+ __field( u64, child_dl_period )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->parent_comm, parent->comm, TASK_COMM_LEN);
+ __entry->parent_pid = parent->pid;
+ memcpy(__entry->child_comm, child->comm, TASK_COMM_LEN);
+ __entry->child_pid = child->pid;
+ __entry->child_policy = effective_policy(
+ child->policy, child->prio);
+ __entry->child_nice = task_nice(child);
+ __entry->child_rt_priority = effective_rt_prio(
+ child->prio);
+ __entry->child_dl_runtime = dl_prio(child->prio) ?
+ child->dl.dl_runtime : 0;
+ __entry->child_dl_deadline = dl_prio(child->prio) ?
+ child->dl.dl_deadline : 0;
+ __entry->child_dl_period = dl_prio(child->prio) ?
+ child->dl.dl_period : 0;
+ ),
+
+ TP_printk("comm=%s, pid=%d, child_comm=%s, child_pid=%d, "
+ "child_policy=%s, child_nice=%d, "
+ "child_rt_priority=%u, child_dl_runtime=%Lu, "
+ "child_dl_deadline=%Lu, child_dl_period=%Lu",
+ __entry->parent_comm, __entry->parent_pid,
+ __entry->child_comm, __entry->child_pid,
+ __print_symbolic(__entry->child_policy, SCHEDULING_POLICY),
+ __entry->child_nice, __entry->child_rt_priority,
+ __entry->child_dl_runtime, __entry->child_dl_deadline,
+ __entry->child_dl_period)
+);
+
/*
* Tracepoint for exec:
*/
--
1.9.1
* [RFC PATCH v2 4/5] tracing: extend sched_pi_setprio
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
` (2 preceding siblings ...)
2016-09-23 16:49 ` [RFC PATCH v2 3/5] tracing: extend scheduling tracepoints Julien Desfossez
@ 2016-09-23 16:49 ` Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 5/5] tracing: add sched_update_prio Julien Desfossez
2016-09-26 12:27 ` [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Peter Zijlstra
5 siblings, 0 replies; 9+ messages in thread
From: Julien Desfossez @ 2016-09-23 16:49 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
Use the TRACE_EVENT_MAP macro to extend sched_pi_setprio into
sched_pi_update_prio. The pre-existing event is untouched. This gets rid
of the old/new prio fields and instead outputs the full scheduling
update based on the top waiter of the rtmutex.
Boosting:
sched_pi_update_prio: comm=lowprio1, pid=3818, old_policy=SCHED_NORMAL,
old_nice=0, old_rt_priority=0, old_dl_runtime=0,
old_dl_deadline=0, old_dl_period=0, top_waiter_comm=highprio0,
top_waiter_pid=3820, new_policy=SCHED_FIFO, new_nice=0,
new_rt_priority=90, new_dl_runtime=0, new_dl_deadline=0,
new_dl_period=0
Unboosting:
sched_pi_update_prio: comm=lowprio1, pid=3818, old_policy=SCHED_FIFO,
old_nice=0, old_rt_priority=90, old_dl_runtime=0, old_dl_deadline=0,
old_dl_period=0, top_waiter_comm=, top_waiter_pid=-1,
new_policy=SCHED_NORMAL, new_nice=0, new_rt_priority=0,
new_dl_runtime=0, new_dl_deadline=0, new_dl_period=0
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
include/trace/events/sched.h | 96 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 96 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 6880682..582357d 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -658,6 +658,102 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
__entry->oldprio, __entry->newprio)
);
+/*
+ * Extract the complete scheduling information from before and
+ * after the priority change.
+ */
+TRACE_EVENT_MAP(sched_pi_setprio, sched_pi_update_prio,
+
+ TP_PROTO(struct task_struct *tsk, int newprio),
+
+ TP_ARGS(tsk, newprio),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned int, old_policy )
+ __field( int, old_nice )
+ __field( unsigned int, old_rt_priority )
+ __field( u64, old_dl_runtime )
+ __field( u64, old_dl_deadline )
+ __field( u64, old_dl_period )
+ __array( char, top_waiter_comm, TASK_COMM_LEN )
+ __field( pid_t, top_waiter_pid )
+ __field( unsigned int, new_policy )
+ __field( int, new_nice )
+ __field( unsigned int, new_rt_priority )
+ __field( u64, new_dl_runtime )
+ __field( u64, new_dl_deadline )
+ __field( u64, new_dl_period )
+ ),
+
+ TP_fast_assign(
+ struct task_struct *top_waiter = rt_mutex_get_top_task(tsk);
+
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->old_policy = effective_policy(
+ tsk->policy, tsk->prio);
+ __entry->old_nice = task_nice(tsk);
+ __entry->old_rt_priority = effective_rt_prio(
+ tsk->prio);
+ __entry->old_dl_runtime = dl_prio(tsk->prio) ?
+ tsk->dl.dl_runtime : 0;
+ __entry->old_dl_deadline = dl_prio(tsk->prio) ?
+ tsk->dl.dl_deadline : 0;
+ __entry->old_dl_period = dl_prio(tsk->prio) ?
+ tsk->dl.dl_period : 0;
+ if (top_waiter) {
+ memcpy(__entry->top_waiter_comm, top_waiter->comm, TASK_COMM_LEN);
+ __entry->top_waiter_pid = top_waiter->pid;
+ /*
+ * The effective policy depends on the current policy of
+ * the target task.
+ */
+ __entry->new_policy = effective_policy(
+ tsk->policy, top_waiter->prio);
+ __entry->new_nice = task_nice(top_waiter);
+ __entry->new_rt_priority = effective_rt_prio(
+ top_waiter->prio);
+ __entry->new_dl_runtime = dl_prio(top_waiter->prio) ?
+ top_waiter->dl.dl_runtime : 0;
+ __entry->new_dl_deadline = dl_prio(top_waiter->prio) ?
+ top_waiter->dl.dl_deadline : 0;
+ __entry->new_dl_period = dl_prio(top_waiter->prio) ?
+ top_waiter->dl.dl_period : 0;
+ } else {
+ __entry->top_waiter_comm[0] = '\0';
+ __entry->top_waiter_pid = -1;
+ __entry->new_policy = 0;
+ __entry->new_nice = 0;
+ __entry->new_rt_priority = 0;
+ __entry->new_dl_runtime = 0;
+ __entry->new_dl_deadline = 0;
+ __entry->new_dl_period = 0;
+ }
+ ),
+
+ TP_printk("comm=%s, pid=%d, old_policy=%s, old_nice=%d, "
+ "old_rt_priority=%u, old_dl_runtime=%Lu, "
+ "old_dl_deadline=%Lu, old_dl_period=%Lu, "
+ "top_waiter_comm=%s, top_waiter_pid=%d, new_policy=%s, "
+ "new_nice=%d, new_rt_priority=%u, "
+ "new_dl_runtime=%Lu, new_dl_deadline=%Lu, "
+ "new_dl_period=%Lu",
+ __entry->comm, __entry->pid,
+ __print_symbolic(__entry->old_policy, SCHEDULING_POLICY),
+ __entry->old_nice, __entry->old_rt_priority,
+ __entry->old_dl_runtime, __entry->old_dl_deadline,
+ __entry->old_dl_period,
+ __entry->top_waiter_comm, __entry->top_waiter_pid,
+ __entry->top_waiter_pid != -1 ?
+ __print_symbolic(__entry->new_policy,
+ SCHEDULING_POLICY) : "",
+ __entry->new_nice, __entry->new_rt_priority,
+ __entry->new_dl_runtime, __entry->new_dl_deadline,
+ __entry->new_dl_period)
+);
+
#ifdef CONFIG_DETECT_HUNG_TASK
TRACE_EVENT(sched_process_hang,
TP_PROTO(struct task_struct *tsk),
--
1.9.1
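For tools post-processing the rendered output of the sched_pi_update_prio event above, the comma-separated key=value layout of its TP_printk format is easy to split mechanically. A minimal sketch (Python purely for illustration; the sample line is fabricated from the format string, not captured from a real trace):

```python
def parse_sched_fields(line):
    """Parse the 'key=value, key=value, ...' pairs of a rendered trace line.

    Assumes the comma-separated layout of the TP_printk format above;
    this is an illustration, not a stable kernel ABI.
    """
    fields = {}
    for pair in line.split(", "):
        key, _, value = pair.partition("=")
        # Numeric fields become ints; comm and policy strings stay as-is.
        fields[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return fields

# Fabricated sample following the sched_pi_update_prio format string.
example = ("comm=bash, pid=2104, old_policy=SCHED_NORMAL, old_nice=0, "
           "old_rt_priority=0, old_dl_runtime=0, old_dl_deadline=0, "
           "old_dl_period=0, top_waiter_comm=rtworker, top_waiter_pid=2200, "
           "new_policy=SCHED_FIFO, new_nice=0, new_rt_priority=60, "
           "new_dl_runtime=0, new_dl_deadline=0, new_dl_period=0")
f = parse_sched_fields(example)
print(f["pid"], f["new_policy"], f["new_rt_priority"])  # 2104 SCHED_FIFO 60
```

A boosted task shows up as top_waiter_pid != -1 together with the new_* attributes inherited from the top waiter.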
* [RFC PATCH v2 5/5] tracing: add sched_update_prio
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
` (3 preceding siblings ...)
2016-09-23 16:49 ` [RFC PATCH v2 4/5] tracing: extend sched_pi_setprio Julien Desfossez
@ 2016-09-23 16:49 ` Julien Desfossez
2016-09-24 13:28 ` Mathieu Desnoyers
2016-09-26 12:27 ` [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Peter Zijlstra
5 siblings, 1 reply; 9+ messages in thread
From: Julien Desfossez @ 2016-09-23 16:49 UTC (permalink / raw)
To: peterz, tglx, rostedt, mingo, daolivei
Cc: mathieu.desnoyers, linux-kernel, Julien Desfossez
This tracepoint keeps track of all explicit priority changes of a
task. It outputs the scheduling policy, the nice value, the
rt_priority and the deadline-related attributes (dl_runtime, dl_deadline
and dl_period).
It is emitted in the code path of the sched_setscheduler, sched_setattr,
sched_setparam, and nice system calls.
This allows the analysis of real-time scheduling delays based on the
configured scheduling priorities and policies, which cannot be performed
with the current instrumentation in sched_switch. Also, instead of
exposing the internal kernel prio field, this tracepoint only outputs
the user-visible priority attributes.
The effective priority of running threads can also be temporarily
changed in the PI code, but a dedicated tracepoint is already in place
to cover this case.
Here are a few output examples:
After fork of a normal task:
sched_update_prio: comm=bash pid=2104, policy=SCHED_NORMAL, nice=0,
rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
renice -n 10 of a normal task:
sched_update_prio: comm=sleep pid=2130, policy=SCHED_NORMAL, nice=10,
rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
SCHED_FIFO 60:
sched_update_prio: comm=chrt pid=2105, policy=SCHED_FIFO, nice=0,
rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
SCHED_RR 60:
sched_update_prio: comm=chrt pid=2109, policy=SCHED_RR, nice=0,
rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
SCHED_DEADLINE:
sched_update_prio: comm=b pid=2110, policy=SCHED_DEADLINE, nice=0,
rt_priority=0, dl_runtime=10000000, dl_deadline=30000000,
dl_period=30000000
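To make the SCHED_DEADLINE numbers above concrete: the dl_* attributes are in nanoseconds, and the CPU bandwidth reserved by deadline admission control is runtime/period. A quick worked example using the values from the last sample:

```python
# dl_* values from the SCHED_DEADLINE example above, in nanoseconds.
dl_runtime = 10_000_000   # up to 10 ms of CPU time...
dl_deadline = 30_000_000  # ...consumed before this relative deadline...
dl_period = 30_000_000    # ...in every 30 ms activation period

# Bandwidth reserved for the task by deadline admission control.
bandwidth = dl_runtime / dl_period
print(f"{bandwidth:.1%} of one CPU")  # 33.3% of one CPU
```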
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
---
include/trace/events/sched.h | 68 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 3 ++
2 files changed, 71 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 582357d..8d3343b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -8,6 +8,34 @@
#include <linux/tracepoint.h>
#include <linux/binfmts.h>
+#define SCHEDULING_POLICY \
+ EM( SCHED_NORMAL, "SCHED_NORMAL") \
+ EM( SCHED_FIFO, "SCHED_FIFO") \
+ EM( SCHED_RR, "SCHED_RR") \
+ EM( SCHED_BATCH, "SCHED_BATCH") \
+ EM( SCHED_IDLE, "SCHED_IDLE") \
+ EMe(SCHED_DEADLINE, "SCHED_DEADLINE")
+
+/*
+ * First define the enums in the above macros to be exported to userspace
+ * via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b) TRACE_DEFINE_ENUM(a);
+
+SCHEDULING_POLICY
+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) {a, b},
+#define EMe(a, b) {a, b}
+
/*
* Tracepoint for calling kthread_stop, performed to end a kthread:
*/
@@ -880,6 +908,46 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
TP_printk("cpu=%d", __entry->cpu)
);
+
+/*
+ * Tracepoint for showing scheduling priority changes.
+ */
+TRACE_EVENT(sched_update_prio,
+
+ TP_PROTO(struct task_struct *tsk),
+
+ TP_ARGS(tsk),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned int, policy )
+ __field( int, nice )
+ __field( unsigned int, rt_priority )
+ __field( u64, dl_runtime )
+ __field( u64, dl_deadline )
+ __field( u64, dl_period )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->policy = tsk->policy;
+ __entry->nice = task_nice(tsk);
+ __entry->rt_priority = tsk->rt_priority;
+ __entry->dl_runtime = tsk->dl.dl_runtime;
+ __entry->dl_deadline = tsk->dl.dl_deadline;
+ __entry->dl_period = tsk->dl.dl_period;
+ ),
+
+ TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
+ "dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
+ __entry->comm, __entry->pid,
+ __print_symbolic(__entry->policy, SCHEDULING_POLICY),
+ __entry->nice, __entry->rt_priority,
+ __entry->dl_runtime, __entry->dl_deadline,
+ __entry->dl_period)
+);
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3817b5..f3c4a7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3787,6 +3787,7 @@ void set_user_nice(struct task_struct *p, long nice)
resched_curr(rq);
}
out_unlock:
+ trace_sched_update_prio(p);
task_rq_unlock(rq, p, &rf);
}
EXPORT_SYMBOL(set_user_nice);
@@ -3991,6 +3992,8 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
p->sched_class = &rt_sched_class;
else
p->sched_class = &fair_sched_class;
+
+ trace_sched_update_prio(p);
}
static void
--
1.9.1
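On the analysis side, the enum values that TRACE_DEFINE_ENUM exports can be mirrored to resolve raw policy numbers read from a binary trace. A sketch of such a userspace table (the numeric values are the uapi <linux/sched.h> policy constants; the helper name is illustrative):

```python
# Userspace-side mirror of the SCHEDULING_POLICY table above. The numeric
# values come from the uapi <linux/sched.h> policy constants; exporting them
# via TRACE_DEFINE_ENUM is what lets trace tools print names instead.
SCHEDULING_POLICY = {
    0: "SCHED_NORMAL",
    1: "SCHED_FIFO",
    2: "SCHED_RR",
    3: "SCHED_BATCH",
    5: "SCHED_IDLE",    # 4 is SCHED_ISO, reserved but unimplemented
    6: "SCHED_DEADLINE",
}

def policy_name(policy):
    """Resolve a raw policy number from a binary trace to its name."""
    return SCHEDULING_POLICY.get(policy, f"UNKNOWN({policy})")

print(policy_name(6))  # SCHED_DEADLINE
```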
* Re: [RFC PATCH v2 5/5] tracing: add sched_update_prio
2016-09-23 16:49 ` [RFC PATCH v2 5/5] tracing: add sched_update_prio Julien Desfossez
@ 2016-09-24 13:28 ` Mathieu Desnoyers
0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2016-09-24 13:28 UTC (permalink / raw)
To: Julien Desfossez
Cc: Peter Zijlstra, Thomas Gleixner, rostedt, Ingo Molnar, daolivei,
linux-kernel
----- On Sep 23, 2016, at 12:49 PM, Julien Desfossez jdesfossez@efficios.com wrote:
> This tracepoint keeps track of all explicit priority changes of a
> task. It outputs the scheduling policy, the nice value, the
> rt_priority and the deadline-related attributes (dl_runtime, dl_deadline
> and dl_period).
>
> It is emitted in the code path of the sched_setscheduler, sched_setattr,
> sched_setparam, and nice system calls.
>
> This allows the analysis of real-time scheduling delays based on the
> configured scheduling priorities and policies, which cannot be performed
> with the current instrumentation in sched_switch. Also, instead of
> exposing the internal kernel prio field, this tracepoint only outputs
> the user-visible priority attributes.
>
> The effective priority of running threads can also be temporarily
> changed in the PI code, but a dedicated tracepoint is already in place
> to cover this case.
>
> Here are a few output examples:
> After fork of a normal task:
> sched_update_prio: comm=bash pid=2104, policy=SCHED_NORMAL, nice=0,
> rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
>
> renice -n 10 of a normal task:
> sched_update_prio: comm=sleep pid=2130, policy=SCHED_NORMAL, nice=10,
> rt_priority=0, dl_runtime=0, dl_deadline=0, dl_period=0
>
> SCHED_FIFO 60:
> sched_update_prio: comm=chrt pid=2105, policy=SCHED_FIFO, nice=0,
> rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
>
> SCHED_RR 60:
> sched_update_prio: comm=chrt pid=2109, policy=SCHED_RR, nice=0,
> rt_priority=60, dl_runtime=0, dl_deadline=0, dl_period=0
>
> SCHED_DEADLINE:
> sched_update_prio: comm=b pid=2110, policy=SCHED_DEADLINE, nice=0,
> rt_priority=0, dl_runtime=10000000, dl_deadline=30000000,
> dl_period=30000000
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
> ---
> include/trace/events/sched.h | 68 ++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/core.c | 3 ++
> 2 files changed, 71 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 582357d..8d3343b 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -8,6 +8,34 @@
> #include <linux/tracepoint.h>
> #include <linux/binfmts.h>
>
> +#define SCHEDULING_POLICY \
> + EM( SCHED_NORMAL, "SCHED_NORMAL") \
> + EM( SCHED_FIFO, "SCHED_FIFO") \
> + EM( SCHED_RR, "SCHED_RR") \
> + EM( SCHED_BATCH, "SCHED_BATCH") \
> + EM( SCHED_IDLE, "SCHED_IDLE") \
> + EMe(SCHED_DEADLINE, "SCHED_DEADLINE")
> +
As spotted by the build bot, you should either move this
definition to patch 3/5 (which requires it), or reorder your
patchset to put this patch before the two that require it.
> +/*
> + * First define the enums in the above macros to be exported to userspace
> + * via TRACE_DEFINE_ENUM().
> + */
> +#undef EM
> +#undef EMe
> +#define EM(a, b) TRACE_DEFINE_ENUM(a);
> +#define EMe(a, b) TRACE_DEFINE_ENUM(a);
> +
> +SCHEDULING_POLICY
> +
> +/*
> + * Now redefine the EM() and EMe() macros to map the enums to the strings
> + * that will be printed in the output.
> + */
> +#undef EM
> +#undef EMe
> +#define EM(a, b) {a, b},
> +#define EMe(a, b) {a, b}
> +
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH v2 0/5] Additional scheduling information in tracepoints
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
` (4 preceding siblings ...)
2016-09-23 16:49 ` [RFC PATCH v2 5/5] tracing: add sched_update_prio Julien Desfossez
@ 2016-09-26 12:27 ` Peter Zijlstra
2016-09-26 19:37 ` Mathieu Desnoyers
5 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:27 UTC (permalink / raw)
To: Julien Desfossez
Cc: tglx, rostedt, mingo, daolivei, mathieu.desnoyers, linux-kernel
On Fri, Sep 23, 2016 at 12:49:30PM -0400, Julien Desfossez wrote:
> With this macro, we propose new versions of the sched_switch, sched_waking,
> sched_process_fork and sched_pi_setprio tracepoint probes that contain more
> scheduling information and get rid of the "prio" field. We also add the PI
> information to these tracepoints, so if a process is currently boosted, we show
> the name and PID of the top waiter. This allows to quickly see the blocking
> chain even if some of the trace background is missing.
Urgh.. bigger mess than ever :-(
So I thought the initial idea was to provide a 'blocked-on' tracepoint,
along with the 'prio-changed' tracepoint, so you can reconstruct
the entire PI chain.
The only problem with that was initial state; when you start tracing (or
miss the start of a trace) it's hard (impossible) to know what the
current state is.
But now you send a patch-set that just adds a metric ton of tracepoints.
This doesn't fix the current mess, it makes it worse :-(
* Re: [RFC PATCH v2 0/5] Additional scheduling information in tracepoints
2016-09-26 12:27 ` [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Peter Zijlstra
@ 2016-09-26 19:37 ` Mathieu Desnoyers
0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2016-09-26 19:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Julien Desfossez, Thomas Gleixner, rostedt, Ingo Molnar, daolivei,
linux-kernel
----- On Sep 26, 2016, at 8:27 AM, Peter Zijlstra peterz@infradead.org wrote:
> On Fri, Sep 23, 2016 at 12:49:30PM -0400, Julien Desfossez wrote:
>> With this macro, we propose new versions of the sched_switch, sched_waking,
>> sched_process_fork and sched_pi_setprio tracepoint probes that contain more
>> scheduling information and get rid of the "prio" field. We also add the PI
>> information to these tracepoints, so if a process is currently boosted, we show
>> the name and PID of the top waiter. This allows to quickly see the blocking
>> chain even if some of the trace background is missing.
>
> Urgh.. bigger mess than ever :-(
>
> So I thought the initial idea was to provide a 'blocked-on' tracepoint,
> along with the 'prio-changed' tracepoint, so you can reconstruct
> the entire PI chain.
>
> The only problem with that was initial state; when you start tracing (or
> miss the start of a trace) it's hard (impossible) to know what the
> current state is.
>
> But now you send a patch-set that just adds a metric ton of tracepoints.
>
> This doesn't fix the current mess, it makes it worse :-(
There are actually four problems we try to tackle here with this
patchset:
1) Missing explicit priority change instrumentation
We're covering it by adding a new "sched_update_prio" callsite
and user-visible tracepoint.
2) Missing "blocked-on" information for PI
We're covering it by adding a new user-visible tracepoint to
the sched_pi_setprio callsite. The following fields provide
the blocked-on info:
top_waiter_comm, top_waiter_pid
We chose to add it in a new user-visible tracepoint rather
than the current sched_pi_setprio so the new event would not
expose the internal "prio" task struct field, which is an internal
implementation detail of the scheduler AFAIU, and could
go away eventually.
We could move those fields to the preexisting sched_pi_setprio
event if you prefer, but then we would have to keep the "oldprio"
and "newprio" fields forever.
I would not call this a "blocked-on" tracepoint, because it is
specific to PI. The general "blocking" concept implies blocking
on a resource (e.g. waitqueue), and we only know which PID we
were waiting for when we are later awakened. In the PI case,
we know which PID owns the resource we are blocked on.
3) Missing deadline scheduler instrumentation
We understood that exposing "prio" really does not cover the
deadline scheduler, as is clearly pointed out in your patchset.
We have added deadline scheduler info to a new set of user-visible
tracepoints, which are connected to the pre-existing tracepoint
callsites in the scheduler. We're therefore not "adding" scheduler
tracepoints in the fast-path source-code wise. We have named those
alternative versions with a "_prio" suffix.
4) Missing initial state
The sched_switch_prio tracepoint deals with the problem of missing
initial state: it's a tracepoint that occurs periodically, and we
can therefore get the initial state of a running thread when it
is scheduled. We chose to present it as an alternative tracepoint
from a user POV because we did not want to bloat the current
sched_switch event with lots of extra fields when prio information
is not needed.
Do you recommend that we bring those new extra fields into the
pre-existing tracepoints instead, even considering the extra
bloat?
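The initial-state recovery described in point 4 can be sketched on the analysis side: every event that carries the full attribute set refreshes a per-PID table, so the state of each running thread becomes known shortly after tracing starts even without the beginning of the trace. (Event and field names follow this patchset; the event stream below is fabricated for illustration.)

```python
# Per-PID scheduling attributes reconstructed from a trace. Any event that
# carries the full attribute set (sched_update_prio, or the fields of the
# next thread in sched_switch_prio) refreshes our knowledge of that PID.
tasks = {}

def handle_event(name, fields):
    if name in ("sched_update_prio", "sched_switch_prio"):
        tasks[fields["pid"]] = {
            "policy": fields["policy"],
            "nice": fields["nice"],
            "rt_priority": fields["rt_priority"],
        }

# Fabricated stream: tracing starts mid-run, so pid 42's state is unknown
# until its first sched_switch_prio, then updated by an explicit chrt.
stream = [
    ("sched_switch_prio", {"pid": 42, "policy": "SCHED_NORMAL",
                           "nice": 0, "rt_priority": 0}),
    ("sched_update_prio", {"pid": 42, "policy": "SCHED_FIFO",
                           "nice": 0, "rt_priority": 60}),
]
for name, fields in stream:
    handle_event(name, fields)

print(tasks[42]["policy"], tasks[42]["rt_priority"])  # SCHED_FIFO 60
```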
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Thread overview: 9+ messages
2016-09-23 16:49 [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 1/5] sched: get effective policy and rt_prio Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 2/5] tracing: add TRACE_EVENT_MAP Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 3/5] tracing: extend scheduling tracepoints Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 4/5] tracing: extend sched_pi_setprio Julien Desfossez
2016-09-23 16:49 ` [RFC PATCH v2 5/5] tracing: add sched_update_prio Julien Desfossez
2016-09-24 13:28 ` Mathieu Desnoyers
2016-09-26 12:27 ` [RFC PATCH v2 0/5] Additional scheduling information in tracepoints Peter Zijlstra
2016-09-26 19:37 ` Mathieu Desnoyers