linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] report task state in symbolic chars in sched tracepoints
@ 2023-07-25  7:22 Ze Gao
  2023-07-25  7:22 ` [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead Ze Gao
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ze Gao @ 2023-07-25  7:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Namhyung Kim,
	Ian Rogers, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch.  And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, instead of choosing
to sync the userspace tracing tools with the latest task states constants
mapping, I proposed to replace reported task state in sched_switch with
a symbolic character, and save the further processing of userspace tools
and spare them from knowing further implementation details in the kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.

Reviews welcome!

Regards,

Ze

Ze Gao (2):
  sched, tracing: report task state in symbolic chars instead
  perf sched: sync with latest sched_switch tracepoint definition

 include/trace/events/sched.h | 41 ++++++++++-----------------
 tools/perf/builtin-sched.c   | 55 ++++++------------------------------
 2 files changed, 24 insertions(+), 72 deletions(-)

Ze Gao (1):
  libtraceevent: sync with latest sched_switch tracepoint definition

 plugins/plugin_sched_switch.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25  7:22 [RFC PATCH 0/3] report task state in symbolic chars in sched tracepoints Ze Gao
@ 2023-07-25  7:22 ` Ze Gao
  2023-07-25  8:33   ` Peter Zijlstra
  2023-07-25 17:59   ` Steven Rostedt
  2023-07-25  7:22 ` [RFC PATCH 2/3] perf sched: sync with latest sched_switch tracepoint definition Ze Gao
  2023-07-25  7:22 ` [RFC PATCH 3/3] libtraceevent: " Ze Gao
  2 siblings, 2 replies; 13+ messages in thread
From: Ze Gao @ 2023-07-25  7:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Namhyung Kim,
	Ian Rogers, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Internal representations of task state are likely to be changed or ordered,
and reporting them to userspace without exporting them as part of API is not
a good choice, which can easily break a userspace observability tool as kernel
evolves. For example, perf suffers from this and still reports wrong states
by this patch.

OTOH, some masqueraded state like TASK_REPORT_IDLE and TASK_REPORT_MAX are
also reported inadvertently, which confuses things even more.

So report task state in symbolic char instead, which is self-explaining and
no further translation is needed.

Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
conventions for the rest.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 include/trace/events/sched.h | 41 +++++++++++++-----------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..51102a7c0c2d 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -6,6 +6,7 @@
 #define _TRACE_SCHED_H
 
 #include <linux/kthread.h>
+#include <linux/sched.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
@@ -187,7 +188,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt,
+static inline const char __trace_sched_switch_state(bool preempt,
 					      unsigned int prev_state,
 					      struct task_struct *p)
 {
@@ -197,13 +198,6 @@ static inline long __trace_sched_switch_state(bool preempt,
 	BUG_ON(p != current);
 #endif /* CONFIG_SCHED_DEBUG */
 
-	/*
-	 * Preemption ignores task state, therefore preempted tasks are always
-	 * RUNNING (we will not have dequeued if state != RUNNING).
-	 */
-	if (preempt)
-		return TASK_REPORT_MAX;
-
 	/*
 	 * task_state_index() uses fls() and returns a value from 0-8 range.
 	 * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
@@ -212,7 +206,16 @@ static inline long __trace_sched_switch_state(bool preempt,
 	 */
 	state = __task_state_index(prev_state, p->exit_state);
 
-	return state ? (1 << (state - 1)) : state;
+	/*
+	 * Preemption ignores task state, therefore preempted tasks are always
+	 * RUNNING (we will not have dequeued if state != RUNNING).
+	 * Here, we use 'p' to denote this case and only for this case.
+	 */
+	if (preempt)
+		return 'p';
+
+
+	return task_index_to_char(state);
 }
 #endif /* CREATE_TRACE_POINTS */
 
@@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
 		__field(	int,	prev_prio			)
-		__field(	long,	prev_state			)
+		__field(	char,	prev_state			)
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
 		__field(	int,	next_prio			)
@@ -249,22 +252,8 @@ TRACE_EVENT(sched_switch,
 		/* XXX SCHED_DEADLINE */
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
-		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
-		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
-		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
-				{ TASK_INTERRUPTIBLE, "S" },
-				{ TASK_UNINTERRUPTIBLE, "D" },
-				{ __TASK_STOPPED, "T" },
-				{ __TASK_TRACED, "t" },
-				{ EXIT_DEAD, "X" },
-				{ EXIT_ZOMBIE, "Z" },
-				{ TASK_PARKED, "P" },
-				{ TASK_DEAD, "I" }) :
-		  "R",
-
-		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state,
 		__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/3] perf sched: sync with latest sched_switch tracepoint definition
  2023-07-25  7:22 [RFC PATCH 0/3] report task state in symbolic chars in sched tracepoints Ze Gao
  2023-07-25  7:22 ` [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead Ze Gao
@ 2023-07-25  7:22 ` Ze Gao
  2023-07-25  7:22 ` [RFC PATCH 3/3] libtraceevent: " Ze Gao
  2 siblings, 0 replies; 13+ messages in thread
From: Ze Gao @ 2023-07-25  7:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Namhyung Kim,
	Ian Rogers, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Since tracepoint sched_switch changes its reported task state type,
update the parsing logic accordingly.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 55 +++++++-------------------------------
 1 file changed, 9 insertions(+), 46 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index cc4ba506e119..b92f376fdd9a 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,24 +92,6 @@ struct sched_atom {
 	struct task_desc	*wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
-
-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING		0
-#define TASK_INTERRUPTIBLE	1
-#define TASK_UNINTERRUPTIBLE	2
-#define __TASK_STOPPED		4
-#define __TASK_TRACED		8
-/* in tsk->exit_state */
-#define EXIT_DEAD		16
-#define EXIT_ZOMBIE		32
-#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD		64
-#define TASK_WAKEKILL		128
-#define TASK_WAKING		256
-#define TASK_PARKED		512
-
 enum thread_state {
 	THREAD_SLEEPING = 0,
 	THREAD_WAIT_CPU,
@@ -266,7 +248,7 @@ struct thread_runtime {
 	u64 total_preempt_time;
 	u64 total_delay_time;
 
-	int last_state;
+	char last_state;
 
 	char shortname[3];
 	bool comm_changed;
@@ -860,7 +842,7 @@ static int replay_switch_event(struct perf_sched *sched,
 		   *next_comm  = evsel__strval(evsel, sample, "next_comm");
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = evsel__intval(evsel, sample, "prev_state");
 	struct task_desc *prev, __maybe_unused *next;
 	u64 timestamp0, timestamp = sample->time;
 	int cpu = sample->cpu;
@@ -1050,13 +1032,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
 	return 0;
 }
 
-static char sched_out_state(u64 prev_state)
-{
-	const char *str = TASK_STATE_TO_CHAR_STR;
-
-	return str[prev_state];
-}
-
 static int
 add_sched_out_event(struct work_atoms *atoms,
 		    char run_state,
@@ -1132,7 +1107,7 @@ static int latency_switch_event(struct perf_sched *sched,
 {
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = evsel__intval(evsel, sample, "prev_state");
 	struct work_atoms *out_events, *in_events;
 	struct thread *sched_out, *sched_in;
 	u64 timestamp0, timestamp = sample->time;
@@ -1168,7 +1143,7 @@ static int latency_switch_event(struct perf_sched *sched,
 			goto out_put;
 		}
 	}
-	if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+	if (add_sched_out_event(out_events, prev_state, timestamp))
 		return -1;
 
 	in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2026,24 +2001,12 @@ static void timehist_header(struct perf_sched *sched)
 	printf("\n");
 }
 
-static char task_state_char(struct thread *thread, int state)
-{
-	static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
-	unsigned bit = state ? ffs(state) : 0;
-
-	/* 'I' for idle */
-	if (thread->tid == 0)
-		return 'I';
-
-	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
 static void timehist_print_sample(struct perf_sched *sched,
 				  struct evsel *evsel,
 				  struct perf_sample *sample,
 				  struct addr_location *al,
 				  struct thread *thread,
-				  u64 t, int state)
+				  u64 t, char state)
 {
 	struct thread_runtime *tr = thread__priv(thread);
 	const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2084,7 +2047,7 @@ static void timehist_print_sample(struct perf_sched *sched,
 	print_sched_time(tr->dt_run, 6);
 
 	if (sched->show_state)
-		printf(" %5c ", task_state_char(thread, state));
+		printf(" %5c ", thread->tid == 0 ? 'I' : state);
 
 	if (sched->show_next) {
 		snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2156,9 +2119,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
 		else if (r->last_time) {
 			u64 dt_wait = tprev - r->last_time;
 
-			if (r->last_state == TASK_RUNNING)
+			if (r->last_state == 'R' || r->last_state == 'p')
 				r->dt_preempt = dt_wait;
-			else if (r->last_state == TASK_UNINTERRUPTIBLE)
+			else if (r->last_state == 'D')
 				r->dt_iowait = dt_wait;
 			else
 				r->dt_sleep = dt_wait;
@@ -2575,7 +2538,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
 	struct thread_runtime *tr = NULL;
 	u64 tprev, t = sample->time;
 	int rc = 0;
-	int state = evsel__intval(evsel, sample, "prev_state");
+	char state = evsel__intval(evsel, sample, "prev_state");
 
 	if (machine__resolve(machine, &al, sample) < 0) {
 		pr_err("problem processing %d event. skipping it\n",
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 3/3] libtraceevent: sync with latest sched_switch tracepoint definition
  2023-07-25  7:22 [RFC PATCH 0/3] report task state in symbolic chars in sched tracepoints Ze Gao
  2023-07-25  7:22 ` [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead Ze Gao
  2023-07-25  7:22 ` [RFC PATCH 2/3] perf sched: sync with latest sched_switch tracepoint definition Ze Gao
@ 2023-07-25  7:22 ` Ze Gao
  2 siblings, 0 replies; 13+ messages in thread
From: Ze Gao @ 2023-07-25  7:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Steven Rostedt, Namhyung Kim,
	Ian Rogers, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel, Ze Gao

Since tracepoint sched_switch changes its reported task state type,
update the parsing logic accordingly.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 plugins/plugin_sched_switch.c | 23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index 8752cae..37c1be2 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -9,27 +9,6 @@
 #include "event-parse.h"
 #include "trace-seq.h"
 
-static void write_state(struct trace_seq *s, int val)
-{
-	const char states[] = "SDTtZXxW";
-	int found = 0;
-	int i;
-
-	for (i = 0; i < (sizeof(states) - 1); i++) {
-		if (!(val & (1 << i)))
-			continue;
-
-		if (found)
-			trace_seq_putc(s, '|');
-
-		found = 1;
-		trace_seq_putc(s, states[i]);
-	}
-
-	if (!found)
-		trace_seq_putc(s, 'R');
-}
-
 static void write_and_save_comm(struct tep_format_field *field,
 				struct tep_record *record,
 				struct trace_seq *s, int pid)
@@ -100,7 +79,7 @@ static int sched_switch_handler(struct trace_seq *s,
 		trace_seq_printf(s, "[%d] ", (int) val);
 
 	if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
-		write_state(s, val);
+		trace_seq_putc(s, (char) val);
 
 	trace_seq_puts(s, " ==> ");
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25  7:22 ` [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead Ze Gao
@ 2023-07-25  8:33   ` Peter Zijlstra
  2023-07-25 10:53     ` Ze Gao
  2023-07-25 17:59   ` Steven Rostedt
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-07-25  8:33 UTC (permalink / raw)
  To: Ze Gao
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, Jul 25, 2023 at 03:22:52PM +0800, Ze Gao wrote:

> @@ -197,13 +198,6 @@ static inline long __trace_sched_switch_state(bool preempt,
>  	BUG_ON(p != current);
>  #endif /* CONFIG_SCHED_DEBUG */
>  
> -	/*
> -	 * Preemption ignores task state, therefore preempted tasks are always
> -	 * RUNNING (we will not have dequeued if state != RUNNING).
> -	 */
> -	if (preempt)
> -		return TASK_REPORT_MAX;
> -
>  	/*
>  	 * task_state_index() uses fls() and returns a value from 0-8 range.
>  	 * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
> @@ -212,7 +206,16 @@ static inline long __trace_sched_switch_state(bool preempt,
>  	 */
>  	state = __task_state_index(prev_state, p->exit_state);
>  
> -	return state ? (1 << (state - 1)) : state;
> +	/*
> +	 * Preemption ignores task state, therefore preempted tasks are always
> +	 * RUNNING (we will not have dequeued if state != RUNNING).
> +	 * Here, we use 'p' to denote this case and only for this case.
> +	 */
> +	if (preempt)
> +		return 'p';
> +

I don't get this move, why compute state before this return?

> +
> +	return task_index_to_char(state);
>  }
>  #endif /* CREATE_TRACE_POINTS */
>  
> @@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
>  		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
>  		__field(	int,	prev_prio			)
> -		__field(	long,	prev_state			)
> +		__field(	char,	prev_state			)
>  		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
>  		__field(	int,	next_prio			)

This is a format change and will likely break a ton of programs :/


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25  8:33   ` Peter Zijlstra
@ 2023-07-25 10:53     ` Ze Gao
  2023-07-25 13:31       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Ze Gao @ 2023-07-25 10:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, Jul 25, 2023 at 4:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 25, 2023 at 03:22:52PM +0800, Ze Gao wrote:
>
> > @@ -197,13 +198,6 @@ static inline long __trace_sched_switch_state(bool preempt,
> >       BUG_ON(p != current);
> >  #endif /* CONFIG_SCHED_DEBUG */
> >
> > -     /*
> > -      * Preemption ignores task state, therefore preempted tasks are always
> > -      * RUNNING (we will not have dequeued if state != RUNNING).
> > -      */
> > -     if (preempt)
> > -             return TASK_REPORT_MAX;
> > -
> >       /*
> >        * task_state_index() uses fls() and returns a value from 0-8 range.
> >        * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
> > @@ -212,7 +206,16 @@ static inline long __trace_sched_switch_state(bool preempt,
> >        */
> >       state = __task_state_index(prev_state, p->exit_state);
> >
> > -     return state ? (1 << (state - 1)) : state;
> > +     /*
> > +      * Preemption ignores task state, therefore preempted tasks are always
> > +      * RUNNING (we will not have dequeued if state != RUNNING).
> > +      * Here, we use 'p' to denote this case and only for this case.
> > +      */
> > +     if (preempt)
> > +             return 'p';
> > +
>
> I don't get this move, why compute state before this return?

Oops,  I was going to ignore the PREEMP_ACTIVE in the first attempt
and changed it to 'state = 0'
which I decided to introduce 'p' to denote this after second thoughts.
Will fix it and revert this move.

> > +
> > +     return task_index_to_char(state);
> >  }
> >  #endif /* CREATE_TRACE_POINTS */
> >
> > @@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
> >               __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> >               __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > +             __field(        char,   prev_state                      )
> >               __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> >               __field(        int,    next_prio                       )
>
> This is a format change and will likely break a ton of programs :/

Yeah,  I admit that.  And I believe this kind of break happens each
time the internal
task state constant mapping is rearranged, it's of no big difference
here, since the
most renowned perf itself is still broken at this time after.  And
IMHO it's time to fix
this and do things correctly.  That is why I propose this and mark it as RFC.

BTW, could you help to point to any possible tools/programs that would
break other
than perf/libtraceevent, because these two are the only users I run
into so far.

Regards,
Ze

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25 10:53     ` Ze Gao
@ 2023-07-25 13:31       ` Peter Zijlstra
  2023-07-25 20:13         ` Steven Rostedt
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-07-25 13:31 UTC (permalink / raw)
  To: Ze Gao
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, Jul 25, 2023 at 06:53:07PM +0800, Ze Gao wrote:

> > > @@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
> > >               __array(        char,   prev_comm,      TASK_COMM_LEN   )
> > >               __field(        pid_t,  prev_pid                        )
> > >               __field(        int,    prev_prio                       )
> > > -             __field(        long,   prev_state                      )
> > > +             __field(        char,   prev_state                      )
> > >               __array(        char,   next_comm,      TASK_COMM_LEN   )
> > >               __field(        pid_t,  next_pid                        )
> > >               __field(        int,    next_prio                       )
> >
> > This is a format change and will likely break a ton of programs :/
> 

> BTW, could you help to point to any possible tools/programs that would
> break other than perf/libtraceevent, because these two are the only
> users I run into so far.

Latencytop was the one breaking a few years ago, but there's a metric
ton of sched_switch users out there, this is bound to generate pain.

Steve, you remember what the status of all this was? at the time
breaking this was considered on par with ABI breakage and we reverted or
something. Is this still so?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25  7:22 ` [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead Ze Gao
  2023-07-25  8:33   ` Peter Zijlstra
@ 2023-07-25 17:59   ` Steven Rostedt
  2023-07-25 20:16     ` Peter Zijlstra
  2023-07-26  2:41     ` Ze Gao
  1 sibling, 2 replies; 13+ messages in thread
From: Steven Rostedt @ 2023-07-25 17:59 UTC (permalink / raw)
  To: Ze Gao
  Cc: Ingo Molnar, Peter Zijlstra, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, 25 Jul 2023 15:22:52 +0800
Ze Gao <zegao2021@gmail.com> wrote:

>  #ifdef CREATE_TRACE_POINTS
> -static inline long __trace_sched_switch_state(bool preempt,
> +static inline const char __trace_sched_switch_state(bool preempt,

Does it really need to be "const"?

>  					      unsigned int prev_state,
>  					      struct task_struct *p)
>  {
> @@ -197,13 +198,6 @@ static inline long __trace_sched_switch_state(bool preempt,
>  	BUG_ON(p != current);
>  #endif /* CONFIG_SCHED_DEBUG */
>  
> -	/*
> -	 * Preemption ignores task state, therefore preempted tasks are always
> -	 * RUNNING (we will not have dequeued if state != RUNNING).
> -	 */
> -	if (preempt)
> -		return TASK_REPORT_MAX;
> -
>  	/*
>  	 * task_state_index() uses fls() and returns a value from 0-8 range.
>  	 * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
> @@ -212,7 +206,16 @@ static inline long __trace_sched_switch_state(bool preempt,
>  	 */
>  	state = __task_state_index(prev_state, p->exit_state);
>  
> -	return state ? (1 << (state - 1)) : state;
> +	/*
> +	 * Preemption ignores task state, therefore preempted tasks are always
> +	 * RUNNING (we will not have dequeued if state != RUNNING).
> +	 * Here, we use 'p' to denote this case and only for this case.
> +	 */
> +	if (preempt)
> +		return 'p';
> +
> +
> +	return task_index_to_char(state);
>  }
>  #endif /* CREATE_TRACE_POINTS */
>  
> @@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
>  		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
>  		__field(	int,	prev_prio			)
> -		__field(	long,	prev_state			)
> +		__field(	char,	prev_state			)
>  		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
>  		__field(	int,	next_prio			)

This will break userspace. Just because you update libtraceevent
doesn't mean that it will get to all the users of it. There's still
tools that have the old method hard coded and doesn't use the library.

Now, because the old tools still do the parsing of this format, we can
add a new field called prev_state_char that will give you this. Now to
save space, we should change prev_state to int (can't make it short as
there's that test for "+" which does sometimes happen). I believe we
can make prev_prio and next prio into shorts (and possibly chars!).

-- Steve


> @@ -249,22 +252,8 @@ TRACE_EVENT(sched_switch,
>  		/* XXX SCHED_DEADLINE */
>  	),
>  
> -	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
> -		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> -
> -		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> -		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> -				{ TASK_INTERRUPTIBLE, "S" },
> -				{ TASK_UNINTERRUPTIBLE, "D" },
> -				{ __TASK_STOPPED, "T" },
> -				{ __TASK_TRACED, "t" },
> -				{ EXIT_DEAD, "X" },
> -				{ EXIT_ZOMBIE, "Z" },
> -				{ TASK_PARKED, "P" },
> -				{ TASK_DEAD, "I" }) :
> -		  "R",
> -
> -		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
> +	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
> +		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state,
>  		__entry->next_comm, __entry->next_pid, __entry->next_prio)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25 13:31       ` Peter Zijlstra
@ 2023-07-25 20:13         ` Steven Rostedt
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2023-07-25 20:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ze Gao, Ingo Molnar, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, 25 Jul 2023 15:31:00 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > > This is a format change and will likely break a ton of programs :/  
> >   
> 
> > BTW, could you help to point to any possible tools/programs that would
> > break other than perf/libtraceevent, because these two are the only
> > users I run into so far.  
> 
> Latencytop was the one breaking a few years ago, but there's a metric
> ton of sched_switch users out there, this is bound to generate pain.
> 
> Steve, you remember what the status of all this was? at the time
> breaking this was considered on par with ABI breakage and we reverted or
> something. Is this still so?

I did reply to the patch (before I switched to thread mode, and notice
that there were already replies ;-)

Pretty much all the tools have been switched over to libtraceevent, but
some just copied the code internally. Although, newer code (like
rasdaemon) are moving over to the newer shared library. Anyway, I did
post a possibly solution in my other email.

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25 17:59   ` Steven Rostedt
@ 2023-07-25 20:16     ` Peter Zijlstra
  2023-07-25 21:55       ` Steven Rostedt
  2023-07-26  2:41     ` Ze Gao
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-07-25 20:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ze Gao, Ingo Molnar, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, Jul 25, 2023 at 01:59:38PM -0400, Steven Rostedt wrote:
> > @@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
> >  		__array(	char,	prev_comm,	TASK_COMM_LEN	)
> >  		__field(	pid_t,	prev_pid			)
> >  		__field(	int,	prev_prio			)
> > -		__field(	long,	prev_state			)
> > +		__field(	char,	prev_state			)
> >  		__array(	char,	next_comm,	TASK_COMM_LEN	)
> >  		__field(	pid_t,	next_pid			)
> >  		__field(	int,	next_prio			)
> 
> This will break userspace. Just because you update libtraceevent
> doesn't mean that it will get to all the users of it. There's still
> tools that have the old method hard coded and doesn't use the library.
> 
> Now, because the old tools still do the parsing of this format, we can
> add a new field called prev_state_char that will give you this. Now to
> save space, we should change prev_state to int (can't make it short as
> there's that test for "+" which does sometimes happen). I believe we
> can make prev_prio and next prio into shorts (and possibly chars!).

Or just leave the thing alone?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25 20:16     ` Peter Zijlstra
@ 2023-07-25 21:55       ` Steven Rostedt
  2023-07-26  2:48         ` Ze Gao
  0 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2023-07-25 21:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ze Gao, Ingo Molnar, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Tue, 25 Jul 2023 22:16:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Now, because the old tools still do the parsing of this format, we can
> > add a new field called prev_state_char that will give you this. Now to
> > save space, we should change prev_state to int (can't make it short as
> > there's that test for "+" which does sometimes happen). I believe we
> > can make prev_prio and next prio into shorts (and possibly chars!).  
> 
> Or just leave the thing alone?

Sure, but I would like to change the fields to smaller ones just so
that the event wastes less space on the buffer.

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25 17:59   ` Steven Rostedt
  2023-07-25 20:16     ` Peter Zijlstra
@ 2023-07-26  2:41     ` Ze Gao
  1 sibling, 0 replies; 13+ messages in thread
From: Ze Gao @ 2023-07-26  2:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

On Wed, Jul 26, 2023 at 1:59 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 25 Jul 2023 15:22:52 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> >  #ifdef CREATE_TRACE_POINTS
> > -static inline long __trace_sched_switch_state(bool preempt,
> > +static inline const char __trace_sched_switch_state(bool preempt,
>
> Does it really need to be "const"?

Not really since it's rvalue after all, will remove this.

> >                                             unsigned int prev_state,
> >                                             struct task_struct *p)
> >  {
> > @@ -197,13 +198,6 @@ static inline long __trace_sched_switch_state(bool preempt,
> >       BUG_ON(p != current);
> >  #endif /* CONFIG_SCHED_DEBUG */
> >
> > -     /*
> > -      * Preemption ignores task state, therefore preempted tasks are always
> > -      * RUNNING (we will not have dequeued if state != RUNNING).
> > -      */
> > -     if (preempt)
> > -             return TASK_REPORT_MAX;
> > -
> >       /*
> >        * task_state_index() uses fls() and returns a value from 0-8 range.
> >        * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
> > @@ -212,7 +206,16 @@ static inline long __trace_sched_switch_state(bool preempt,
> >        */
> >       state = __task_state_index(prev_state, p->exit_state);
> >
> > -     return state ? (1 << (state - 1)) : state;
> > +     /*
> > +      * Preemption ignores task state, therefore preempted tasks are always
> > +      * RUNNING (we will not have dequeued if state != RUNNING).
> > +      * Here, we use 'p' to denote this case and only for this case.
> > +      */
> > +     if (preempt)
> > +             return 'p';
> > +
> > +
> > +     return task_index_to_char(state);
> >  }
> >  #endif /* CREATE_TRACE_POINTS */
> >
> > @@ -232,7 +235,7 @@ TRACE_EVENT(sched_switch,
> >               __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> >               __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > +             __field(        char,   prev_state                      )
> >               __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> >               __field(        int,    next_prio                       )
>
> This will break userspace. Just because you update libtraceevent
> doesn't mean that it will get to all the users of it. There's still
> tools that have the old method hard coded and doesn't use the library.
>
> Now, because the old tools still do the parsing of this format, we can
> add a new field called prev_state_char that will give you this. Now to
> save space, we should change prev_state to int (can't make it short as
> there's that test for "+" which does sometimes happen). I believe we
> can make prev_prio and next prio into shorts (and possibly chars!).
>
> -- Steve
>

Agreed, and my first attempt was to not break anything here by
introducing a new "char",  which is kinda redundant, but is a good way
for smooth transition to decouple userspace from kernel on this weird
design(of ABI).

However, given the fact the in-tree perf still suffers this, I doubt many
other tools could spare from it. And the best guess is that they are
still partially broken silently and reports wrong and confusing states.
And this is basically why I came up with this RFC and deliberately
let them fail loudly to do a painful but quick recovery.

Anyway, if we want to keep status quo, I can send over another two
patches to fix perf and libtraceevent specifically and leave the tracepoint
alone. But seriously, imho we might have no chance to correct this.
and without a formal abi, the raw value (let alone the masqueraded ones)
is just meaningless for userspace and that leaves it intended for kernel
developers only.

Thanks,
Ze

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead
  2023-07-25 21:55       ` Steven Rostedt
@ 2023-07-26  2:48         ` Ze Gao
  0 siblings, 0 replies; 13+ messages in thread
From: Ze Gao @ 2023-07-26  2:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Namhyung Kim, Ian Rogers,
	Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Adrian Hunter, Alexander Shishkin, linux-kernel,
	linux-perf-users, linux-trace-kernel, Ze Gao

I'd like to try this one, and request for another RFC.

Regards,
Ze

On Wed, Jul 26, 2023 at 5:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 25 Jul 2023 22:16:03 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Now, because the old tools still do the parsing of this format, we can
> > > add a new field called prev_state_char that will give you this. Now to
> > > save space, we should change prev_state to int (can't make it short as
> > > there's that test for "+" which does sometimes happen). I believe we
> > > can make prev_prio and next prio into shorts (and possibly chars!).
> >
> > Or just leave the thing alone?
>
> Sure, but I would like to change the fields to smaller ones just so
> that the event wastes less space on the buffer.
>
> -- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-07-26  2:48 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-25  7:22 [RFC PATCH 0/3] report task state in symbolic chars in sched tracepoints Ze Gao
2023-07-25  7:22 ` [RFC PATCH 1/3] sched, tracing: report task state in symbolic chars instead Ze Gao
2023-07-25  8:33   ` Peter Zijlstra
2023-07-25 10:53     ` Ze Gao
2023-07-25 13:31       ` Peter Zijlstra
2023-07-25 20:13         ` Steven Rostedt
2023-07-25 17:59   ` Steven Rostedt
2023-07-25 20:16     ` Peter Zijlstra
2023-07-25 21:55       ` Steven Rostedt
2023-07-26  2:48         ` Ze Gao
2023-07-26  2:41     ` Ze Gao
2023-07-25  7:22 ` [RFC PATCH 2/3] perf sched: sync with latest sched_switch tracepoint definition Ze Gao
2023-07-25  7:22 ` [RFC PATCH 3/3] libtraceevent: " Ze Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).