linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint
@ 2023-08-01  9:01 Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 1/6] perf sched: sync state char array with the kernel Ze Gao
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

This is the 3rd attempt to fix the report task state issue in sched
tracepint, you can check out previous discussions here:

v1: https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
v2: https://lore.kernel.org/linux-trace-kernel/20230726121618.19198-1-zegao@tencent.com

FYI, this series are designed not to break anything now and still do the 
1-1 correspondence int-char mapping for each distinct task state we want to
report, and thus will not lose any details intended for debug purposes. Of
course, this might be compromised because of bugs introduced due to my
stupidity. So your sage comments are very important and appreciated!


diff from v2:
1. reorder to condense sched_switch record structure
2. fallback to older method to maintain backward compatibility
   for perf/libtraceevent
3. split housekeeping work into separate ones for readability

--

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch.  And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, I proposed to add
a new variable to report task state in sched_switch with a symbolic char
along with the old hardcoded value, and save the further processing of
userspace tools and spare them from knowing implementation details in the
kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint
only.

Reviews welcome!

Regards,
Ze

Ze Gao (5):
  perf sched: sync state char array with the kernel
  perf sched: reorganize sched-out task state report code
  sched, tracing: add to report task state in symbolic chars
  sched, tracing: reorganize fields of switch event struct
  perf sched: prefer to use prev_state_char introduced in sched_switch

 include/trace/events/sched.h | 68 +++++++++++++++++-------------
 tools/perf/builtin-sched.c   | 82 ++++++++++++++++--------------------
 2 files changed, 76 insertions(+), 74 deletions(-)

Ze Gao (1):
  libtraceevent: prefer to use prev_state_char introduced in
    sched_switch

 plugins/plugin_sched_switch.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 1/6] perf sched: sync state char array with the kernel
  2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
@ 2023-08-01  9:01 ` Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 2/6] perf sched: reorganize sched-out task state report code Ze Gao
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

update state char array and then remove unused and stale
macros, which are kernel internal representations and not
encouraged to use anymore.

this fix is for old kernels, and we will steer to newly
added symbolic char to report task state but also maintain
compatibility to not break anything.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index cc4ba506e119..c0d0ad18fc76 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,23 +92,12 @@ struct sched_atom {
 	struct task_desc	*wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
 
 /* task state bitmask, copied from include/linux/sched.h */
 #define TASK_RUNNING		0
 #define TASK_INTERRUPTIBLE	1
 #define TASK_UNINTERRUPTIBLE	2
-#define __TASK_STOPPED		4
-#define __TASK_TRACED		8
-/* in tsk->exit_state */
-#define EXIT_DEAD		16
-#define EXIT_ZOMBIE		32
-#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD		64
-#define TASK_WAKEKILL		128
-#define TASK_WAKING		256
-#define TASK_PARKED		512
 
 enum thread_state {
 	THREAD_SLEEPING = 0,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 2/6] perf sched: reorganize sched-out task state report code
  2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 1/6] perf sched: sync state char array with the kernel Ze Gao
@ 2023-08-01  9:01 ` Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars Ze Gao
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

this patch mainly does housekeeping work and not
introduce any functional change.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 57 ++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index c0d0ad18fc76..275da655b67a 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -94,11 +94,6 @@ struct sched_atom {
 
 #define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
 
-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING		0
-#define TASK_INTERRUPTIBLE	1
-#define TASK_UNINTERRUPTIBLE	2
-
 enum thread_state {
 	THREAD_SLEEPING = 0,
 	THREAD_WAIT_CPU,
@@ -255,7 +250,7 @@ struct thread_runtime {
 	u64 total_preempt_time;
 	u64 total_delay_time;
 
-	int last_state;
+	char last_state;
 
 	char shortname[3];
 	bool comm_changed;
@@ -425,7 +420,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
 }
 
 static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
-				  u64 timestamp, u64 task_state __maybe_unused)
+				  u64 timestamp, char task_state __maybe_unused)
 {
 	struct sched_atom *event = get_new_event(task, timestamp);
 
@@ -840,6 +835,20 @@ replay_wakeup_event(struct perf_sched *sched,
 	return 0;
 }
 
+static inline char task_state_char(int state)
+{
+	static const char state_to_char[] = "RSDTtXZPI";
+	unsigned bit = state ? ffs(state) : 0;
+	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
+}
+
+static inline char get_task_prev_state(struct evsel *evsel,
+				       struct perf_sample *sample)
+{
+	const int prev_state = evsel__intval(evsel, sample, "prev_state");
+	return task_state_char(prev_state);
+}
+
 static int replay_switch_event(struct perf_sched *sched,
 			       struct evsel *evsel,
 			       struct perf_sample *sample,
@@ -849,7 +858,7 @@ static int replay_switch_event(struct perf_sched *sched,
 		   *next_comm  = evsel__strval(evsel, sample, "next_comm");
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = get_task_prev_state(evsel, sample);
 	struct task_desc *prev, __maybe_unused *next;
 	u64 timestamp0, timestamp = sample->time;
 	int cpu = sample->cpu;
@@ -1039,12 +1048,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
 	return 0;
 }
 
-static char sched_out_state(u64 prev_state)
-{
-	const char *str = TASK_STATE_TO_CHAR_STR;
-
-	return str[prev_state];
-}
 
 static int
 add_sched_out_event(struct work_atoms *atoms,
@@ -1121,7 +1124,7 @@ static int latency_switch_event(struct perf_sched *sched,
 {
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = get_task_prev_state(evsel, sample);
 	struct work_atoms *out_events, *in_events;
 	struct thread *sched_out, *sched_in;
 	u64 timestamp0, timestamp = sample->time;
@@ -1157,7 +1160,7 @@ static int latency_switch_event(struct perf_sched *sched,
 			goto out_put;
 		}
 	}
-	if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+	if (add_sched_out_event(out_events, prev_state, timestamp))
 		return -1;
 
 	in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2015,24 +2018,12 @@ static void timehist_header(struct perf_sched *sched)
 	printf("\n");
 }
 
-static char task_state_char(struct thread *thread, int state)
-{
-	static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
-	unsigned bit = state ? ffs(state) : 0;
-
-	/* 'I' for idle */
-	if (thread->tid == 0)
-		return 'I';
-
-	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
 static void timehist_print_sample(struct perf_sched *sched,
 				  struct evsel *evsel,
 				  struct perf_sample *sample,
 				  struct addr_location *al,
 				  struct thread *thread,
-				  u64 t, int state)
+				  u64 t, char state)
 {
 	struct thread_runtime *tr = thread__priv(thread);
 	const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2073,7 +2064,7 @@ static void timehist_print_sample(struct perf_sched *sched,
 	print_sched_time(tr->dt_run, 6);
 
 	if (sched->show_state)
-		printf(" %5c ", task_state_char(thread, state));
+		printf(" %5c ", thread->tid == 0 ? 'I' : state);
 
 	if (sched->show_next) {
 		snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2145,9 +2136,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
 		else if (r->last_time) {
 			u64 dt_wait = tprev - r->last_time;
 
-			if (r->last_state == TASK_RUNNING)
+			if (r->last_state == 'R' || r->last_state == 'p')
 				r->dt_preempt = dt_wait;
-			else if (r->last_state == TASK_UNINTERRUPTIBLE)
+			else if (r->last_state == 'D')
 				r->dt_iowait = dt_wait;
 			else
 				r->dt_sleep = dt_wait;
@@ -2564,7 +2555,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
 	struct thread_runtime *tr = NULL;
 	u64 tprev, t = sample->time;
 	int rc = 0;
-	int state = evsel__intval(evsel, sample, "prev_state");
+	char state = get_task_prev_state(evsel, sample);
 
 	if (machine__resolve(machine, &al, sample) < 0) {
 		pr_err("problem processing %d event. skipping it\n",
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 1/6] perf sched: sync state char array with the kernel Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 2/6] perf sched: reorganize sched-out task state report code Ze Gao
@ 2023-08-01  9:01 ` Ze Gao
  2023-08-01 11:34   ` Peter Zijlstra
  2023-08-01 11:45   ` Peter Zijlstra
  2023-08-01  9:01 ` [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct Ze Gao
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

Internal representations of task state are likely to be changed
or ordered, and reporting them to userspace without exporting
them as part of API is basically wrong, which can easily break
a userspace observability tool as kernel evolves. For example,
perf suffers from this and still reports wrong states as of this
writing.

OTOH, some masqueraded states like TASK_REPORT_IDLE and
TASK_REPORT_MAX are also reported inadvertently, which confuses
things even more and most userspace tools do not even take them
into consideration.

So add a new variable in company with the old raw value to
report task state in symbolic chars, which are self-explaining
and no further translation is needed. Of course this does not
break any userspace tool.

Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
the old conventions for the rest.

Signed-off-by: Ze Gao <zegao@tencent.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
---
 include/trace/events/sched.h | 54 +++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..e507901bcab8 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -6,6 +6,7 @@
 #define _TRACE_SCHED_H
 
 #include <linux/kthread.h>
+#include <linux/sched.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
@@ -214,6 +215,27 @@ static inline long __trace_sched_switch_state(bool preempt,
 
 	return state ? (1 << (state - 1)) : state;
 }
+
+static inline char __trace_sched_switch_state_char(bool preempt,
+						   unsigned int prev_state,
+						   struct task_struct *p)
+{
+	long state;
+
+#ifdef CONFIG_SCHED_DEBUG
+	BUG_ON(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
+
+	/*
+	 * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
+	 * conventions for the rest.
+	 */
+	if (preempt)
+		return 'p';
+
+	state = __task_state_index(prev_state, p->exit_state);
+	return task_index_to_char(state);
+}
 #endif /* CREATE_TRACE_POINTS */
 
 /*
@@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch,
 		__field(	pid_t,	prev_pid			)
 		__field(	int,	prev_prio			)
 		__field(	long,	prev_state			)
+		__field(	char,	prev_state_char			)
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
 		__field(	int,	next_prio			)
@@ -240,32 +263,19 @@ TRACE_EVENT(sched_switch,
 
 	TP_fast_assign(
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
-		__entry->prev_pid	= prev->pid;
-		__entry->prev_prio	= prev->prio;
-		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_pid		= prev->pid;
+		__entry->prev_prio		= prev->prio;
+		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
-		__entry->next_pid	= next->pid;
-		__entry->next_prio	= next->prio;
+		__entry->next_pid		= next->pid;
+		__entry->next_prio		= next->prio;
 		/* XXX SCHED_DEADLINE */
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
-		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
-		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
-		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
-				{ TASK_INTERRUPTIBLE, "S" },
-				{ TASK_UNINTERRUPTIBLE, "D" },
-				{ __TASK_STOPPED, "T" },
-				{ __TASK_TRACED, "t" },
-				{ EXIT_DEAD, "X" },
-				{ EXIT_ZOMBIE, "Z" },
-				{ TASK_PARKED, "P" },
-				{ TASK_DEAD, "I" }) :
-		  "R",
-
-		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
-		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
+		__entry->next_pid, __entry->next_prio)
 );
 
 /*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
                   ` (2 preceding siblings ...)
  2023-08-01  9:01 ` [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars Ze Gao
@ 2023-08-01  9:01 ` Ze Gao
  2023-08-01 11:46   ` Peter Zijlstra
  2023-08-01 14:16   ` Steven Rostedt
  2023-08-01  9:01 ` [RFC PATCH v3 5/6] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 6/6] libtraceevent: " Ze Gao
  5 siblings, 2 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

Report priorities in 'short' and prev_state in 'int' to save
some buffer space. And also reorder the fields so that we take
struct alignment into consideration to make the record compact.

Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ze Gao <zegao@tencent.com>
---
 include/trace/events/sched.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index e507901bcab8..36863ffb00c6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -188,7 +188,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_ARGS(p));
 
 #ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt,
+static inline int __trace_sched_switch_state(bool preempt,
 					      unsigned int prev_state,
 					      struct task_struct *p)
 {
@@ -251,25 +251,25 @@ TRACE_EVENT(sched_switch,
 	TP_ARGS(preempt, prev, next, prev_state),
 
 	TP_STRUCT__entry(
-		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
-		__field(	int,	prev_prio			)
-		__field(	long,	prev_state			)
-		__field(	char,	prev_state_char			)
-		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
-		__field(	int,	next_prio			)
+		__field(	short,	prev_prio			)
+		__field(	short,	next_prio			)
+		__field(	int,	prev_state			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	char,	prev_state_char			)
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid		= prev->pid;
-		__entry->prev_prio		= prev->prio;
+		__entry->next_pid		= next->pid;
+		__entry->prev_prio		= (short) prev->prio;
+		__entry->next_prio		= (short) next->prio;
 		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
 		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
-		__entry->next_pid		= next->pid;
-		__entry->next_prio		= next->prio;
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		/* XXX SCHED_DEADLINE */
 	),
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 5/6] perf sched: prefer to use prev_state_char introduced in sched_switch
  2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
                   ` (3 preceding siblings ...)
  2023-08-01  9:01 ` [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct Ze Gao
@ 2023-08-01  9:01 ` Ze Gao
  2023-08-01  9:01 ` [RFC PATCH v3 6/6] libtraceevent: " Ze Gao
  5 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we prefer to use
it to spare from knowing internal implementations in kernel.

Also we keep the old parsing logic intact but sync the state char
array with the latest kernel.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 275da655b67a..6ca60d4773d3 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -837,7 +837,7 @@ replay_wakeup_event(struct perf_sched *sched,
 
 static inline char task_state_char(int state)
 {
-	static const char state_to_char[] = "RSDTtXZPI";
+	static const char state_to_char[] = "RSDTtXZPIp";
 	unsigned bit = state ? ffs(state) : 0;
 	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
 }
@@ -845,8 +845,20 @@ static inline char task_state_char(int state)
 static inline char get_task_prev_state(struct evsel *evsel,
 				       struct perf_sample *sample)
 {
-	const int prev_state = evsel__intval(evsel, sample, "prev_state");
-	return task_state_char(prev_state);
+	char prev_state_char;
+	int prev_state;
+
+	//prefer to use prev_state_char
+	if (evsel__field(evsel, "prev_state_char"))
+		prev_state_char = (char) evsel__intval(evsel,
+				sample, "prev_state_char");
+	else {
+		prev_state = (int) evsel__intval(evsel,
+				sample, "prev_state");
+		prev_state_char = task_state_char(prev_state);
+	}
+
+	return prev_state_char;
 }
 
 static int replay_switch_event(struct perf_sched *sched,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v3 6/6] libtraceevent: prefer to use prev_state_char introduced in sched_switch
  2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
                   ` (4 preceding siblings ...)
  2023-08-01  9:01 ` [RFC PATCH v3 5/6] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
@ 2023-08-01  9:01 ` Ze Gao
  2023-08-01 14:19   ` Steven Rostedt
  5 siblings, 1 reply; 22+ messages in thread
From: Ze Gao @ 2023-08-01  9:01 UTC (permalink / raw)
  To: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we prefer to use
it to spare from knowing internal implementations in kernel.

Also we keep the old parsing logic intact but sync the state char
array with the latest kernel.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 plugins/plugin_sched_switch.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index 8752cae..4c57322 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -11,7 +11,7 @@
 
 static void write_state(struct trace_seq *s, int val)
 {
-	const char states[] = "SDTtZXxW";
+	const char states[] = "SDTtXZPIp";
 	int found = 0;
 	int i;
 
@@ -99,7 +99,12 @@ static int sched_switch_handler(struct trace_seq *s,
 	if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
 		trace_seq_printf(s, "[%d] ", (int) val);
 
-	if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
+	//find if has prev_state_char, otherwise fallback to prev_state
+	if (tep_find_field(event, "prev_state_char")) {
+		if (tep_get_field_val(s,  event, "prev_state_char", record, &val, 1) == 0)
+			trace_seq_putc(s, (char) val);
+	}
+	else if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
 		write_state(s, val);
 
 	trace_seq_puts(s, " ==> ");
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01  9:01 ` [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars Ze Gao
@ 2023-08-01 11:34   ` Peter Zijlstra
  2023-08-01 13:03     ` Ze Gao
  2023-08-01 11:45   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2023-08-01 11:34 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote:
> Internal representations of task state are likely to be changed
> or ordered, and reporting them to userspace without exporting
> them as part of API is basically wrong, which can easily break
> a userspace observability tool as kernel evolves. For example,
> perf suffers from this and still reports wrong states as of this
> writing.
> 
> OTOH, some masqueraded states like TASK_REPORT_IDLE and
> TASK_REPORT_MAX are also reported inadvertently, which confuses
> things even more and most userspace tools do not even take them
> into consideration.
> 
> So add a new variable in company with the old raw value to
> report task state in symbolic chars, which are self-explaining
> and no further translation is needed. Of course this does not
> break any userspace tool.
> 
> Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> the old conventions for the rest.

*sigh*... just because userspace if daft, we need to change the kernel?

Why do we need this character anyway, why not just print the state in
hex and leave it at that? These single character state things are a
relic, please just let them die.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01  9:01 ` [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars Ze Gao
  2023-08-01 11:34   ` Peter Zijlstra
@ 2023-08-01 11:45   ` Peter Zijlstra
  2023-08-01 13:08     ` Ze Gao
  2023-08-01 14:27     ` Steven Rostedt
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2023-08-01 11:45 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote:
> @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch,
>  		__field(	pid_t,	prev_pid			)
>  		__field(	int,	prev_prio			)
>  		__field(	long,	prev_state			)
> +		__field(	char,	prev_state_char			)
>  		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
>  		__field(	int,	next_prio			)

And this again will wreck everybody that consumes the raw tracepoint
without looking at tracefs.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01  9:01 ` [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct Ze Gao
@ 2023-08-01 11:46   ` Peter Zijlstra
  2023-08-01 13:16     ` Ze Gao
  2023-08-01 14:33     ` Steven Rostedt
  2023-08-01 14:16   ` Steven Rostedt
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2023-08-01 11:46 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> Report priorities in 'short' and prev_state in 'int' to save
> some buffer space. And also reorder the fields so that we take
> struct alignment into consideration to make the record compact.
> 
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>

I don't see a single line describing the effort you've done to audit
consumers of this tracepoint.

*IF* you're wanting to break this tracepoint ABI, because seriously
that's what it is, then you get to invest the time and effort to audit
the users.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01 11:34   ` Peter Zijlstra
@ 2023-08-01 13:03     ` Ze Gao
  2023-08-01 13:41       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Ze Gao @ 2023-08-01 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, Aug 1, 2023 at 7:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote:
> > Internal representations of task state are likely to be changed
> > or ordered, and reporting them to userspace without exporting
> > them as part of API is basically wrong, which can easily break
> > a userspace observability tool as kernel evolves. For example,
> > perf suffers from this and still reports wrong states as of this
> > writing.
> >
> > OTOH, some masqueraded states like TASK_REPORT_IDLE and
> > TASK_REPORT_MAX are also reported inadvertently, which confuses
> > things even more and most userspace tools do not even take them
> > into consideration.
> >
> > So add a new variable in company with the old raw value to
> > report task state in symbolic chars, which are self-explaining
> > and no further translation is needed. Of course this does not
> > break any userspace tool.
> >
> > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> > the old conventions for the rest.
>
> *sigh*... just because userspace if daft, we need to change the kernel?

Hi Peter,

Sorry that I don't quite agree with you on this one.

It's just the design that exporting internal details is fundamentally wrong.
And even worse,  I did not see any userspace tool is aware of masqueraded
states like TASK_REPORT_IDLE and TASK_REPORT_MAX and let alone
parse it correctly.  This confused me a lot when I decided to write my own bpf
version of sched-latency analysis tool and only after I figured out everything
underneath, I started to make things right here.

Again, I mean it's not me that deliberately "breaks" ABI here and I am
never meant
to upset anyone.  My confusion is why did people forget to update in-tree perf
the very last time they decide to rearrange the task state mapping
since we all agree
this is important "ABI" here.  I don't think it's the tool's fault.
And that's my initiative
to request this RFC.

> Why do we need this character anyway, why not just print the state in
> hex and leave it at that? These single character state things are a
> relic, please just let them die.

I believe hex is ok only after having the reported task state mapping
appear in the
uapi headers, otherwise it's still useless to userspace especially for
value like
TASK_REPORT_IDLE and TASK_REPORT_MAX, which need to dig into the
kernel to see what the hell is going on here.

Thoughts?

Regards,
Ze

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01 11:45   ` Peter Zijlstra
@ 2023-08-01 13:08     ` Ze Gao
  2023-08-01 14:27     ` Steven Rostedt
  1 sibling, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Sorry that I don't get this one, did you mean kernel subsystems like
bpf or third party modules?  Honestly I don't know how it works here
for userspace to consume the raw tracepoint without looking at
tracefs.

Regards,
Ze

On Tue, Aug 1, 2023 at 7:46 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote:
> > @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch,
> >               __field(        pid_t,  prev_pid                        )
> >               __field(        int,    prev_prio                       )
> >               __field(        long,   prev_state                      )
> > +             __field(        char,   prev_state_char                 )
> >               __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> >               __field(        int,    next_prio                       )
>
> And this again will wreck everybody that consumes the raw tracepoint
> without looking at tracefs.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01 11:46   ` Peter Zijlstra
@ 2023-08-01 13:16     ` Ze Gao
  2023-08-01 14:33     ` Steven Rostedt
  1 sibling, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-01 13:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Oops, I thought sending this series for RFC is the "effort" you mean
to audit the users :/

Correct me if I'm making stupid moves here and enlighten me what
I should do furthermore to audit the users.

Thanks,
Ze

On Tue, Aug 1, 2023 at 7:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> > Report priorities in 'short' and prev_state in 'int' to save
> > some buffer space. And also reorder the fields so that we take
> > struct alignment into consideration to make the record compact.
> >
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>
> I don't see a single line describing the effort you've done to audit
> consumers of this tracepoint.
>
> *IF* you're wanting to break this tracepoint ABI, because seriously
> that's what it is, then you get to invest the time and effort to audit
> the users.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01 13:03     ` Ze Gao
@ 2023-08-01 13:41       ` Peter Zijlstra
  2023-08-02  3:03         ` Ze Gao
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2023-08-01 13:41 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, Aug 01, 2023 at 09:03:51PM +0800, Ze Gao wrote:

> It's just the design that exporting internal details is fundamentally wrong.

This is tracing... it wasn't supposed to be ABI (although it somehow
ended up being one). But even then, things like PF_foo get exposed in
procfs but even that we change.

The whole point of tracing is to see internals in order to figure out
wth is going wrong.

> And even worse,  I did not see any userspace tool is aware of masqueraded
> states like TASK_REPORT_IDLE and TASK_REPORT_MAX and let alone
> parse it correctly.

That's probably because I never use tools, I just look at the raw trace
output -- sometimes an impromptu awk script. I'm pretty sure I ran with
something like the below when I did the freezer rewrite -- or perhaps I
just stuck in trace_printk() and didn't even bother with the
tracepoints, I can't remember.

> > Why do we need this character anyway, why not just print the state in
> > hex and leave it at that? These single character state things are a
> > relic, please just let them die.
> 
> I believe hex is ok only after having the reported task state mapping
> appear in the uapi headers, otherwise it's still useless to userspace
> especially for value like TASK_REPORT_IDLE and TASK_REPORT_MAX, which
> need to dig into the kernel to see what the hell is going on here.
> 
> Thoughts?

If you're tracing the kernel, you had better know what the kernel is
doing, otherwise you get to keep the pieces.

Anyway, if you're doing BPF then why do you care about the trace event
at all, just attach to the raw tracepoint and consume @preemt, @prev,
@next and @prev_state.

---
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..cb0c1251729e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -186,36 +186,6 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
 	     TP_PROTO(struct task_struct *p),
 	     TP_ARGS(p));
 
-#ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt,
-					      unsigned int prev_state,
-					      struct task_struct *p)
-{
-	unsigned int state;
-
-#ifdef CONFIG_SCHED_DEBUG
-	BUG_ON(p != current);
-#endif /* CONFIG_SCHED_DEBUG */
-
-	/*
-	 * Preemption ignores task state, therefore preempted tasks are always
-	 * RUNNING (we will not have dequeued if state != RUNNING).
-	 */
-	if (preempt)
-		return TASK_REPORT_MAX;
-
-	/*
-	 * task_state_index() uses fls() and returns a value from 0-8 range.
-	 * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
-	 * it for left shift operation to get the correct task->state
-	 * mapping.
-	 */
-	state = __task_state_index(prev_state, p->exit_state);
-
-	return state ? (1 << (state - 1)) : state;
-}
-#endif /* CREATE_TRACE_POINTS */
-
 /*
  * Tracepoint for task switches, performed by the scheduler:
  */
@@ -242,29 +212,16 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
-		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_state	= prev_state | (preempt * TASK_STATE_MAX);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
 		/* XXX SCHED_DEADLINE */
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%x ==> next_comm=%s next_pid=%d next_prio=%d",
 		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
-		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
-		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
-				{ TASK_INTERRUPTIBLE, "S" },
-				{ TASK_UNINTERRUPTIBLE, "D" },
-				{ __TASK_STOPPED, "T" },
-				{ __TASK_TRACED, "t" },
-				{ EXIT_DEAD, "X" },
-				{ EXIT_ZOMBIE, "Z" },
-				{ TASK_PARKED, "P" },
-				{ TASK_DEAD, "I" }) :
-		  "R",
-
-		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
+		__entry->prev_state,
 		__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01  9:01 ` [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct Ze Gao
  2023-08-01 11:46   ` Peter Zijlstra
@ 2023-08-01 14:16   ` Steven Rostedt
  2023-08-02  3:07     ` Ze Gao
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2023-08-01 14:16 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue,  1 Aug 2023 17:01:22 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> Report priorities in 'short' and prev_state in 'int' to save
> some buffer space. And also reorder the fields so that we take
> struct alignment into consideration to make the record compact.
> 
> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Signed-off-by: Ze Gao <zegao@tencent.com>

I'd swap this patch with patch 3. That is, make the field changes first.
I'd like this to get in regardless of if the state_char is accepted. We may
want to get this in first to see if there's any regressions before we add a
state_char.

-- Steve


> ---
>  include/trace/events/sched.h | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index e507901bcab8..36863ffb00c6 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -188,7 +188,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
>  	     TP_ARGS(p));
>  
>  #ifdef CREATE_TRACE_POINTS
> -static inline long __trace_sched_switch_state(bool preempt,
> +static inline int __trace_sched_switch_state(bool preempt,
>  					      unsigned int prev_state,
>  					      struct task_struct *p)
>  {
> @@ -251,25 +251,25 @@ TRACE_EVENT(sched_switch,
>  	TP_ARGS(preempt, prev, next, prev_state),
>  
>  	TP_STRUCT__entry(
> -		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
> -		__field(	int,	prev_prio			)
> -		__field(	long,	prev_state			)
> -		__field(	char,	prev_state_char			)
> -		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
> -		__field(	int,	next_prio			)
> +		__field(	short,	prev_prio			)
> +		__field(	short,	next_prio			)
> +		__field(	int,	prev_state			)
> +		__array(	char,	prev_comm,	TASK_COMM_LEN	)
> +		__array(	char,	next_comm,	TASK_COMM_LEN	)
> +		__field(	char,	prev_state_char			)
>  	),
>  
>  	TP_fast_assign(
> -		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>  		__entry->prev_pid		= prev->pid;
> -		__entry->prev_prio		= prev->prio;
> +		__entry->next_pid		= next->pid;
> +		__entry->prev_prio		= (short) prev->prio;
> +		__entry->next_prio		= (short) next->prio;
>  		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
>  		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
>  		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> -		__entry->next_pid		= next->pid;
> -		__entry->next_prio		= next->prio;
> +		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>  		/* XXX SCHED_DEADLINE */
>  	),
>  


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 6/6] libtraceevent: prefer to use prev_state_char introduced in sched_switch
  2023-08-01  9:01 ` [RFC PATCH v3 6/6] libtraceevent: " Ze Gao
@ 2023-08-01 14:19   ` Steven Rostedt
  2023-08-02  3:08     ` Ze Gao
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2023-08-01 14:19 UTC (permalink / raw)
  To: Ze Gao
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue,  1 Aug 2023 17:01:24 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> Since the sched_switch tracepoint introduces a new variable to
> report sched-out task state in symbolic char, we prefer to use
> it to spare from knowing internal implementations in kernel.
> 
> Also we keep the old parsing logic intact but sync the state char
> array with the latest kernel.

This should be two patches. First sync the state char array and then add
your state_char change. The two changes are agnostic to each other, and
should be separate commits. Same goes for the perf changes.

-- Steve


> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  plugins/plugin_sched_switch.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
> index 8752cae..4c57322 100644
> --- a/plugins/plugin_sched_switch.c
> +++ b/plugins/plugin_sched_switch.c
> @@ -11,7 +11,7 @@
>  
>  static void write_state(struct trace_seq *s, int val)
>  {
> -	const char states[] = "SDTtZXxW";
> +	const char states[] = "SDTtXZPIp";
>  	int found = 0;
>  	int i;
>  
> @@ -99,7 +99,12 @@ static int sched_switch_handler(struct trace_seq *s,
>  	if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
>  		trace_seq_printf(s, "[%d] ", (int) val);
>  
> -	if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
> +	//find if has prev_state_char, otherwise fallback to prev_state
> +	if (tep_find_field(event, "prev_state_char")) {
> +		if (tep_get_field_val(s,  event, "prev_state_char", record, &val, 1) == 0)
> +			trace_seq_putc(s, (char) val);
> +	}
> +	else if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
>  		write_state(s, val);
>  
>  	trace_seq_puts(s, " ==> ");


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01 11:45   ` Peter Zijlstra
  2023-08-01 13:08     ` Ze Gao
@ 2023-08-01 14:27     ` Steven Rostedt
  1 sibling, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2023-08-01 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ze Gao, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, Namhyung Kim, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, 1 Aug 2023 13:45:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Aug 01, 2023 at 05:01:21PM +0800, Ze Gao wrote:
> > @@ -233,6 +255,7 @@ TRACE_EVENT(sched_switch,
> >  		__field(	pid_t,	prev_pid			)
> >  		__field(	int,	prev_prio			)
> >  		__field(	long,	prev_state			)
> > +		__field(	char,	prev_state_char			)
> >  		__array(	char,	next_comm,	TASK_COMM_LEN	)
> >  		__field(	pid_t,	next_pid			)
> >  		__field(	int,	next_prio			)  
> 
> And this again will wreck everybody that consumes the raw tracepoint
> without looking at tracefs.

Nobody does that anymore, as the events change constantly, and are
different on different kernels. Powertop (the tool that caused us pain
before by using raw values) had to break down and use libtraceevent,
because it would break if there was a 32 bit version running on a 64 bit
kernel.

I've changed the offsets of raw events a few times and nobody has
complained since.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01 11:46   ` Peter Zijlstra
  2023-08-01 13:16     ` Ze Gao
@ 2023-08-01 14:33     ` Steven Rostedt
  2023-08-02  3:06       ` Ze Gao
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2023-08-01 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ze Gao, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, Namhyung Kim, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, 1 Aug 2023 13:46:50 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> > Report priorities in 'short' and prev_state in 'int' to save
> > some buffer space. And also reorder the fields so that we take
> > struct alignment into consideration to make the record compact.
> > 
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>  
> 
> I don't see a single line describing the effort you've done to audit
> consumers of this tracepoint.
> 
> *IF* you're wanting to break this tracepoint ABI, because seriously
> that's what it is, then you get to invest the time and effort to audit
> the users.

The known major users that I am aware of is raesdaemon,
powertop/latencytop, perf, trace-cmd and some bpf tools. The bpf tooling is
known to update per kernel. The others all use libtraceevent that can
handle this change.

What other tools are there? There's Perfetto, but it also looks at tracefs
to examine where the values are. There's LTTng, but I believe it uses the
raw tracepoint directly and doesn't look at the layout of the ftrace/perf
buffers.

All other tooling I am slightly aware of uses libtracefs and libtraceveent,
as I've been giving many talks on how to use those libraries.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars
  2023-08-01 13:41       ` Peter Zijlstra
@ 2023-08-02  3:03         ` Ze Gao
  0 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-02  3:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Steven Rostedt, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Tue, Aug 1, 2023 at 9:42 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 01, 2023 at 09:03:51PM +0800, Ze Gao wrote:
>
> > It's just the design that exporting internal details is fundamentally wrong.
>
> This is tracing... it wasn't supposed to be ABI (although it somehow

Sorry, I'm confused. And it sounds contradicting here
because you said this change is abi break before.

> ended up being one). But even then, things like PF_foo get exposed in
> procfs but even that we change.
>
> The whole point of tracing is to see internals in order to figure out
> wth is going wrong.

Fair point, but I think tracepoints are somewhat different from
kprobes/raw_tracepoints due to their stable nature. And I think
at least more and more observability tools use them in this way.

With all due respect, If it was for kernel developers only, what's the
point of leaking this since now we have raw tracepoints support?
Does that mean all tracepoints are useless now? Apparently the
answer is no. So I'm not convinced by this "for internal inspecting"
defense to not ignore what the problem is.

Honestly, I would've never thought to change this if I I got the correct
meaning for values I captured like 0x80/0x100  when I tried to look it up
in include/linux/sched.h the first time. But it really annoyed me to
figure out what it is only after I dived into the kernel and collected all
the pieces.  And you know what, when I turned to the famous in-tree perf
for possible help, only to find out it's horribly broken and still uses an
outdated state char array to interpret this weird raw value.

Anyway, we have to accept the fact that prev_state leaves a huge burden
on its users to make things right.  And I'm open and glad to see any
solutions (possibly better than this one) or efforts or suggestions to improve
this.

Regards,
Ze

> > And even worse,  I did not see any userspace tool is aware of masqueraded
> > states like TASK_REPORT_IDLE and TASK_REPORT_MAX and let alone
> > parse it correctly.
>
> That's probably because I never use tools, I just look at the raw trace
> output -- sometimes an impromptu awk script. I'm pretty sure I ran with
> something like the below when I did the freezer rewrite -- or perhaps I
> just stuck in trace_printk() and didn't even bother with the
> tracepoints, I can't remember.
>
> > > Why do we need this character anyway, why not just print the state in
> > > hex and leave it at that? These single character state things are a
> > > relic, please just let them die.
> >
> > I believe hex is ok only after having the reported task state mapping
> > appear in the uapi headers, otherwise it's still useless to userspace
> > especially for value like TASK_REPORT_IDLE and TASK_REPORT_MAX, which
> > need to dig into the kernel to see what the hell is going on here.
> >
> > Thoughts?
>
> If you're tracing the kernel, you had better know what the kernel is
> doing, otherwise you get to keep the pieces.
>
> Anyway, if you're doing BPF then why do you care about the trace event
> at all, just attach to the raw tracepoint and consume @preemt, @prev,
> @next and @prev_state.
>
> ---
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index fbb99a61f714..cb0c1251729e 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -186,36 +186,6 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
>              TP_PROTO(struct task_struct *p),
>              TP_ARGS(p));
>
> -#ifdef CREATE_TRACE_POINTS
> -static inline long __trace_sched_switch_state(bool preempt,
> -                                             unsigned int prev_state,
> -                                             struct task_struct *p)
> -{
> -       unsigned int state;
> -
> -#ifdef CONFIG_SCHED_DEBUG
> -       BUG_ON(p != current);
> -#endif /* CONFIG_SCHED_DEBUG */
> -
> -       /*
> -        * Preemption ignores task state, therefore preempted tasks are always
> -        * RUNNING (we will not have dequeued if state != RUNNING).
> -        */
> -       if (preempt)
> -               return TASK_REPORT_MAX;
> -
> -       /*
> -        * task_state_index() uses fls() and returns a value from 0-8 range.
> -        * Decrement it by 1 (except TASK_RUNNING state i.e 0) before using
> -        * it for left shift operation to get the correct task->state
> -        * mapping.
> -        */
> -       state = __task_state_index(prev_state, p->exit_state);
> -
> -       return state ? (1 << (state - 1)) : state;
> -}
> -#endif /* CREATE_TRACE_POINTS */
> -
>  /*
>   * Tracepoint for task switches, performed by the scheduler:
>   */
> @@ -242,29 +212,16 @@ TRACE_EVENT(sched_switch,
>                 memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
>                 __entry->prev_pid       = prev->pid;
>                 __entry->prev_prio      = prev->prio;
> -               __entry->prev_state     = __trace_sched_switch_state(preempt, prev_state, prev);
> +               __entry->prev_state     = prev_state | (preempt * TASK_STATE_MAX);
>                 memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
>                 __entry->next_pid       = next->pid;
>                 __entry->next_prio      = next->prio;
>                 /* XXX SCHED_DEADLINE */
>         ),
>
> -       TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
> +       TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%x ==> next_comm=%s next_pid=%d next_prio=%d",
>                 __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> -
> -               (__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> -                 __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> -                               { TASK_INTERRUPTIBLE, "S" },
> -                               { TASK_UNINTERRUPTIBLE, "D" },
> -                               { __TASK_STOPPED, "T" },
> -                               { __TASK_TRACED, "t" },
> -                               { EXIT_DEAD, "X" },
> -                               { EXIT_ZOMBIE, "Z" },
> -                               { TASK_PARKED, "P" },
> -                               { TASK_DEAD, "I" }) :
> -                 "R",
> -
> -               __entry->prev_state & TASK_REPORT_MAX ? "+" : "",
> +               __entry->prev_state,
>                 __entry->next_comm, __entry->next_pid, __entry->next_prio)
>  );
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01 14:33     ` Steven Rostedt
@ 2023-08-02  3:06       ` Ze Gao
  0 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-02  3:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, Namhyung Kim, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Thanks for clarifying this ! Steven. This is really helpful.

Regards,
Ze

On Tue, Aug 1, 2023 at 10:33 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 1 Aug 2023 13:46:50 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Aug 01, 2023 at 05:01:22PM +0800, Ze Gao wrote:
> > > Report priorities in 'short' and prev_state in 'int' to save
> > > some buffer space. And also reorder the fields so that we take
> > > struct alignment into consideration to make the record compact.
> > >
> > > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> >
> > I don't see a single line describing the effort you've done to audit
> > consumers of this tracepoint.
> >
> > *IF* you're wanting to break this tracepoint ABI, because seriously
> > that's what it is, then you get to invest the time and effort to audit
> > the users.
>
> The known major users that I am aware of is raesdaemon,
> powertop/latencytop, perf, trace-cmd and some bpf tools. The bpf tooling is
> known to update per kernel. The others all use libtraceevent that can
> handle this change.
>
> What other tools are there? There's Perfetto, but it also looks at tracefs
> to examine where the values are. There's LTTng, but I believe it uses the
> raw tracepoint directly and doesn't look at the layout of the ftrace/perf
> buffers.
>
> All other tooling I am slightly aware of uses libtracefs and libtraceveent,
> as I've been giving many talks on how to use those libraries.
>
> -- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct
  2023-08-01 14:16   ` Steven Rostedt
@ 2023-08-02  3:07     ` Ze Gao
  0 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-02  3:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Fair point, will do it in v4 as well.

Thanks,
Ze

On Tue, Aug 1, 2023 at 10:16 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue,  1 Aug 2023 17:01:22 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Report priorities in 'short' and prev_state in 'int' to save
> > some buffer space. And also reorder the fields so that we take
> > struct alignment into consideration to make the record compact.
> >
> > Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > Signed-off-by: Ze Gao <zegao@tencent.com>
>
> I'd swap this patch with patch 3. That is, make the field changes first.
> I'd like this to get in regardless of if the state_char is accepted. We may
> want to get this in first to see if there's any regressions before we add a
> state_char.
>
> -- Steve
>
>
> > ---
> >  include/trace/events/sched.h | 22 +++++++++++-----------
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index e507901bcab8..36863ffb00c6 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -188,7 +188,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
> >            TP_ARGS(p));
> >
> >  #ifdef CREATE_TRACE_POINTS
> > -static inline long __trace_sched_switch_state(bool preempt,
> > +static inline int __trace_sched_switch_state(bool preempt,
> >                                             unsigned int prev_state,
> >                                             struct task_struct *p)
> >  {
> > @@ -251,25 +251,25 @@ TRACE_EVENT(sched_switch,
> >       TP_ARGS(preempt, prev, next, prev_state),
> >
> >       TP_STRUCT__entry(
> > -             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> > -             __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > -             __field(        char,   prev_state_char                 )
> > -             __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> > -             __field(        int,    next_prio                       )
> > +             __field(        short,  prev_prio                       )
> > +             __field(        short,  next_prio                       )
> > +             __field(        int,    prev_state                      )
> > +             __array(        char,   prev_comm,      TASK_COMM_LEN   )
> > +             __array(        char,   next_comm,      TASK_COMM_LEN   )
> > +             __field(        char,   prev_state_char                 )
> >       ),
> >
> >       TP_fast_assign(
> > -             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> >               __entry->prev_pid               = prev->pid;
> > -             __entry->prev_prio              = prev->prio;
> > +             __entry->next_pid               = next->pid;
> > +             __entry->prev_prio              = (short) prev->prio;
> > +             __entry->next_prio              = (short) next->prio;
> >               __entry->prev_state             = __trace_sched_switch_state(preempt, prev_state, prev);
> >               __entry->prev_state_char        = __trace_sched_switch_state_char(preempt, prev_state, prev);
> >               memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > -             __entry->next_pid               = next->pid;
> > -             __entry->next_prio              = next->prio;
> > +             memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> >               /* XXX SCHED_DEADLINE */
> >       ),
> >
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v3 6/6] libtraceevent: prefer to use prev_state_char introduced in sched_switch
  2023-08-01 14:19   ` Steven Rostedt
@ 2023-08-02  3:08     ` Ze Gao
  0 siblings, 0 replies; 22+ messages in thread
From: Ze Gao @ 2023-08-02  3:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, Namhyung Kim, Peter Zijlstra, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Fair enough! Already did this in perf fixes. Will push a v4
to do this.

Thanks,
Ze


On Tue, Aug 1, 2023 at 10:19 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue,  1 Aug 2023 17:01:24 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Since the sched_switch tracepoint introduces a new variable to
> > report sched-out task state in symbolic char, we prefer to use
> > it to spare from knowing internal implementations in kernel.
> >
> > Also we keep the old parsing logic intact but sync the state char
> > array with the latest kernel.
>
> This should be two patches. First sync the state char array and then add
> your state_char change. The two changes are agnostic to each other, and
> should be separate commits. Same goes for the perf changes.
>
> -- Steve
>
>
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > ---
> >  plugins/plugin_sched_switch.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
> > index 8752cae..4c57322 100644
> > --- a/plugins/plugin_sched_switch.c
> > +++ b/plugins/plugin_sched_switch.c
> > @@ -11,7 +11,7 @@
> >
> >  static void write_state(struct trace_seq *s, int val)
> >  {
> > -     const char states[] = "SDTtZXxW";
> > +     const char states[] = "SDTtXZPIp";
> >       int found = 0;
> >       int i;
> >
> > @@ -99,7 +99,12 @@ static int sched_switch_handler(struct trace_seq *s,
> >       if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
> >               trace_seq_printf(s, "[%d] ", (int) val);
> >
> > -     if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
> > +     //find if has prev_state_char, otherwise fallback to prev_state
> > +     if (tep_find_field(event, "prev_state_char")) {
> > +             if (tep_get_field_val(s,  event, "prev_state_char", record, &val, 1) == 0)
> > +                     trace_seq_putc(s, (char) val);
> > +     }
> > +     else if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
> >               write_state(s, val);
> >
> >       trace_seq_puts(s, " ==> ");
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-08-02  3:08 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-01  9:01 [RFC PATCH v3 0/6] add to report task state in symbolic chars from sched tracepoint Ze Gao
2023-08-01  9:01 ` [RFC PATCH v3 1/6] perf sched: sync state char array with the kernel Ze Gao
2023-08-01  9:01 ` [RFC PATCH v3 2/6] perf sched: reorganize sched-out task state report code Ze Gao
2023-08-01  9:01 ` [RFC PATCH v3 3/6] sched, tracing: add to report task state in symbolic chars Ze Gao
2023-08-01 11:34   ` Peter Zijlstra
2023-08-01 13:03     ` Ze Gao
2023-08-01 13:41       ` Peter Zijlstra
2023-08-02  3:03         ` Ze Gao
2023-08-01 11:45   ` Peter Zijlstra
2023-08-01 13:08     ` Ze Gao
2023-08-01 14:27     ` Steven Rostedt
2023-08-01  9:01 ` [RFC PATCH v3 4/6] sched, tracing: reorganize fields of switch event struct Ze Gao
2023-08-01 11:46   ` Peter Zijlstra
2023-08-01 13:16     ` Ze Gao
2023-08-01 14:33     ` Steven Rostedt
2023-08-02  3:06       ` Ze Gao
2023-08-01 14:16   ` Steven Rostedt
2023-08-02  3:07     ` Ze Gao
2023-08-01  9:01 ` [RFC PATCH v3 5/6] perf sched: prefer to use prev_state_char introduced in sched_switch Ze Gao
2023-08-01  9:01 ` [RFC PATCH v3 6/6] libtraceevent: " Ze Gao
2023-08-01 14:19   ` Steven Rostedt
2023-08-02  3:08     ` Ze Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).