linux-trace-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint
@ 2023-07-26 12:16 Ze Gao
  2023-07-26 12:16 ` [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars Ze Gao
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-26 12:16 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Namhyung Kim
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao


This is the 2nd attempt to fix the report task state issue in sched
tracepint, here is the first version:

https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com

Against v1, add a new var to report task state in symbolic char instead
of replacing the old one and to not to break anything.

--

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch.  And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, instead of choosing
I proposed to add a new variable to report task state in sched_switch with
a symbolic character along with the old hardcoded value, and save the
further processing of userspace tools and spare them from knowing
implementation details in the kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.

Reviews welcome!

Regards,

Ze

Ze Gao (2):
  sched, tracing: add to report task state in symbolic chars
  perf sched: use the new prev_state_char instead in tracepoint
    sched_switch

 include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
 tools/perf/builtin-sched.c   | 57 ++++++----------------------------
 2 files changed, 45 insertions(+), 72 deletions(-)

Ze Gao (1):
  libtraceevent: use the new prev_state_char instead in tracepoint
    sched_switch

 plugins/plugin_sched_switch.c | 29 ++++-------------------------
 1 file changed, 4 insertions(+), 25 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars
  2023-07-26 12:16 [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ze Gao
@ 2023-07-26 12:16 ` Ze Gao
  2023-07-31  9:36   ` Peter Zijlstra
  2023-07-31 15:38   ` Steven Rostedt
  2023-07-26 12:16 ` [RFC PATCH v2 2/3] perf sched: use the new prev_state_char instead in tracepoint sched_switch Ze Gao
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-26 12:16 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Namhyung Kim
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

Internal representations of task state are likely to be changed or
ordered, and reporting them to userspace without exporting them as
part of API is not a good choice, which can easily break a userspace
observability tool as kernel evolves. For example, perf suffers from
this and still reports wrong states by this patch.

OTOH, some masqueraded state like TASK_REPORT_IDLE and TASK_REPORT_MAX
are also reported inadvertently, which confuses things even more.

So add a new variable in company with the old raw value to report task
state in symbolic char, which is self-explaining and no further
translation is needed, and also report priorities in 'short' to save
some buffer space.  Of course this does not break any userspace tool.

Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
conventions for the rest.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 25 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..0fcf68f49e45 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -6,6 +6,7 @@
 #define _TRACE_SCHED_H
 
 #include <linux/kthread.h>
+#include <linux/sched.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
@@ -214,6 +215,27 @@ static inline long __trace_sched_switch_state(bool preempt,
 
 	return state ? (1 << (state - 1)) : state;
 }
+
+static inline char __trace_sched_switch_state_char(bool preempt,
+						   unsigned int prev_state,
+						   struct task_struct *p)
+{
+	long state;
+
+#ifdef CONFIG_SCHED_DEBUG
+	BUG_ON(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
+
+	/*
+	 * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
+	 * conventions for the rest.
+	 */
+	if (preempt)
+		return 'p';
+
+	state = __task_state_index(prev_state, p->exit_state);
+	return task_index_to_char(state);
+}
 #endif /* CREATE_TRACE_POINTS */
 
 /*
@@ -231,41 +253,29 @@ TRACE_EVENT(sched_switch,
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
-		__field(	int,	prev_prio			)
-		__field(	long,	prev_state			)
+		__field(	short,	prev_prio			)
+		__field(	int,	prev_state			)
+		__field(	char,	prev_state_char			)
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
-		__field(	int,	next_prio			)
+		__field(	short,	next_prio			)
 	),
 
 	TP_fast_assign(
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
-		__entry->prev_pid	= prev->pid;
-		__entry->prev_prio	= prev->prio;
-		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_pid		= prev->pid;
+		__entry->prev_prio		= (short) prev->prio;
+		__entry->prev_state		= __trace_sched_switch_state(preempt, prev_state, prev);
+		__entry->prev_state_char	= __trace_sched_switch_state_char(preempt, prev_state, prev);
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
-		__entry->next_pid	= next->pid;
-		__entry->next_prio	= next->prio;
+		__entry->next_pid		= next->pid;
+		__entry->next_prio		= (short) next->prio;
 		/* XXX SCHED_DEADLINE */
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
-		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
-		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
-		  __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
-				{ TASK_INTERRUPTIBLE, "S" },
-				{ TASK_UNINTERRUPTIBLE, "D" },
-				{ __TASK_STOPPED, "T" },
-				{ __TASK_TRACED, "t" },
-				{ EXIT_DEAD, "X" },
-				{ EXIT_ZOMBIE, "Z" },
-				{ TASK_PARKED, "P" },
-				{ TASK_DEAD, "I" }) :
-		  "R",
-
-		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
-		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
+		__entry->next_pid, __entry->next_prio)
 );
 
 /*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 2/3] perf sched: use the new prev_state_char instead in tracepoint sched_switch
  2023-07-26 12:16 [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ze Gao
  2023-07-26 12:16 ` [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars Ze Gao
@ 2023-07-26 12:16 ` Ze Gao
  2023-07-26 12:16 ` [RFC PATCH v2 3/3] libtraceevent: " Ze Gao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-26 12:16 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Namhyung Kim
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we switch to use
that instead to spare from knowing internal implementations
in kernel.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 tools/perf/builtin-sched.c | 57 +++++++-------------------------------
 1 file changed, 10 insertions(+), 47 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index cc4ba506e119..e3b31d526bc8 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,24 +92,6 @@ struct sched_atom {
 	struct task_desc	*wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
-
-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING		0
-#define TASK_INTERRUPTIBLE	1
-#define TASK_UNINTERRUPTIBLE	2
-#define __TASK_STOPPED		4
-#define __TASK_TRACED		8
-/* in tsk->exit_state */
-#define EXIT_DEAD		16
-#define EXIT_ZOMBIE		32
-#define EXIT_TRACE		(EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD		64
-#define TASK_WAKEKILL		128
-#define TASK_WAKING		256
-#define TASK_PARKED		512
-
 enum thread_state {
 	THREAD_SLEEPING = 0,
 	THREAD_WAIT_CPU,
@@ -266,7 +248,7 @@ struct thread_runtime {
 	u64 total_preempt_time;
 	u64 total_delay_time;
 
-	int last_state;
+	char last_state;
 
 	char shortname[3];
 	bool comm_changed;
@@ -436,7 +418,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
 }
 
 static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
-				  u64 timestamp, u64 task_state __maybe_unused)
+				  u64 timestamp, char task_state __maybe_unused)
 {
 	struct sched_atom *event = get_new_event(task, timestamp);
 
@@ -860,7 +842,7 @@ static int replay_switch_event(struct perf_sched *sched,
 		   *next_comm  = evsel__strval(evsel, sample, "next_comm");
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = evsel__intval(evsel, sample, "prev_state_char");
 	struct task_desc *prev, __maybe_unused *next;
 	u64 timestamp0, timestamp = sample->time;
 	int cpu = sample->cpu;
@@ -1050,13 +1032,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
 	return 0;
 }
 
-static char sched_out_state(u64 prev_state)
-{
-	const char *str = TASK_STATE_TO_CHAR_STR;
-
-	return str[prev_state];
-}
-
 static int
 add_sched_out_event(struct work_atoms *atoms,
 		    char run_state,
@@ -1132,7 +1107,7 @@ static int latency_switch_event(struct perf_sched *sched,
 {
 	const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
 		  next_pid = evsel__intval(evsel, sample, "next_pid");
-	const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+	const char prev_state = evsel__intval(evsel, sample, "prev_state_char");
 	struct work_atoms *out_events, *in_events;
 	struct thread *sched_out, *sched_in;
 	u64 timestamp0, timestamp = sample->time;
@@ -1168,7 +1143,7 @@ static int latency_switch_event(struct perf_sched *sched,
 			goto out_put;
 		}
 	}
-	if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+	if (add_sched_out_event(out_events, prev_state, timestamp))
 		return -1;
 
 	in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2026,24 +2001,12 @@ static void timehist_header(struct perf_sched *sched)
 	printf("\n");
 }
 
-static char task_state_char(struct thread *thread, int state)
-{
-	static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
-	unsigned bit = state ? ffs(state) : 0;
-
-	/* 'I' for idle */
-	if (thread->tid == 0)
-		return 'I';
-
-	return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
 static void timehist_print_sample(struct perf_sched *sched,
 				  struct evsel *evsel,
 				  struct perf_sample *sample,
 				  struct addr_location *al,
 				  struct thread *thread,
-				  u64 t, int state)
+				  u64 t, char state)
 {
 	struct thread_runtime *tr = thread__priv(thread);
 	const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2084,7 +2047,7 @@ static void timehist_print_sample(struct perf_sched *sched,
 	print_sched_time(tr->dt_run, 6);
 
 	if (sched->show_state)
-		printf(" %5c ", task_state_char(thread, state));
+		printf(" %5c ", thread->tid == 0 ? 'I' : state);
 
 	if (sched->show_next) {
 		snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2156,9 +2119,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
 		else if (r->last_time) {
 			u64 dt_wait = tprev - r->last_time;
 
-			if (r->last_state == TASK_RUNNING)
+			if (r->last_state == 'R' || r->last_state == 'p')
 				r->dt_preempt = dt_wait;
-			else if (r->last_state == TASK_UNINTERRUPTIBLE)
+			else if (r->last_state == 'D')
 				r->dt_iowait = dt_wait;
 			else
 				r->dt_sleep = dt_wait;
@@ -2575,7 +2538,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
 	struct thread_runtime *tr = NULL;
 	u64 tprev, t = sample->time;
 	int rc = 0;
-	int state = evsel__intval(evsel, sample, "prev_state");
+	char state = evsel__intval(evsel, sample, "prev_state_char");
 
 	if (machine__resolve(machine, &al, sample) < 0) {
 		pr_err("problem processing %d event. skipping it\n",
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 3/3] libtraceevent: use the new prev_state_char instead in tracepoint sched_switch
  2023-07-26 12:16 [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ze Gao
  2023-07-26 12:16 ` [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars Ze Gao
  2023-07-26 12:16 ` [RFC PATCH v2 2/3] perf sched: use the new prev_state_char instead in tracepoint sched_switch Ze Gao
@ 2023-07-26 12:16 ` Ze Gao
  2023-07-31 15:40   ` Steven Rostedt
  2023-07-28 17:29 ` [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ian Rogers
  2023-07-31  3:52 ` Masami Hiramatsu
  4 siblings, 1 reply; 15+ messages in thread
From: Ze Gao @ 2023-07-26 12:16 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Namhyung Kim
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ian Rogers, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we switch to use
that instead to spare from knowing internal implementations
in kernel.

Signed-off-by: Ze Gao <zegao@tencent.com>
---
 plugins/plugin_sched_switch.c | 29 ++++-------------------------
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index 8752cae..5dbdc28 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -9,27 +9,6 @@
 #include "event-parse.h"
 #include "trace-seq.h"
 
-static void write_state(struct trace_seq *s, int val)
-{
-	const char states[] = "SDTtZXxW";
-	int found = 0;
-	int i;
-
-	for (i = 0; i < (sizeof(states) - 1); i++) {
-		if (!(val & (1 << i)))
-			continue;
-
-		if (found)
-			trace_seq_putc(s, '|');
-
-		found = 1;
-		trace_seq_putc(s, states[i]);
-	}
-
-	if (!found)
-		trace_seq_putc(s, 'R');
-}
-
 static void write_and_save_comm(struct tep_format_field *field,
 				struct tep_record *record,
 				struct trace_seq *s, int pid)
@@ -97,10 +76,10 @@ static int sched_switch_handler(struct trace_seq *s,
 	trace_seq_printf(s, "%lld ", val);
 
 	if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
-		trace_seq_printf(s, "[%d] ", (int) val);
+		trace_seq_printf(s, "[%d] ", (short) val);
 
-	if (tep_get_field_val(s,  event, "prev_state", record, &val, 1) == 0)
-		write_state(s, val);
+	if (tep_get_field_val(s,  event, "prev_state_char", record, &val, 1) == 0)
+		trace_seq_putc(s, (char) val);
 
 	trace_seq_puts(s, " ==> ");
 
@@ -115,7 +94,7 @@ static int sched_switch_handler(struct trace_seq *s,
 	trace_seq_printf(s, "%lld", val);
 
 	if (tep_get_field_val(s, event, "next_prio", record, &val, 1) == 0)
-		trace_seq_printf(s, " [%d]", (int) val);
+		trace_seq_printf(s, " [%d]", (short) val);
 
 	return 0;
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint
  2023-07-26 12:16 [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ze Gao
                   ` (2 preceding siblings ...)
  2023-07-26 12:16 ` [RFC PATCH v2 3/3] libtraceevent: " Ze Gao
@ 2023-07-28 17:29 ` Ian Rogers
  2023-07-31  2:09   ` Ze Gao
  2023-07-31  3:52 ` Masami Hiramatsu
  4 siblings, 1 reply; 15+ messages in thread
From: Ian Rogers @ 2023-07-28 17:29 UTC (permalink / raw)
  To: Ze Gao
  Cc: Peter Zijlstra, Steven Rostedt, Namhyung Kim, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Masami Hiramatsu, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

On Wed, Jul 26, 2023 at 5:16 AM Ze Gao <zegao2021@gmail.com> wrote:
>
>
> This is the 2nd attempt to fix the report task state issue in sched
> tracepint, here is the first version:
>
> https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
>
> Against v1, add a new var to report task state in symbolic char instead
> of replacing the old one and to not to break anything.
>
> --
>
> In the status quo, we should see three different outcomes of the reported
> sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> of tracepoint sched_switch.  And it's not hard to figure out that the
> former two are built upon the third one, and the reason why we see this
> inconsistency is that the former two does not catch up with the internal
> change of reported task state definitions as the kernel evolves.
>
> IMHO, exporting internal representations of task state in the tracepoint
> sched_switch is not a good practice and not encouraged at all, which can
> easily break userspace tools that relies on it. Especially when tracepoints
> are massively used in many observability tools nowadays due to its stable
> nature, which makes them no longer used for debug only purpose and we
> should be careful to decide what ought to be reported to userspace and what
> ought not.
>
> Therefore, to fix the issues mentioned above for good, instead of choosing
> I proposed to add a new variable to report task state in sched_switch with
> a symbolic character along with the old hardcoded value, and save the
> further processing of userspace tools and spare them from knowing
> implementation details in the kernel.
>
> After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.
>
> Reviews welcome!

Thanks Ze,

I think this is worthwhile cleanup and makes the code overall simpler.
I don't know if others have strong opinions, I don't often work in
this code, but I think the patches are worth landing this.

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> Regards,
>
> Ze
>
> Ze Gao (2):
>   sched, tracing: add to report task state in symbolic chars
>   perf sched: use the new prev_state_char instead in tracepoint
>     sched_switch
>
>  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
>  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
>  2 files changed, 45 insertions(+), 72 deletions(-)
>
> Ze Gao (1):
>   libtraceevent: use the new prev_state_char instead in tracepoint
>     sched_switch
>
>  plugins/plugin_sched_switch.c | 29 ++++-------------------------
>  1 file changed, 4 insertions(+), 25 deletions(-)
>
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint
  2023-07-28 17:29 ` [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ian Rogers
@ 2023-07-31  2:09   ` Ze Gao
  0 siblings, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-31  2:09 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Steven Rostedt, Namhyung Kim, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Masami Hiramatsu, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Thanks Ian,

In regard to ABI, symbolic chars are much more stable and I think
we can benefit from this in the long run.

Regards,
Ze

On Sat, Jul 29, 2023 at 1:29 AM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Jul 26, 2023 at 5:16 AM Ze Gao <zegao2021@gmail.com> wrote:
> >
> >
> > This is the 2nd attempt to fix the report task state issue in sched
> > tracepint, here is the first version:
> >
> > https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
> >
> > Against v1, add a new var to report task state in symbolic char instead
> > of replacing the old one and to not to break anything.
> >
> > --
> >
> > In the status quo, we should see three different outcomes of the reported
> > sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> > of tracepoint sched_switch.  And it's not hard to figure out that the
> > former two are built upon the third one, and the reason why we see this
> > inconsistency is that the former two does not catch up with the internal
> > change of reported task state definitions as the kernel evolves.
> >
> > IMHO, exporting internal representations of task state in the tracepoint
> > sched_switch is not a good practice and not encouraged at all, which can
> > easily break userspace tools that relies on it. Especially when tracepoints
> > are massively used in many observability tools nowadays due to its stable
> > nature, which makes them no longer used for debug only purpose and we
> > should be careful to decide what ought to be reported to userspace and what
> > ought not.
> >
> > Therefore, to fix the issues mentioned above for good, instead of choosing
> > I proposed to add a new variable to report task state in sched_switch with
> > a symbolic character along with the old hardcoded value, and save the
> > further processing of userspace tools and spare them from knowing
> > implementation details in the kernel.
> >
> > After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> > a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.
> >
> > Reviews welcome!
>
> Thanks Ze,
>
> I think this is worthwhile cleanup and makes the code overall simpler.
> I don't know if others have strong opinions, I don't often work in
> this code, but I think the patches are worth landing this.
>
> Acked-by: Ian Rogers <irogers@google.com>
>
> Thanks,
> Ian
>
> > Regards,
> >
> > Ze
> >
> > Ze Gao (2):
> >   sched, tracing: add to report task state in symbolic chars
> >   perf sched: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
> >  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
> >  2 files changed, 45 insertions(+), 72 deletions(-)
> >
> > Ze Gao (1):
> >   libtraceevent: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  plugins/plugin_sched_switch.c | 29 ++++-------------------------
> >  1 file changed, 4 insertions(+), 25 deletions(-)
> >
> > --
> > 2.40.1
> >

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint
  2023-07-26 12:16 [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ze Gao
                   ` (3 preceding siblings ...)
  2023-07-28 17:29 ` [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ian Rogers
@ 2023-07-31  3:52 ` Masami Hiramatsu
  2023-07-31  6:19   ` Ze Gao
  4 siblings, 1 reply; 15+ messages in thread
From: Masami Hiramatsu @ 2023-07-31  3:52 UTC (permalink / raw)
  To: Ze Gao
  Cc: Peter Zijlstra, Steven Rostedt, Namhyung Kim, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Mark Rutland, Masami Hiramatsu,
	linux-kernel, linux-perf-users, linux-trace-kernel,
	linux-trace-devel, Ze Gao

On Wed, 26 Jul 2023 20:16:15 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> 
> This is the 2nd attempt to fix the report task state issue in sched
> tracepint, here is the first version:
> 
> https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
> 
> Against v1, add a new var to report task state in symbolic char instead
> of replacing the old one and to not to break anything.
> 
> --
> 
> In the status quo, we should see three different outcomes of the reported
> sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> of tracepoint sched_switch.  And it's not hard to figure out that the
> former two are built upon the third one, and the reason why we see this
> inconsistency is that the former two does not catch up with the internal
> change of reported task state definitions as the kernel evolves.
> 
> IMHO, exporting internal representations of task state in the tracepoint
> sched_switch is not a good practice and not encouraged at all, which can
> easily break userspace tools that relies on it. Especially when tracepoints
> are massively used in many observability tools nowadays due to its stable
> nature, which makes them no longer used for debug only purpose and we
> should be careful to decide what ought to be reported to userspace and what
> ought not.
> 
> Therefore, to fix the issues mentioned above for good, instead of choosing
> I proposed to add a new variable to report task state in sched_switch with
> a symbolic character along with the old hardcoded value, and save the
> further processing of userspace tools and spare them from knowing
> implementation details in the kernel.
> 
> After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.

This series looks good to me. Putting the flag in the trace record is
a good idea :)

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

for this series.

Thank you,

> 
> Reviews welcome!
> 
> Regards,
> 
> Ze
> 
> Ze Gao (2):
>   sched, tracing: add to report task state in symbolic chars
>   perf sched: use the new prev_state_char instead in tracepoint
>     sched_switch
> 
>  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
>  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
>  2 files changed, 45 insertions(+), 72 deletions(-)
> 
> Ze Gao (1):
>   libtraceevent: use the new prev_state_char instead in tracepoint
>     sched_switch
> 
>  plugins/plugin_sched_switch.c | 29 ++++-------------------------
>  1 file changed, 4 insertions(+), 25 deletions(-)
> 
> -- 
> 2.40.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint
  2023-07-31  3:52 ` Masami Hiramatsu
@ 2023-07-31  6:19   ` Ze Gao
  0 siblings, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-31  6:19 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Steven Rostedt, Namhyung Kim, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Mark Rutland, linux-kernel,
	linux-perf-users, linux-trace-kernel, linux-trace-devel, Ze Gao

Hi Masami,

Thanks for your review!

Sincerely,
Ze

On Mon, Jul 31, 2023 at 11:53 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 26 Jul 2023 20:16:15 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> >
> > This is the 2nd attempt to fix the report task state issue in sched
> > tracepint, here is the first version:
> >
> > https://lore.kernel.org/linux-trace-kernel/20230725072254.32045-1-zegao@tencent.com
> >
> > Against v1, add a new var to report task state in symbolic char instead
> > of replacing the old one and to not to break anything.
> >
> > --
> >
> > In the status quo, we should see three different outcomes of the reported
> > sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
> > of tracepoint sched_switch.  And it's not hard to figure out that the
> > former two are built upon the third one, and the reason why we see this
> > inconsistency is that the former two does not catch up with the internal
> > change of reported task state definitions as the kernel evolves.
> >
> > IMHO, exporting internal representations of task state in the tracepoint
> > sched_switch is not a good practice and not encouraged at all, which can
> > easily break userspace tools that relies on it. Especially when tracepoints
> > are massively used in many observability tools nowadays due to its stable
> > nature, which makes them no longer used for debug only purpose and we
> > should be careful to decide what ought to be reported to userspace and what
> > ought not.
> >
> > Therefore, to fix the issues mentioned above for good, instead of choosing
> > I proposed to add a new variable to report task state in sched_switch with
> > a symbolic character along with the old hardcoded value, and save the
> > further processing of userspace tools and spare them from knowing
> > implementation details in the kernel.
> >
> > After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
> > a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint only.
>
> This series looks good to me. Putting the flag in the trace record is
> a good idea :)
>
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> for this series.
>
> Thank you,
>
> >
> > Reviews welcome!
> >
> > Regards,
> >
> > Ze
> >
> > Ze Gao (2):
> >   sched, tracing: add to report task state in symbolic chars
> >   perf sched: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  include/trace/events/sched.h | 60 +++++++++++++++++++++---------------
> >  tools/perf/builtin-sched.c   | 57 ++++++----------------------------
> >  2 files changed, 45 insertions(+), 72 deletions(-)
> >
> > Ze Gao (1):
> >   libtraceevent: use the new prev_state_char instead in tracepoint
> >     sched_switch
> >
> >  plugins/plugin_sched_switch.c | 29 ++++-------------------------
> >  1 file changed, 4 insertions(+), 25 deletions(-)
> >
> > --
> > 2.40.1
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars
  2023-07-26 12:16 ` [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars Ze Gao
@ 2023-07-31  9:36   ` Peter Zijlstra
  2023-07-31 10:17     ` Ze Gao
  2023-07-31 11:00     ` Ze Gao
  2023-07-31 15:38   ` Steven Rostedt
  1 sibling, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2023-07-31  9:36 UTC (permalink / raw)
  To: Ze Gao
  Cc: Steven Rostedt, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

On Wed, Jul 26, 2023 at 08:16:16PM +0800, Ze Gao wrote:
> Internal representations of task state are likely to be changed or
> ordered, and reporting them to userspace without exporting them as
> part of API is not a good choice, which can easily break a userspace
> observability tool as kernel evolves. For example, perf suffers from
> this and still reports wrong states by this patch.
> 
> OTOH, some masqueraded state like TASK_REPORT_IDLE and TASK_REPORT_MAX
> are also reported inadvertently, which confuses things even more.
> 
> So add a new variable in company with the old raw value to report task
> state in symbolic char, which is self-explaining and no further
> translation is needed, and also report priorities in 'short' to save
> some buffer space.  Of course this does not break any userspace tool.
> 
> Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> conventions for the rest.

So I really dont much like this. This looses the ability to see the
actual wait state flags, there could be multiple. Eg, things like
TASK_FREEZEABLE gets lost completely.

And this is on top of my reluctance to touch any of this at all, for
fear of further regressions.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars
  2023-07-31  9:36   ` Peter Zijlstra
@ 2023-07-31 10:17     ` Ze Gao
  2023-07-31 11:00     ` Ze Gao
  1 sibling, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-31 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

On Mon, Jul 31, 2023 at 5:37 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jul 26, 2023 at 08:16:16PM +0800, Ze Gao wrote:
> > Internal representations of task state are likely to be changed or
> > ordered, and reporting them to userspace without exporting them as
> > part of API is not a good choice, which can easily break a userspace
> > observability tool as kernel evolves. For example, perf suffers from
> > this and still reports wrong states by this patch.
> >
> > OTOH, some masqueraded state like TASK_REPORT_IDLE and TASK_REPORT_MAX
> > are also reported inadvertently, which confuses things even more.
> >
> > So add a new variable in company with the old raw value to report task
> > state in symbolic char, which is self-explaining and no further
> > translation is needed, and also report priorities in 'short' to save
> > some buffer space.  Of course this does not break any userspace tool.
> >
> > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> > conventions for the rest.
>
> So I really dont much like this. This looses the ability to see the
> actual wait state flags, there could be multiple. Eg, things like
> TASK_FREEZEABLE gets lost completely.

How about adding a new char for each distinct state you want to report
, e.g., 'F' for this for debug purposes ? I don;t think we can lose the
multiple flags just because we choose to report in chars. Still people
can use the old raw value since we do not replace but supplement it here.

Regards,
Ze

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars
  2023-07-31  9:36   ` Peter Zijlstra
  2023-07-31 10:17     ` Ze Gao
@ 2023-07-31 11:00     ` Ze Gao
  1 sibling, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-07-31 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

On Mon, Jul 31, 2023 at 5:37 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jul 26, 2023 at 08:16:16PM +0800, Ze Gao wrote:
> > Internal representations of task state are likely to be changed or
> > ordered, and reporting them to userspace without exporting them as
> > part of API is not a good choice, which can easily break a userspace
> > observability tool as kernel evolves. For example, perf suffers from
> > this and still reports wrong states by this patch.
> >
> > OTOH, some masqueraded state like TASK_REPORT_IDLE and TASK_REPORT_MAX
> > are also reported inadvertently, which confuses things even more.
> >
> > So add a new variable in company with the old raw value to report task
> > state in symbolic char, which is self-explaining and no further
> > translation is needed, and also report priorities in 'short' to save
> > some buffer space.  Of course this does not break any userspace tool.
> >
> > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> > conventions for the rest.
>
> So I really dont much like this. This looses the ability to see the
> actual wait state flags, there could be multiple. Eg, things like
> TASK_FREEZEABLE gets lost completely.

Also, IIRC, TASK_FREEZABLE which is defined as 0x2000, is already lost
in the current implementation of __trace_sched_switch_state which limits
all states except PREEMPT_ACTIIVE below TASK_REPORT_IDLE to be
reported. So I do not believe you can achieve this by just leaving things alone.

Regards,
Ze

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars
  2023-07-26 12:16 ` [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars Ze Gao
  2023-07-31  9:36   ` Peter Zijlstra
@ 2023-07-31 15:38   ` Steven Rostedt
  2023-08-01  1:35     ` Ze Gao
  1 sibling, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2023-07-31 15:38 UTC (permalink / raw)
  To: Ze Gao
  Cc: Peter Zijlstra, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

On Wed, 26 Jul 2023 20:16:16 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> @@ -231,41 +253,29 @@ TRACE_EVENT(sched_switch,
>  	TP_STRUCT__entry(
>  		__array(	char,	prev_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	prev_pid			)
> -		__field(	int,	prev_prio			)
> -		__field(	long,	prev_state			)
> +		__field(	short,	prev_prio			)
> +		__field(	int,	prev_state			)
> +		__field(	char,	prev_state_char			)
>  		__array(	char,	next_comm,	TASK_COMM_LEN	)
>  		__field(	pid_t,	next_pid			)
> -		__field(	int,	next_prio			)
> +		__field(	short,	next_prio			)
>  	),

The above adds a bunch of holes. This needs to be reordered to condense the
event, we don't want to increase it. libtraceevent will handle reordering.

The above produces:

struct {
	char	prev_comm[16];
	pid_t	prev_pid;
	short	prev_prio; <-- 2 character padding
	int	prev_state;
	char	prev_state_char;
	char	next_comm[16]; <- 3 character padding
	pid_t	next_pid;
	short	next_prio; <- 2 char padding
};

(all events are at least 4 byte aligned, and are multiple of 4 bytes in
size, thus that last short of next_prio did nothing)

The above is a total of 56 bytes (note, that is the same as the current
sched_switch event size);

What the above should be:

	TP_STRUCT__entry(
		__field(	pid_t,	prev_pid			)
		__field(	pid_t,	next_pid			)
		__field(	short,	prev_prio			)
		__field(	short,	next_prio			)
		__field(	int,	prev_state			)
		__array(	char,	prev_comm,	TASK_COMM_LEN	)
		__array(	char,	next_comm,	TASK_COMM_LEN	)
		__field(	char,	prev_state_char			)
	),


Which would be:

struct {
	pid_t	prev_pid;
	pid_t	next_pid;
	short	prev_prio;
	short	next_prio;
	int	prev_state;
	char	prev_comm[16];
	char	next_comm[16];
	char	prev_stat_char; <-- 3 characters of padding
}

which would be 52 byte. Saving us 4 bytes per event. Which is a big deal!

-- Steve


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 3/3] libtraceevent: use the new prev_state_char instead in tracepoint sched_switch
  2023-07-26 12:16 ` [RFC PATCH v2 3/3] libtraceevent: " Ze Gao
@ 2023-07-31 15:40   ` Steven Rostedt
  2023-08-01  1:36     ` Ze Gao
  0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2023-07-31 15:40 UTC (permalink / raw)
  To: Ze Gao
  Cc: Peter Zijlstra, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

On Wed, 26 Jul 2023 20:16:18 +0800
Ze Gao <zegao2021@gmail.com> wrote:

> Since the sched_switch tracepoint introduces a new variable to
> report sched-out task state in symbolic char, we switch to use
> that instead to spare from knowing internal implementations
> in kernel.

This needs to be changed to check if the "prev_state_char" field exists,
and if not, then it must use the old method. Same for perf. The tools must
work with older kernels.

-- Steve

> 
> Signed-off-by: Ze Gao <zegao@tencent.com>
> ---
>  plugins/plugin_sched_switch.c | 29 ++++-------------------------
>  1 file changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
> index 8752cae..5dbdc28 100644
> --- a/plugins/plugin_sched_switch.c
> +++ b/plugins/plugin_sched_switch.c
> @@ -9,27 +9,6 @@
>  #include "event-parse.h"
>  #include "trace-seq.h"
>  
> -static void write_state(struct trace_seq *s, int val)
> -{
> -	const char states[] = "SDTtZXxW";
> -	int found = 0;
> -	int i;
> -
> -	for (i = 0; i < (sizeof(states) - 1); i++) {
> -		if (!(val & (1 << i)))
> -			continue;
> -
> -		if (found)
> -			trace_seq_putc(s, '|');
> -
> -		found = 1;
> -		trace_seq_putc(s, states[i]);
> -	}
> -
> -	if (!found)
> -		trace_seq_putc(s, 'R');
> -}
> -
>  static void write_and_save_comm(struct tep_format_field *field,
>  				struct tep_record *record,
>  				struct trace_seq *s, int pid)
> @@ -97,10 +76,10 @@ static int sched_switch_handler(struct trace_seq *s,
>  	trace_seq_printf(s, "%lld ", val);
>  
>  	if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) ==
> 0)
> -		trace_seq_printf(s, "[%d] ", (int) val);
> +		trace_seq_printf(s, "[%d] ", (short) val);
>  
> -	if (tep_get_field_val(s,  event, "prev_state", record, &val, 1)
> == 0)
> -		write_state(s, val);
> +	if (tep_get_field_val(s,  event, "prev_state_char", record,
> &val, 1) == 0)
> +		trace_seq_putc(s, (char) val);
>  
>  	trace_seq_puts(s, " ==> ");
>  
> @@ -115,7 +94,7 @@ static int sched_switch_handler(struct trace_seq *s,
>  	trace_seq_printf(s, "%lld", val);
>  
>  	if (tep_get_field_val(s, event, "next_prio", record, &val, 1) ==
> 0)
> -		trace_seq_printf(s, " [%d]", (int) val);
> +		trace_seq_printf(s, " [%d]", (short) val);
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars
  2023-07-31 15:38   ` Steven Rostedt
@ 2023-08-01  1:35     ` Ze Gao
  0 siblings, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-08-01  1:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

Thanks Steven,

Indeed this did not consider the struct alignment. I left the field order intact
because I was not sure whether the order here matters for reporting or parsing.
Now I will fix this and send a v3, and again thanks for pointing it out.

Thanks,
Ze

On Mon, Jul 31, 2023 at 11:38 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 26 Jul 2023 20:16:16 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > @@ -231,41 +253,29 @@ TRACE_EVENT(sched_switch,
> >       TP_STRUCT__entry(
> >               __array(        char,   prev_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  prev_pid                        )
> > -             __field(        int,    prev_prio                       )
> > -             __field(        long,   prev_state                      )
> > +             __field(        short,  prev_prio                       )
> > +             __field(        int,    prev_state                      )
> > +             __field(        char,   prev_state_char                 )
> >               __array(        char,   next_comm,      TASK_COMM_LEN   )
> >               __field(        pid_t,  next_pid                        )
> > -             __field(        int,    next_prio                       )
> > +             __field(        short,  next_prio                       )
> >       ),
>
> The above adds a bunch of holes. This needs to be reordered to condense the
> event, we don't want to increase it. libtraceevent will handle reordering.
>
> The above produces:
>
> struct {
>         char    prev_comm[16];
>         pid_t   prev_pid;
>         short   prev_prio; <-- 2 character padding
>         int     prev_state;
>         char    prev_state_char;
>         char    next_comm[16]; <- 3 character padding
>         pid_t   next_pid;
>         short   next_prio; <- 2 char padding
> };
>
> (all events are at least 4 byte aligned, and are multiple of 4 bytes in
> size, thus that last short of next_prio did nothing)
>
> The above is a total of 56 bytes (note, that is the same as the current
> sched_switch event size);
>
> What the above should be:
>
>         TP_STRUCT__entry(
>                 __field(        pid_t,  prev_pid                        )
>                 __field(        pid_t,  next_pid                        )
>                 __field(        short,  prev_prio                       )
>                 __field(        short,  next_prio                       )
>                 __field(        int,    prev_state                      )
>                 __array(        char,   prev_comm,      TASK_COMM_LEN   )
>                 __array(        char,   next_comm,      TASK_COMM_LEN   )
>                 __field(        char,   prev_state_char                 )
>         ),
>
>
> Which would be:
>
> struct {
>         pid_t   prev_pid;
>         pid_t   next_pid;
>         short   prev_prio;
>         short   next_prio;
>         int     prev_state;
>         char    prev_comm[16];
>         char    next_comm[16];
>         char    prev_stat_char; <-- 3 characters of padding
> }
>
> which would be 52 byte. Saving us 4 bytes per event. Which is a big deal!
>
> -- Steve
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 3/3] libtraceevent: use the new prev_state_char instead in tracepoint sched_switch
  2023-07-31 15:40   ` Steven Rostedt
@ 2023-08-01  1:36     ` Ze Gao
  0 siblings, 0 replies; 15+ messages in thread
From: Ze Gao @ 2023-08-01  1:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Namhyung Kim, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Mark Rutland, Masami Hiramatsu, linux-kernel, linux-perf-users,
	linux-trace-kernel, linux-trace-devel, Ze Gao

Roger that!

Thanks,
Ze

On Mon, Jul 31, 2023 at 11:40 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 26 Jul 2023 20:16:18 +0800
> Ze Gao <zegao2021@gmail.com> wrote:
>
> > Since the sched_switch tracepoint introduces a new variable to
> > report sched-out task state in symbolic char, we switch to use
> > that instead to spare from knowing internal implementations
> > in kernel.
>
> This needs to be changed to check if the "prev_state_char" field exists,
> and if not, then it must use the old method. Same for perf. The tools must
> work with older kernels.
>
> -- Steve
>
> >
> > Signed-off-by: Ze Gao <zegao@tencent.com>
> > ---
> >  plugins/plugin_sched_switch.c | 29 ++++-------------------------
> >  1 file changed, 4 insertions(+), 25 deletions(-)
> >
> > diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
> > index 8752cae..5dbdc28 100644
> > --- a/plugins/plugin_sched_switch.c
> > +++ b/plugins/plugin_sched_switch.c
> > @@ -9,27 +9,6 @@
> >  #include "event-parse.h"
> >  #include "trace-seq.h"
> >
> > -static void write_state(struct trace_seq *s, int val)
> > -{
> > -     const char states[] = "SDTtZXxW";
> > -     int found = 0;
> > -     int i;
> > -
> > -     for (i = 0; i < (sizeof(states) - 1); i++) {
> > -             if (!(val & (1 << i)))
> > -                     continue;
> > -
> > -             if (found)
> > -                     trace_seq_putc(s, '|');
> > -
> > -             found = 1;
> > -             trace_seq_putc(s, states[i]);
> > -     }
> > -
> > -     if (!found)
> > -             trace_seq_putc(s, 'R');
> > -}
> > -
> >  static void write_and_save_comm(struct tep_format_field *field,
> >                               struct tep_record *record,
> >                               struct trace_seq *s, int pid)
> > @@ -97,10 +76,10 @@ static int sched_switch_handler(struct trace_seq *s,
> >       trace_seq_printf(s, "%lld ", val);
> >
> >       if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) ==
> > 0)
> > -             trace_seq_printf(s, "[%d] ", (int) val);
> > +             trace_seq_printf(s, "[%d] ", (short) val);
> >
> > -     if (tep_get_field_val(s,  event, "prev_state", record, &val, 1)
> > == 0)
> > -             write_state(s, val);
> > +     if (tep_get_field_val(s,  event, "prev_state_char", record,
> > &val, 1) == 0)
> > +             trace_seq_putc(s, (char) val);
> >
> >       trace_seq_puts(s, " ==> ");
> >
> > @@ -115,7 +94,7 @@ static int sched_switch_handler(struct trace_seq *s,
> >       trace_seq_printf(s, "%lld", val);
> >
> >       if (tep_get_field_val(s, event, "next_prio", record, &val, 1) ==
> > 0)
> > -             trace_seq_printf(s, " [%d]", (int) val);
> > +             trace_seq_printf(s, " [%d]", (short) val);
> >
> >       return 0;
> >  }
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-08-01  1:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-26 12:16 [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ze Gao
2023-07-26 12:16 ` [RFC PATCH v2 1/3] sched, tracing: add to report task state in symbolic chars Ze Gao
2023-07-31  9:36   ` Peter Zijlstra
2023-07-31 10:17     ` Ze Gao
2023-07-31 11:00     ` Ze Gao
2023-07-31 15:38   ` Steven Rostedt
2023-08-01  1:35     ` Ze Gao
2023-07-26 12:16 ` [RFC PATCH v2 2/3] perf sched: use the new prev_state_char instead in tracepoint sched_switch Ze Gao
2023-07-26 12:16 ` [RFC PATCH v2 3/3] libtraceevent: " Ze Gao
2023-07-31 15:40   ` Steven Rostedt
2023-08-01  1:36     ` Ze Gao
2023-07-28 17:29 ` [RFC PATCH v2 0/3] report task state in symbolic chars from sched tracepoint Ian Rogers
2023-07-31  2:09   ` Ze Gao
2023-07-31  3:52 ` Masami Hiramatsu
2023-07-31  6:19   ` Ze Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).