All of lore.kernel.org
 help / color / mirror / Atom feed
From: Pawel Moll <pawel.moll@arm.com>
To: Richard Cochran <richardcochran@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>,
	Andy Lutomirski <luto@amacapital.net>,
	John Stultz <john.stultz@linaro.org>,
	Stephen Boyd <sboyd@codeaurora.org>,
	Baruch Siach <baruch@tkos.co.il>,
	Thomas Gleixner <tglx@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] sched_clock: Track monotonic raw clock
Date: Fri, 12 Sep 2014 12:45:34 +0100	[thread overview]
Message-ID: <1410522334.16936.42.camel@hornet> (raw)
In-Reply-To: <20140722193939.GA15507@localhost.localdomain>

On Tue, 2014-07-22 at 20:39 +0100, Richard Cochran wrote:
> So the only reasonable way, IMHO, is to simply provide the needed
> information in the traces, and then add some user space code to find
> the relationship between the perf clock and the system clock. The
> great advantage of this approach is that you don't have to create a
> perfect servo from day one. The time stamps can be re-analyzed at any
> time after the fact.

Ok, so I had yet another (already lost count ;-) look into this...

I came up with a conclusion that the simplest way it can be done is to
have a periodically repeated events marked with both "kernel timestamp"
and "user timestamp". And a scheduler switch is such a periodic event.
Moreover, it seems quite natural: a user process is scheduled in (take a
snapshot of sched clock and monotonic raw at this moment, let's call
them t_s_in and t_m_in), the process "generates" the "user timestamp" by
clock_gettime(CLOCK_MONOTONIC_RAW, t_m_x), the process is scheduled out
(again, take sched and monotonic timestamps then, t_s_out and t_m_out).

The, after extracting all the t_?_? from the trace stream, a translation
(we've know the monotonic raw value t_m_x, we're looking for its sched
clock equivalent t_s_x) should simply get down to:

t_s_x = t_s_in + ((t_m_x - t_m_1) * (t_s_out - t_s_in)) / (t_m_out - t_m_in)

I quickly made an experiment, by adding a customised variant of
sched_switch trace event (note: I'm *not* proposing it to be merged)

8<-------------
>From ad772fca06da4774e6dd3a116ee2424a0dd3d80f Mon Sep 17 00:00:00 2001
From: Pawel Moll <pawel.moll@arm.com>
Date: Fri, 22 Aug 2014 15:23:13 +0100
Subject: [PATCH] trace: ched_switch event variant with raw monotonic timestamp

This patch adds a variant of sched_switch trace event,
adding a field for a monotonic raw timestamp.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/trace/events/sched.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  1 +
 2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..1429958 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,6 +7,7 @@
 #include <linux/sched.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
+#include <linux/time.h>
 
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
@@ -150,6 +151,53 @@ TRACE_EVENT(sched_switch,
 );
 
 /*
+ * Tracepoint for task switches, including raw monotonic raw clock
+ * timestamp, to perform user- to kernelspace time correlation
+ */
+TRACE_EVENT(sched_switch_raw_monotonic,
+
+	TP_PROTO(struct task_struct *prev,
+		 struct task_struct *next),
+
+	TP_ARGS(prev, next),
+
+	TP_STRUCT__entry(
+		__field(	u64,	raw_monotonic			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_pid			)
+		__field(	int,	prev_prio			)
+		__field(	long,	prev_state			)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_pid			)
+		__field(	int,	next_prio			)
+	),
+
+	TP_fast_assign(
+		struct timespec now;
+		getrawmonotonic(&now);
+		__entry->raw_monotonic = timespec_to_ns(&now);
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->prev_pid	= prev->pid;
+		__entry->prev_prio	= prev->prio;
+		__entry->prev_state	= __trace_sched_switch_state(prev);
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__entry->next_pid	= next->pid;
+		__entry->next_prio	= next->prio;
+	),
+
+	TP_printk("raw_monotonic=%Lu prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->raw_monotonic, __entry->prev_comm,
+		__entry->prev_pid, __entry->prev_prio,
+		__entry->prev_state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
+		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+);
+
+/*
  * Tracepoint for a task being migrated:
  */
 TRACE_EVENT(sched_migrate_task,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..0cc7a92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2179,6 +2179,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	trace_sched_switch(prev, next);
+	trace_sched_switch_raw_monotonic(prev, next);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
-- 
1.9.1
8<-------------

And than did some experiment, opening a perf session with this event and
executing the following test:

        clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_1);
        write(marker, "test", 4);
        clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_2);

where "marker" is a descriptor for
"/sys/kernel/debug/tracing/trace_marker". In other words, I generated a
unique event which will be timestamped by sched_clock. Then I averaged
t_m_1 and t_m_2 to get t_m_x, translated it into t_s_x with the formula
above and compared it with a timestamp for the marked, grepped from the
"/sys/kernel/debug/tracing/trace" (as it is directly injected into the
ftrace buffer, there is no way of getting it into perf data stream). On
average I was getting results accurate to or below 1 microsecond (that's
the resolution of a timestamp generated by standard ftrace output).

So, the approach works in general, however:

1. I guess adding monotonic raw field to the "vanilla" sched_switch
won't be praised... A clone like the above one doesn't make much sense
either, I think.

2. Even if this was possible (or we had a clone like the above), only
root can request trace events. This is acceptable for system-wide
profiling tools (which will run as root by definition), but can be a
problem for JIT engines, which usually run as unprivileged processes.

3. The accuracy, although not bad and sufficient in a lot of use cases
(eg. latencies when accessing hwmon sensors are big and unpredictable by
their nature, so a microsecond here and there don't matter at all), may
not be good enough for JIT engines - there must be a way of precisely
marking a moment between a new code (and debug information) being
generated and its first instruction executed. "trace_marker" could be
used for this, but as mentioned above, it's "invisible" to perf, and
even if it was, it would be root-only, see p.2.

With all this in mind, I though it would be possible to generalise the
"monotonic timestamping" at the perf interface level and duplicating the
marker idea there... A short RFC series to follow.

Pawel


      reply	other threads:[~2014-09-12 11:53 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-18 17:43 [RFC] sched_clock: Track monotonic raw clock Pawel Moll
2014-07-18 17:51 ` John Stultz
2014-07-18 19:06   ` Peter Zijlstra
2014-07-22 16:17   ` Pawel Moll
2014-07-18 19:13 ` Peter Zijlstra
2014-07-18 19:25   ` John Stultz
2014-07-18 19:34     ` Peter Zijlstra
2014-07-18 19:46       ` John Stultz
2014-07-18 20:22         ` Peter Zijlstra
2014-07-18 22:41           ` Peter Zijlstra
2014-07-22 16:17             ` Pawel Moll
2014-07-22 16:29               ` Peter Zijlstra
2014-07-22 16:17       ` Pawel Moll
2014-07-19  5:02 ` Richard Cochran
2014-07-22 16:17   ` Pawel Moll
2014-07-22 16:34     ` Peter Zijlstra
2014-07-22 16:48       ` Pawel Moll
2014-07-22 19:39     ` Richard Cochran
2014-09-12 11:45       ` Pawel Moll [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1410522334.16936.42.camel@hornet \
    --to=pawel.moll@arm.com \
    --cc=akpm@linux-foundation.org \
    --cc=baruch@tkos.co.il \
    --cc=john.stultz@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=richardcochran@gmail.com \
    --cc=rostedt@goodmis.org \
    --cc=sboyd@codeaurora.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.