Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Peter Zijlstra <peterz@infradead.org>
To: Jason Vas Dias <jason.vas.dias@gmail.com>
Cc: x86@kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>, andi <andi@firstfloor.org>
Subject: Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
Date: Wed, 14 Mar 2018 10:45:19 +0100	[thread overview]
Message-ID: <20180314094519.GD4082@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <CALyZvKxKXWiFF9OYtykXEfnx-k+xaAM7XREEQ+N58oJtOTZhew@mail.gmail.com>

On Tue, Mar 13, 2018 at 11:45:45PM +0000, Jason Vas Dias wrote:
> On 12/03/2018, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote:
> >>   Sometimes, particularly when correlating elapsed time to performance
> >>   counter values,
> >
> > So what actual problem are you tring to solve here? Perf can already
> > give you sample time in various clocks, including MONOTONIC_RAW.
> >
> >
> 
> Yes, I am sampling perf counters,

You're not in fact sampling, you're just reading the counters.

> including CPU_CYCLES , INSTRUCTIONS,
> CPU_CLOCK, TASK_CLOCK, etc, in a Group FD I open with
> perf_event_open() , for the current thread on the current CPU -
> I am doing this for 4 threads , on Intel & ARM cpus.
> 
> Reading performance counters does involve  2 ioctls and a read() ,
> which takes time that  already far exceeds the time required to read
> the TSC or CNTPCT in the VDSO .

So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and
just let them run and do:

	read(group_fd, &buf_pre, size);
	/* your code section */
	read(group_fd, &buf_post, size);

	/* compute buf_post - buf_pre */

Which is only 2 system calls, not 4.

Also, a while back there was the proposal to extend the mmap()
self-monitoring interface to groups, see:

 https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowokjz@hirez.programming.kicks-ass.net

I never did get around to writing the actual code for it, but it
shouldn't be too hard.

> The CPU_CLOCK software counter should give the converted TSC cycles
> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...)
> and the  ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the
> difference between the event->time_running and time_enabled
> should also measure elapsed time .

While CPU_CLOCK is TSC based, there is no guarantee it has any
correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based).

(although, I think I might have fixed that recently and it might just
work, but it's very much not guaranteed).

If you want to correlate to CLOCK_MONOTONIC_RAW you have to read
CLOCK_MONOTONIC_RAW and not some random other clock value.

> This gives the "inner" elapsed time, from the perpective of the kernel,
> while the measured code section had the counters enabled.
> 
> But unless the user-space program  also has a way of measuring elapsed
> time from the CPU's perspective , ie. without being subject to
> operator or NTP / PTP adjustment, it has no way of correlating this
> inner elapsed time with any "outer"

You could read the time using the group_fd's mmap() page. That actually
includes the TSC mult,shift,offset as used by perf clocks.

> Currently, users must parse the log file or use gdb / objdump to
> inspect /proc/kcore to get the TSC calibration and exact
> mult+shift values for the TSC value conversion.

Which ;-) there's multiple floating around..

> Intel does not publish, nor does the CPU come with in ROM or firmware,
> the actual precise TSC frequency - this must be calibrated against the
> other clocks , according to a complicated procedure in section 18.2 of
> the SDM . My TSC has a "rated" / nominal TSC frequency , which one
> can compute from CPUID leaves, of 2.3ghz, but the "Refined TSC frequency"
> is 2.8333ghz .

You might want to look at commit:

  b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon")

There is no such thing as a precise TSC frequency, there's a reason we
have NTP/PTP.

> Hence I think Linux should export this calibrated frequency somehow ;
> its "calibration" is expressed as the raw clocksource 'mult' and 'shift'
> values, and is exported to the VDSO .
> 
> I think the VDSO should read the TSC and use the calibration
> to render the raw, unadjusted time from the CPU's perspective.
> 
> Hence, the patch I am preparing , which is again attached.

I have no objection to adding CLOCK_MONOTONIC_RAW support to the VDSO,
but you seem to be rather confused on how things work.

Now, if you wanted to actually have CLOCK_MONOTONIC_RAW times from perf
you'd need something like the below patch.

You'd need to create your events with:

	attr.use_clockid = 1;
	attr.clockid = CLOCK_MONOTONIC_RAW;
	attr.read_format |= PERF_FORMAT_TIME;

But whatever you do, you really have to stop mixing clocks, that's
broken, even if it magically works for now.

---
 include/uapi/linux/perf_event.h |  5 ++++-
 kernel/events/core.c            | 23 ++++++++++++++++++++---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 912b85b52344..e210c9a97f2b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -271,9 +271,11 @@ enum {
  *	  { u64		time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
  *	  { u64		time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *	  { u64		id;           } && PERF_FORMAT_ID
+ *	  { u64		time;         } && PERF_FORMAT_TIME
  *	} && !PERF_FORMAT_GROUP
  *
  *	{ u64		nr;
+ *	  { u64         time;         } && PERF_FORMAT_TIME
  *	  { u64		time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
  *	  { u64		time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
  *	  { u64		value;
@@ -287,8 +289,9 @@ enum perf_event_read_format {
 	PERF_FORMAT_TOTAL_TIME_RUNNING		= 1U << 1,
 	PERF_FORMAT_ID				= 1U << 2,
 	PERF_FORMAT_GROUP			= 1U << 3,
+	PERF_FORMAT_TIME			= 1U << 4,
 
-	PERF_FORMAT_MAX = 1U << 4,		/* non-ABI */
+	PERF_FORMAT_MAX = 1U << 5,		/* non-ABI */
 };
 
 #define PERF_ATTR_SIZE_VER0	64	/* sizeof first published struct */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c87decf03757..4298b4a39bc0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1707,6 +1707,9 @@ static void __perf_event_read_size(struct perf_event *event, int nr_siblings)
 		size += sizeof(u64);
 	}
 
+	if (event->attr.read_format & PERF_FORMAT_TIME)
+		size += sizeof(u64);
+
 	size += entry * nr;
 	event->read_size = size;
 }
@@ -4685,6 +4688,9 @@ static int __perf_read_group_add(struct perf_event *leader,
 	int n = 1; /* skip @nr */
 	int ret;
 
+	if (read_format & PERF_FORMAT_TIME)
+		n++; /* skip @time */
+
 	ret = perf_event_read(leader, true);
 	if (ret)
 		return ret;
@@ -4739,6 +4745,9 @@ static int perf_read_group(struct perf_event *event,
 
 	values[0] = 1 + leader->nr_siblings;
 
+	if (read_format & PERF_FORMAT_TIME)
+		values[1] = perf_event_clock(event);
+
 	/*
 	 * By locking the child_mutex of the leader we effectively
 	 * lock the child list of all siblings.. XXX explain how.
@@ -4773,7 +4782,7 @@ static int perf_read_one(struct perf_event *event,
 				 u64 read_format, char __user *buf)
 {
 	u64 enabled, running;
-	u64 values[4];
+	u64 values[5];
 	int n = 0;
 
 	values[n++] = __perf_event_read_value(event, &enabled, &running);
@@ -4783,6 +4792,8 @@ static int perf_read_one(struct perf_event *event,
 		values[n++] = running;
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(event);
+	if (read_format & PERF_FORMAT_TIME)
+		values[n++] = perf_event_clock(event)
 
 	if (copy_to_user(buf, values, n * sizeof(u64)))
 		return -EFAULT;
@@ -6034,7 +6045,7 @@ static void perf_output_read_one(struct perf_output_handle *handle,
 				 u64 enabled, u64 running)
 {
 	u64 read_format = event->attr.read_format;
-	u64 values[4];
+	u64 values[5];
 	int n = 0;
 
 	values[n++] = perf_event_count(event);
@@ -6049,6 +6060,9 @@ static void perf_output_read_one(struct perf_output_handle *handle,
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(event);
 
+	if (read_format & PERF_FORMAT_TIME)
+		values[n++] = perf_event_clock(event);
+
 	__output_copy(handle, values, n * sizeof(u64));
 }
 
@@ -6058,11 +6072,14 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 {
 	struct perf_event *leader = event->group_leader, *sub;
 	u64 read_format = event->attr.read_format;
-	u64 values[5];
+	u64 values[6];
 	int n = 0;
 
 	values[n++] = 1 + leader->nr_siblings;
 
+	if (read_format & PERF_FORMAT_TIME)
+		values[n++] = perf_event_clock(event);
+
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
 		values[n++] = enabled;

next prev parent reply	other threads:[~2018-03-14  9:45 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-12  7:01 [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW Jason Vas Dias
2018-03-12  8:27 ` Peter Zijlstra
2018-03-13 23:45   ` Jason Vas Dias
2018-03-14  9:45     ` Peter Zijlstra [this message]
2018-03-14 12:55       ` Jason Vas Dias
2018-03-14 13:11         ` Peter Zijlstra
2018-03-14 13:12         ` Peter Zijlstra
2018-03-14 13:16         ` Peter Zijlstra
2018-03-14 13:29         ` Peter Zijlstra
  -- strict thread matches above, loose matches on Subject: below --
2018-03-12  9:14 Jason Vas Dias
2018-03-12 17:41 ` kbuild test robot
2018-03-12  5:44 Jason Vas Dias
2018-03-12  5:33 Jason Vas Dias

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:912b85b5234 dfblob:e210c9a97f2 dfblob:c87decf0375
dfblob:4298b4a39bc )
 OR (
bs:"Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180314094519.GD4082@hirez.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=andi@firstfloor.org \
    --cc=jason.vas.dias@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox