From: Peter Zijlstra <peterz@infradead.org>
To: pawel.moll@arm.com, adrian.hunter@intel.com,
	john.stultz@linaro.org, mingo@kernel.org, eranian@google.com
Cc: linux-kernel@vger.kernel.org, acme@kernel.org, dsahern@gmail.com,
	fweisbec@gmail.com, jolsa@redhat.com, namhyung@gmail.com,
	paulus@samba.org, tglx@linutronix.de, rostedt@goodmis.org,
	sonnyrao@chromium.org, ak@linux.intel.com,
	vincent.weaver@maine.edu, "Peter Zijlstra" <peterz@infradead.org>
Subject: [RFC][PATCH 2/2] perf: Add per event clockid support
Date: Fri, 20 Feb 2015 15:29:32 +0100
Message-ID: <20150220143754.852733868@infradead.org>
In-Reply-To: <20150220142930.013968488@infradead.org>

While thinking about the whole clock discussion it occurred to me that
we have two distinct uses of time:

 1) the tracking of event/ctx/cgroup enabled/running/stopped times
    which includes the self-monitoring support in struct
    perf_event_mmap_page.

 2) the actual timestamps visible in the data records.

And we've been conflating them.

The first is all about tracking time deltas; nobody should really care
in what time base that happens. It's all relative information, and as
long as it's internally consistent it works.

The second, however, is what people worry about when they have to
merge their data with external sources. And here we have the
discussion on MONOTONIC vs MONOTONIC_RAW etc.

Where MONOTONIC is good for correlating between machines (static
offset), MONOTONIC_RAW is required for correlating against a fixed rate
hardware clock.

This means configurability; now 1) makes that hard because it needs to
be internally consistent across groups of unrelated events; which is
why we had to have a global perf_clock().

However, for 2) it doesn't really matter, perf itself doesn't care
what it writes into the buffer.

The below patch makes the distinction between these two cases by
adding perf_event_clock() which is used for the second case. It
further makes this configurable on a per-event basis, but adds a few
sanity checks such that we cannot combine events with different clocks
in confusing ways.

And since we then have per-event configurability we might as well
retain the 'legacy' behaviour as a default.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
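
A minimal userspace sketch of how the new attr fields might be used,
assuming a <linux/perf_event.h> that carries the use_clockid bit and
the clockid field from this patch. The helper names
(init_clockid_attr, open_clockid_event) are made up for illustration,
and the field layout could still change before anything is merged:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe a software cpu-clock sampling event
 * whose record timestamps should be taken from @clock.
 */
static void init_clockid_attr(struct perf_event_attr *attr, clockid_t clock)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_CPU_CLOCK;
	attr->sample_period = 1000000;
	attr->sample_type = PERF_SAMPLE_TIME;
	attr->disabled = 1;

	/* The two fields this patch adds. */
	attr->use_clockid = 1;
	attr->clockid = clock;
}

/*
 * Hypothetical helper: open the event on the calling thread, any CPU,
 * no group. Returns an fd, or -1 with errno set.
 */
static int open_clockid_event(clockid_t clock)
{
	struct perf_event_attr attr;

	init_clockid_attr(&attr, clock);
	return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
}
```

Going by the checks in core.c below, events sharing a group or an
output buffer would all need the same clock, and a PMU without
PERF_PMU_CAP_NO_NMI asking for anything other than CLOCK_MONOTONIC or
CLOCK_MONOTONIC_RAW should get -EINVAL.
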
 arch/x86/kernel/cpu/perf_event.c |   14 ++++++-
 include/linux/perf_event.h       |    2 +
 include/linux/timekeeping.h      |    5 ++
 include/uapi/linux/perf_event.h  |    6 +--
 kernel/events/core.c             |   74 +++++++++++++++++++++++++++++++++++++--
 5 files changed, 93 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1968,13 +1968,23 @@ void arch_perf_update_userpage(struct pe
 
 	data = cyc2ns_read_begin();
 
+	/*
+	 * Internal timekeeping for enabled/running/stopped times
+	 * is always in the local_clock domain.
+	 */
 	userpg->cap_user_time = 1;
 	userpg->time_mult = data->cyc2ns_mul;
 	userpg->time_shift = data->cyc2ns_shift;
 	userpg->time_offset = data->cyc2ns_offset - now;
 
-	userpg->cap_user_time_zero = 1;
-	userpg->time_zero = data->cyc2ns_offset;
+	/*
+	 * cap_user_time_zero doesn't make sense when we're using a different
+	 * time base for the records.
+	 */
+	if (event->clock == &local_clock) {
+		userpg->cap_user_time_zero = 1;
+		userpg->time_zero = data->cyc2ns_offset;
+	}
 
 	cyc2ns_read_end(data);
 }
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -166,6 +166,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_NO_NMI			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
@@ -438,6 +439,7 @@ struct perf_event {
 	struct pid_namespace		*ns;
 	u64				id;
 
+	u64				(*clock)(void);
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
 
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -214,6 +214,11 @@ static inline u64 ktime_get_boot_ns(void
 	return ktime_to_ns(ktime_get_boottime());
 }
 
+static inline u64 ktime_get_tai_ns(void)
+{
+	return ktime_to_ns(ktime_get_clocktai());
+}
+
 static inline u64 ktime_get_raw_ns(void)
 {
 	return ktime_to_ns(ktime_get_raw());
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -305,7 +305,8 @@ struct perf_event_attr {
 				exclude_callchain_user   : 1, /* exclude user callchains */
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
-				__reserved_1   : 39;
+				use_clockid    :  1, /* use @clockid for time fields */
+				__reserved_1   : 38;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -334,8 +335,7 @@ struct perf_event_attr {
 	 */
 	__u32	sample_stack_user;
 
-	/* Align to u64. */
-	__u32	__reserved_2;
+	__u32	clockid;
 	/*
 	 * Defines set of regs to dump for each sample
 	 * state captured on:
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -327,6 +327,11 @@ static inline u64 perf_clock(void)
 	return local_clock();
 }
 
+static inline u64 perf_event_clock(struct perf_event *event)
+{
+	return event->clock();
+}
+
 static inline struct perf_cpu_context *
 __get_cpu_context(struct perf_event_context *ctx)
 {
@@ -4756,7 +4761,7 @@ static void __perf_event_header__init_id
 	}
 
 	if (sample_type & PERF_SAMPLE_TIME)
-		data->time = perf_clock();
+		data->time = perf_event_clock(event);
 
 	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER))
 		data->id = primary_event_id(event);
@@ -5334,6 +5339,8 @@ static void perf_event_task_output(struc
 	task_event->event_id.tid = perf_event_tid(event, task);
 	task_event->event_id.ptid = perf_event_tid(event, current);
 
+	task_event->event_id.time = perf_event_clock(event);
+
 	perf_output_put(&handle, task_event->event_id);
 
 	perf_event__output_id_sample(event, &handle, &sample);
@@ -5367,7 +5374,7 @@ static void perf_event_task(struct task_
 			/* .ppid */
 			/* .tid  */
 			/* .ptid */
-			.time = perf_clock(),
+			/* .time */
 		},
 	};
 
@@ -5743,7 +5750,7 @@ static void perf_log_throttle(struct per
 			.misc = 0,
 			.size = sizeof(throttle_event),
 		},
-		.time		= perf_clock(),
+		.time		= perf_event_clock(event),
 		.id		= primary_event_id(event),
 		.stream_id	= event->id,
 	};
@@ -6286,6 +6293,8 @@ static int perf_swevent_init(struct perf
 static struct pmu perf_swevent = {
 	.task_ctx_nr	= perf_sw_context,
 
+	.capabilities	= PERF_PMU_CAP_NO_NMI,
+
 	.event_init	= perf_swevent_init,
 	.add		= perf_swevent_add,
 	.del		= perf_swevent_del,
@@ -6403,6 +6412,8 @@ static int perf_tp_event_init(struct per
 static struct pmu perf_tracepoint = {
 	.task_ctx_nr	= perf_sw_context,
 
+	.capabilities	= PERF_PMU_CAP_NO_NMI,
+
 	.event_init	= perf_tp_event_init,
 	.add		= perf_trace_add,
 	.del		= perf_trace_del,
@@ -6628,6 +6639,8 @@ static int cpu_clock_event_init(struct p
 static struct pmu perf_cpu_clock = {
 	.task_ctx_nr	= perf_sw_context,
 
+	.capabilities	= PERF_PMU_CAP_NO_NMI,
+
 	.event_init	= cpu_clock_event_init,
 	.add		= cpu_clock_event_add,
 	.del		= cpu_clock_event_del,
@@ -6706,6 +6719,8 @@ static int task_clock_event_init(struct
 static struct pmu perf_task_clock = {
 	.task_ctx_nr	= perf_sw_context,
 
+	.capabilities	= PERF_PMU_CAP_NO_NMI,
+
 	.event_init	= task_clock_event_init,
 	.add		= task_clock_event_add,
 	.del		= task_clock_event_del,
@@ -7188,6 +7203,10 @@ perf_event_alloc(struct perf_event_attr
 #endif
 	}
 
+	event->clock = &local_clock;
+	if (parent_event)
+		event->clock = parent_event->clock;
+
 	if (!overflow_handler && parent_event) {
 		overflow_handler = parent_event->overflow_handler;
 		context = parent_event->overflow_handler_context;
@@ -7399,6 +7418,12 @@ perf_event_set_output(struct perf_event
 	if (output_event->cpu == -1 && output_event->ctx != event->ctx)
 		goto out;
 
+	/*
+	 * Mixing clocks in the same buffer is trouble you don't need.
+	 */
+	if (output_event->clock != event->clock)
+		goto out;
+
 set:
 	mutex_lock(&event->mmap_mutex);
 	/* Can't redirect output if we've got an active mmap() */
@@ -7550,6 +7575,44 @@ SYSCALL_DEFINE5(perf_event_open,
 	 */
 	pmu = event->pmu;
 
+	if (attr.use_clockid) {
+		switch (attr.clockid) {
+		case CLOCK_MONOTONIC:
+			event->clock = &ktime_get_mono_fast_ns;
+			goto clock_set;
+
+		case CLOCK_MONOTONIC_RAW:
+			event->clock = &ktime_get_mono_raw_fast_ns;
+			goto clock_set;
+
+		default:
+			if (!(pmu->capabilities & PERF_PMU_CAP_NO_NMI)) {
+				err = -EINVAL;
+				goto err_alloc;
+			}
+		}
+
+		switch (attr.clockid) {
+		case CLOCK_REALTIME:
+			event->clock = &ktime_get_real_ns;
+			goto clock_set;
+
+		case CLOCK_BOOTTIME:
+			event->clock = &ktime_get_boot_ns;
+			goto clock_set;
+
+		case CLOCK_TAI:
+			event->clock = &ktime_get_tai_ns;
+			goto clock_set;
+
+		default:
+			/* XXX add: clock_id_valid() && clock_gettime_ns() ? */
+			err = -EINVAL;
+			goto err_alloc;
+		}
+	}
+clock_set:
+
 	if (group_leader &&
 	    (is_software_event(event) != is_software_event(group_leader))) {
 		if (is_software_event(event)) {
@@ -7599,6 +7662,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		 */
 		if (group_leader->group_leader != group_leader)
 			goto err_context;
+
+		/* All events in a group should have the same clock */
+		if (group_leader->clock != event->clock)
+			goto err_context;
+
 		/*
 		 * Do not allow to attach to a group in a different
 		 * task or CPU context:


