public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 0/5] hrtimer group events
@ 2010-08-30 12:13 Matt Fleming
  2010-08-30 12:13 ` [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event() Matt Fleming
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

This series is an attempt to link perf group events with a hrtimer, so
that when the hrtimer fires all counters in the group are sampled. This
functionality is needed because some performance counters cannot
generate any form of interrupt, such as the counters on SH or the power
consumption counters on x86.

Peter, I'm sending this as an RFC because I'm not convinced I've got
this patch series completely right. I've run it a few times on SH and
the results seem sensible, e.g. monitoring L1-dcache-load-misses has
flush_dcache_all() near the top of the list, but I just wanted to get
this out so that people could review it early before I go any
further.

One thing worth noting is that I've essentially hijacked group events so
that they only work for the hrtimer case. That's probably not right, in
which case we'll need some more flags to distinguish hrtimer-backed
groups from non-hrtimer-backed groups.

Matt Fleming (5):
  perf: Check if we should exclude idle thread in perf_exclude_event()
  perf: Turn the group counter values into delta values
  perf: Add hrtimer code for PMI-less hardware counters
  sh: Add support for sampling counters
  perf: Add support for PERF_SAMPLE_READ samples

 arch/sh/kernel/perf_event.c |    3 +-
 include/linux/perf_event.h  |    4 ++
 kernel/perf_event.c         |  111 +++++++++++++++++++++++++++++++++++++++++-
 tools/perf/builtin-record.c |   70 ++++++++++++++++++++++++---
 tools/perf/builtin-report.c |   19 +++++++-
 tools/perf/util/event.c     |    7 ++-
 tools/perf/util/event.h     |   15 ++++++
 7 files changed, 213 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event()
  2010-08-30 12:13 [RFC][PATCH 0/5] hrtimer group events Matt Fleming
@ 2010-08-30 12:13 ` Matt Fleming
  2010-08-31 14:54   ` Frederic Weisbecker
  2010-08-30 12:13 ` [RFC][PATCH 2/5] perf: Turn the group counter values into delta values Matt Fleming
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

Don't open-code the event check for excluding the idle thread. Instead,
include the check in perf_exclude_event().

Signed-off-by: Matt Fleming <matt@console-pimps.org>
---
 kernel/perf_event.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 0d38f27..16b0476 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4310,6 +4310,9 @@ static int perf_exclude_event(struct perf_event *event,
 
 		if (event->attr.exclude_kernel && !user_mode(regs))
 			return 1;
+
+		if (event->attr.exclude_idle && current->pid == 0)
+			return 1;
 	}
 
 	return 0;
@@ -4512,9 +4515,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	regs = get_irq_regs();
 
 	if (regs && !perf_exclude_event(event, regs)) {
-		if (!(event->attr.exclude_idle && current->pid == 0))
-			if (perf_event_overflow(event, 0, &data, regs))
-				ret = HRTIMER_NORESTART;
+		if (perf_event_overflow(event, 0, &data, regs))
+			ret = HRTIMER_NORESTART;
 	}
 
 	period = max_t(u64, 10000, event->hw.sample_period);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC][PATCH 2/5] perf: Turn the group counter values into delta values
  2010-08-30 12:13 [RFC][PATCH 0/5] hrtimer group events Matt Fleming
  2010-08-30 12:13 ` [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event() Matt Fleming
@ 2010-08-30 12:13 ` Matt Fleming
  2010-08-30 12:54   ` Peter Zijlstra
  2010-08-30 12:13 ` [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters Matt Fleming
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

Change the semantics of the PERF_FORMAT_GROUP records and make them
delta values, i.e. the difference in value between consecutive reads of
the hardware counters. These delta values will be used in a subsequent
patch to calculate weighted values (by multiplying the deltas with the
time difference between reads of the counters). Calculating these deltas
is easiest to do in the kernel (as opposed to the userland tools)
because we can reset the counter whenever we read from it.

Note that even though this patch changes the semantics of
PERF_FORMAT_GROUP records, there were no users of them anyway (the
tools didn't even understand them).

Signed-off-by: Matt Fleming <matt@console-pimps.org>
---
 kernel/perf_event.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 16b0476..2cda375 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3450,7 +3450,14 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 	if (leader != event)
 		leader->pmu->read(leader);
 
+	/*
+	 * Reset the counter value so that ->count contains a delta from
+	 * the previous value.
+	 */
 	values[n++] = perf_event_count(leader);
+	local64_set(&leader->count, 0);
+	atomic64_set(&leader->child_count, 0);
+
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(leader);
 
@@ -3463,6 +3470,9 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 			sub->pmu->read(sub);
 
 		values[n++] = perf_event_count(sub);
+		local64_set(&sub->count, 0);
+		atomic64_set(&sub->child_count, 0);
+
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters
  2010-08-30 12:13 [RFC][PATCH 0/5] hrtimer group events Matt Fleming
  2010-08-30 12:13 ` [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event() Matt Fleming
  2010-08-30 12:13 ` [RFC][PATCH 2/5] perf: Turn the group counter values into delta values Matt Fleming
@ 2010-08-30 12:13 ` Matt Fleming
  2010-08-30 12:55   ` Peter Zijlstra
  2010-08-30 12:13 ` [RFC][PATCH 4/5] sh: Add support for sampling counters Matt Fleming
  2010-08-30 12:13 ` [RFC][PATCH 5/5] perf: Add support for PERF_SAMPLE_READ samples Matt Fleming
  4 siblings, 1 reply; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

Currently, it's impossible to periodically sample hardware counters that
lack performance monitoring interrupt (PMI) support. In order to sample
these counters we can create an event group which is backed by a
hrtimer, thereby simulating a PMI.

When the hrtimer fires we sample the values in the hardware
counters. Because we obviously can't rely on the hrtimer firing at
exactly the sample period (say, every 1000 cache misses), the values
need to be weighted by the variable period since the hrtimer last
fired. This compensates for the variability in the hrtimer period.

If perf record tries to create a sampling counter but the hardware
doesn't support it, we fall back to creating an event group backed by
a hrtimer.

Signed-off-by: Matt Fleming <matt@console-pimps.org>
---
 include/linux/perf_event.h |    4 ++
 kernel/perf_event.c        |   93 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 000610c..57d4aa0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -755,6 +755,10 @@ struct perf_event {
 
 	perf_overflow_handler_t		overflow_handler;
 
+	/* timer for sampling event group */
+	struct hrtimer			group_hrtimer;
+	s64				group_remaining;
+
 #ifdef CONFIG_EVENT_TRACING
 	struct ftrace_event_call	*tp_event;
 	struct event_filter		*filter;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 2cda375..10a054b 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -85,6 +85,11 @@ void __weak hw_perf_enable(void)		{ barrier(); }
 
 void __weak perf_event_print_debug(void)	{ }
 
+static int perf_exclude_event(struct perf_event *event, struct pt_regs *regs);
+static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
+				  int nmi, struct perf_sample_data *data,
+				  struct pt_regs *regs);
+
 static DEFINE_PER_CPU(int, perf_disable_count);
 
 void perf_disable(void)
@@ -402,6 +407,89 @@ static void perf_group_detach(struct perf_event *event)
 	}
 }
 
+static enum hrtimer_restart perf_group_event_hrtimer(struct hrtimer *hrtimer)
+{
+	struct perf_sample_data data;
+	struct pt_regs *regs;
+	struct perf_event *leader, *event;
+	struct hw_perf_event *hwc;
+	u64 period, nr;
+
+	leader = container_of(hrtimer, struct perf_event, group_hrtimer);
+	hwc = &leader->hw;
+	leader->pmu->read(leader);
+	nr = local64_read(&leader->count);
+
+	perf_sample_data_init(&data, 0);
+	data.period = leader->hw.last_period;
+	regs = get_irq_regs();
+
+	if (!regs || perf_exclude_event(leader, regs))
+		goto restart_timer;
+
+	if (!local64_add_negative(nr, &hwc->period_left))
+		perf_swevent_overflow(leader, 0, 0, &data, regs);
+
+	list_for_each_entry(event, &leader->sibling_list, group_entry) {
+		if (perf_exclude_event(event, regs))
+			continue;
+
+		event->pmu->read(event);
+		hwc = &event->hw;
+		nr = local64_read(&event->count);
+
+		if (local64_add_negative(nr, &hwc->period_left))
+			continue;
+
+		perf_swevent_overflow(event, 0, 0, &data, regs);
+	}
+
+restart_timer:
+	period = max_t(u64, 10000, leader->hw.sample_period);
+	hrtimer_forward_now(hrtimer, ns_to_ktime(period));
+
+	return HRTIMER_RESTART;
+}
+
+static void perf_group_start_hrtimer(struct perf_event *group_event)
+{
+	struct hw_perf_event *hwc = &group_event->hw;
+	struct hrtimer *hrtimer = &group_event->group_hrtimer;
+
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+	hrtimer->function = perf_group_event_hrtimer;
+	if (hwc->sample_period) {
+		u64 period;
+
+		if (group_event->group_remaining) {
+			if (group_event->group_remaining < 0)
+				period = 10000;
+			else
+				period = group_event->group_remaining;
+			group_event->group_remaining = 0;
+		} else {
+			period = max_t(u64, 10000, hwc->sample_period);
+		}
+		__hrtimer_start_range_ns(hrtimer,
+				ns_to_ktime(period), 0,
+				HRTIMER_MODE_REL, 0);
+	}
+}
+
+static void perf_group_cancel_hrtimer(struct perf_event *group_event)
+{
+	struct hw_perf_event *hwc = &group_event->hw;
+	struct hrtimer *hrtimer = &group_event->group_hrtimer;
+
+	if (hwc->sample_period) {
+		ktime_t remaining = hrtimer_get_remaining(hrtimer);
+		group_event->group_remaining = ktime_to_ns(remaining);
+
+		hrtimer_cancel(&group_event->group_hrtimer);
+	}
+}
+
 static void
 event_sched_out(struct perf_event *event,
 		  struct perf_cpu_context *cpuctx,
@@ -436,6 +524,8 @@ group_sched_out(struct perf_event *group_event,
 	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
 		return;
 
+	perf_group_cancel_hrtimer(group_event);
+
 	event_sched_out(group_event, cpuctx, ctx);
 
 	/*
@@ -702,6 +792,9 @@ group_sched_in(struct perf_event *group_event,
 		}
 	}
 
+	/* Kick off the hrtimer that samples this group */
+	perf_group_start_hrtimer(group_event);
+
 	if (!txn || !pmu->commit_txn(pmu))
 		return 0;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC][PATCH 4/5] sh: Add support for sampling counters
  2010-08-30 12:13 [RFC][PATCH 0/5] hrtimer group events Matt Fleming
                   ` (2 preceding siblings ...)
  2010-08-30 12:13 ` [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters Matt Fleming
@ 2010-08-30 12:13 ` Matt Fleming
  2010-08-30 12:13 ` [RFC][PATCH 5/5] perf: Add support for PERF_SAMPLE_READ samples Matt Fleming
  4 siblings, 0 replies; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

SH performance counters cannot generate any form of interrupt and so
require hrtimer assistance to sample counter values. Event groups
periodically sample counters when a hrtimer fires, which allows the use
of sampling counters on SH.

Signed-off-by: Matt Fleming <matt@console-pimps.org>
---
 arch/sh/kernel/perf_event.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
index 7a3dc35..1f77de6 100644
--- a/arch/sh/kernel/perf_event.c
+++ b/arch/sh/kernel/perf_event.c
@@ -114,7 +114,8 @@ static int __hw_perf_event_init(struct perf_event *event)
 	 * no interrupts, and are therefore unable to do sampling without
 	 * further work and timer assistance.
 	 */
-	if (hwc->sample_period)
+	if (!(attr->read_format & PERF_FORMAT_GROUP) &&
+	    hwc->sample_period && event->group_leader == event)
 		return -EINVAL;
 
 	/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC][PATCH 5/5] perf: Add support for PERF_SAMPLE_READ samples
  2010-08-30 12:13 [RFC][PATCH 0/5] hrtimer group events Matt Fleming
                   ` (3 preceding siblings ...)
  2010-08-30 12:13 ` [RFC][PATCH 4/5] sh: Add support for sampling counters Matt Fleming
@ 2010-08-30 12:13 ` Matt Fleming
  4 siblings, 0 replies; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

Using the counter values from PERF_SAMPLE_READ samples and weighting
them by the hrtimer period we can approximate a counter with a PMI. By
following how fast a counter varies over a hrtimer period we can figure
out which functions are causing the counters to change the fastest.

Suppose you have a workload consisting of two main parts:

  my_important_work()
  {
     load_my_data();
     compute_me_silly();
  }

Now, let's assume that both these functions take the same time to
complete for each part of the work. In that case a periodic timer
generates samples that are roughly 50/50 distributed between these two
functions.

Let us further assume that load_my_data() is slow because it's
missing all the caches and compute_me_silly() is slow because it's
defeating the branch predictor.

So what we end up with is that when we sample for cache-misses we get
load_my_data() as the predominant function, not a nice 50/50
relation. The same goes for branch misses and compute_me_silly().

By weighting the samples by the hardware counter delta we get that
50/50 relation back; assuming the sampling frequency is not a harmonic
of the runtime of these functions, the statistics will do the right
thing.

Signed-off-by: Matt Fleming <matt@console-pimps.org>
---
 tools/perf/builtin-record.c |   70 ++++++++++++++++++++++++++++++++++++++-----
 tools/perf/builtin-report.c |   19 ++++++++++-
 tools/perf/util/event.c     |    7 +++-
 tools/perf/util/event.h     |   15 +++++++++
 4 files changed, 99 insertions(+), 12 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index b530bee..4bd7c4a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -224,20 +224,43 @@ static struct perf_header_attr *get_header_attr(struct perf_event_attr *a, int n
 	return h_attr;
 }
 
+struct read_format_single {
+	u64 count;
+	u64 time_enabled;
+	u64 time_running;
+	u64 id;
+};
+
+struct group_entry {
+	u64 value;
+	u64 id;
+};
+
+struct read_format_group {
+	u64 nr;
+	u64 time_enabled;
+	u64 time_running;
+	struct group_entry cntr[0];
+};
+
 static void create_counter(int counter, int cpu)
 {
 	char *filter = filters[counter];
 	struct perf_event_attr *attr = attrs + counter;
 	struct perf_header_attr *h_attr;
 	int track = !counter; /* only the first counter needs these */
+	size_t read_data_sz;
+	void *read_data;
 	int thread_index;
 	int ret;
-	struct {
-		u64 count;
-		u64 time_enabled;
-		u64 time_running;
-		u64 id;
-	} read_data;
+	u64 id;
+
+	read_data_sz = sizeof(struct read_format_single);
+	read_data = malloc(read_data_sz);
+	if (!read_data) {
+		perror("Unable to allocate read data");
+		return;
+	}
 
 	attr->read_format	= PERF_FORMAT_TOTAL_TIME_ENABLED |
 				  PERF_FORMAT_TOTAL_TIME_RUNNING |
@@ -325,6 +348,32 @@ try_again:
 				attr->config = PERF_COUNT_SW_CPU_CLOCK;
 				goto try_again;
 			}
+
+			/*
+			 * If we requested a sampling counter but the
+			 * hardware doesn't support it, create an
+			 * event group.
+			 */
+			if (err == EINVAL && attr->sample_period && !group) {
+				size_t sz = sizeof(struct read_format_group);
+
+				attr->read_format |= PERF_FORMAT_GROUP;
+				attr->sample_type |= PERF_SAMPLE_READ;
+
+				free(read_data);
+
+				read_data_sz = sz + (sizeof(struct group_entry) * nr_counters);
+				read_data = malloc(read_data_sz);
+				if (!read_data) {
+					perror("Unable to allocate read_data");
+					exit(-1);
+				}
+
+				/* Only try to fallback to a group once. */
+				group = 1;
+				goto try_again;
+			}
+
 			printf("\n");
 			error("perfcounter syscall returned with %d (%s)\n",
 					fd[nr_cpu][counter][thread_index], strerror(err));
@@ -352,12 +401,17 @@ try_again:
 			}
 		}
 
-		if (read(fd[nr_cpu][counter][thread_index], &read_data, sizeof(read_data)) == -1) {
+		if (read(fd[nr_cpu][counter][thread_index], read_data, read_data_sz) == -1) {
 			perror("Unable to read perf file descriptor");
 			exit(-1);
 		}
 
-		if (perf_header_attr__add_id(h_attr, read_data.id) < 0) {
+		if (attr->read_format & PERF_FORMAT_GROUP)
+			id = ((struct read_format_group *)read_data)->cntr[0].id;
+		else
+			id = ((struct read_format_single *)read_data)->id;
+
+		if (perf_header_attr__add_id(h_attr, id) < 0) {
 			pr_warning("Not enough memory to add id\n");
 			exit(-1);
 		}
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5de405d..44772fb 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -139,14 +139,29 @@ static int add_event_total(struct perf_session *session,
 	if (!hists)
 		return -ENOMEM;
 
-	hists->stats.total_period += data->period;
+	if (attr && (attr->sample_type & PERF_SAMPLE_READ)) {
+		u64 value;
+		unsigned int i;
+
+		for (i = 0; i < data->group->nr; i++) {
+			struct read_group_entry *entry = &data->group->entries[i];
+
+			value = entry->value * data->group->time_running;
+			hists->stats.total_period += value;
+			session->hists.stats.total_period += value;
+		}
+	} else {
+		hists->stats.total_period += data->period;
+		session->hists.stats.total_period += data->period;
+	}
+
 	/*
 	 * FIXME: add_event_total should be moved from here to
 	 * perf_session__process_event so that the proper hist is passed to
 	 * the event_op methods.
 	 */
 	hists__inc_nr_events(hists, PERF_RECORD_SAMPLE);
-	session->hists.stats.total_period += data->period;
+
 	return 0;
 }
 
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index dab9e75..c52b3ef 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -816,8 +816,11 @@ int event__parse_sample(const event_t *event, u64 type, struct sample_data *data
 	}
 
 	if (type & PERF_SAMPLE_READ) {
-		pr_debug("PERF_SAMPLE_READ is unsuported for now\n");
-		return -1;
+		/* FIXME assume group read event for now. */
+		size_t entry_sz = sizeof(struct read_group_entry);
+
+		data->group = (struct read_group *)array;
+		array += sizeof(struct read_group) + (data->group->nr * entry_sz);
 	}
 
 	if (type & PERF_SAMPLE_CALLCHAIN) {
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 8e790da..e7cadaa 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -56,6 +56,20 @@ struct read_event {
 	u64 id;
 };
 
+struct read_group_entry {
+	u64 value;
+	u64 id;
+};
+
+struct read_group {
+	struct perf_event_header header;
+	u32 pid, tid;
+	u64 nr;
+	u64 time_enabled;
+	u64 time_running;
+	struct read_group_entry entries[0];
+};
+
 struct sample_event {
 	struct perf_event_header        header;
 	u64 array[];
@@ -73,6 +87,7 @@ struct sample_data {
 	u32 raw_size;
 	void *raw_data;
 	struct ip_callchain *callchain;
+	struct read_group *group;
 };
 
 #define BUILD_ID_SIZE 20
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH 2/5] perf: Turn the group counter values into delta values
  2010-08-30 12:13 ` [RFC][PATCH 2/5] perf: Turn the group counter values into delta values Matt Fleming
@ 2010-08-30 12:54   ` Peter Zijlstra
  2010-08-30 13:21     ` Matt Fleming
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-08-30 12:54 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, 2010-08-30 at 13:13 +0100, Matt Fleming wrote:
> Change the semantics of the PERF_FORMAT_GROUP records and make them
> delta values, i.e. the difference in value between consecutive reads of
> the hardware counters. These delta values will be used in a subsequent
> patch to calculate weighted values (by multiplying the deltas with the
> time difference between reads of the counters). Calculating these deltas
> is easiest to do in the kernel (as opposed to the userland tools)
> because we can reset the counter whenever we read from it.
> 
> Note that even though this patch changes the semantics of
> PERF_FORMAT_GROUP records there were no users of them anyway (the tools
> didn't even understand them).

libpfmon might be using it... I know some people occasionally reported
bugs for this, so there are users.

Also, there's no real synchronization between read,read and read,sample
so it's racy to do this; simply generate deltas in post-processing.

> Signed-off-by: Matt Fleming <matt@console-pimps.org>
> ---
>  kernel/perf_event.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 16b0476..2cda375 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -3450,7 +3450,14 @@ static void perf_output_read_group(struct perf_output_handle *handle,
>  	if (leader != event)
>  		leader->pmu->read(leader);
>  
> +	/*
> +	 * Reset the counter value so that ->count contains a delta from
> +	 * the previous value.
> +	 */
>  	values[n++] = perf_event_count(leader);
> +	local64_set(&leader->count, 0);
> +	atomic64_set(&leader->child_count, 0);
> +
>  	if (read_format & PERF_FORMAT_ID)
>  		values[n++] = primary_event_id(leader);
>  
> @@ -3463,6 +3470,9 @@ static void perf_output_read_group(struct perf_output_handle *handle,
>  			sub->pmu->read(sub);
>  
>  		values[n++] = perf_event_count(sub);
> +		local64_set(&sub->count, 0);
> +		atomic64_set(&sub->child_count, 0);
> +
>  		if (read_format & PERF_FORMAT_ID)
>  			values[n++] = primary_event_id(sub);
>  



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters
  2010-08-30 12:13 ` [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters Matt Fleming
@ 2010-08-30 12:55   ` Peter Zijlstra
  2010-08-30 13:27     ` Matt Fleming
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-08-30 12:55 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, 2010-08-30 at 13:13 +0100, Matt Fleming wrote:
> Currently, it's impossible to periodically sample hardware counters that
> lack performance monitoring interrupt (PMI) support. In order to sample
> these counters we can create an event group which is backed by a
> hrtimer, thereby simulating a PMI.
> 
> When the hrtimer goes off we sample the values in the hardware
> counters. Because we obviously can't rely on the hrtimer going off at
> exactly the sample period (say, every 1000 cache misses) the values need
> to be weighted by the variable period since the last hrtimer went
> off. This is so that we can compensate for the variability in hrtimer
> period.
> 
> If perf record tries to create a sampling counter and the hardware
> doesn't support it then we'll fall back to creating an event group with
> a hrtimer.


Why is this changing kernel code?

You can create those groups in userspace.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH 2/5] perf: Turn the group counter values into delta values
  2010-08-30 12:54   ` Peter Zijlstra
@ 2010-08-30 13:21     ` Matt Fleming
  0 siblings, 0 replies; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, Aug 30, 2010 at 02:54:01PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-08-30 at 13:13 +0100, Matt Fleming wrote:
> > Change the semantics of the PERF_FORMAT_GROUP records and make them
> > delta values, i.e. the difference in value between consecutive reads of
> > the hardware counters. These delta values will be used in a subsequent
> > patch to calculate weighted values (by multiplying the deltas with the
> > time difference between reads of the counters). Calculating these deltas
> > is easiest to do in the kernel (as opposed to the userland tools)
> > because we can reset the counter whenever we read from it.
> > 
> > Note that even though this patch changes the semantics of
> > PERF_FORMAT_GROUP records there were no users of them anyway (the tools
> > didn't even understand them).
> 
> libpfmon might be using it,.. I know some people occasionally reported
> bugs for this so there are users.
> 
> Also, there's no real synchonization between read,read and read,sample
> so its racy to do this, simply generate deltas in post-processing.

Ah, fair enough. Thanks, I'll do this in post-processing.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters
  2010-08-30 12:55   ` Peter Zijlstra
@ 2010-08-30 13:27     ` Matt Fleming
  2010-08-30 14:12       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, Aug 30, 2010 at 02:55:02PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-08-30 at 13:13 +0100, Matt Fleming wrote:
> > Currently, it's impossible to periodically sample hardware counters that
> > lack performance monitoring interrupt (PMI) support. In order to sample
> > these counters we can create an event group which is backed by a
> > hrtimer, thereby simulating a PMI.
> > 
> > When the hrtimer goes off we sample the values in the hardware
> > counters. Because we obviously can't rely on the hrtimer going off at
> > exactly the sample period (say, every 1000 cache misses) the values need
> > to be weighted by the variable period since the last hrtimer went
> > off. This is so that we can compensate for the variability in hrtimer
> > period.
> > 
> > If perf record tries to create a sampling counter and the hardware
> > doesn't support it then we'll fall back to creating an event group with
> > a hrtimer.
> 
> 
> Why is this changing kernel code?
> 
> You can create those groups in userspace..

I'm lost. Is it possible to do this patch entirely in userspace? How do
we periodically sample the counters if it's not being done in the
kernel?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters
  2010-08-30 13:27     ` Matt Fleming
@ 2010-08-30 14:12       ` Peter Zijlstra
  2010-08-30 19:35         ` Matt Fleming
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-08-30 14:12 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, 2010-08-30 at 14:27 +0100, Matt Fleming wrote:

> I'm lost. Is it possible to do this patch entirely in userspace? How do
> we periodically sample the counters if it's not being done in the
> kernel?

perf_event_attr leader = {
 .type = PERF_TYPE_SOFTWARE,
 .config = PERF_COUNT_SW_TASK_CLOCK,
 .sample_period = xxx,
 .sample_type = PERF_SAMPLE_READ|...,
 .read_format = PERF_FORMAT_GROUP,
};

perf_event_attr counter = {
 .type = PERF_TYPE_HARDWARE,
 .config = PERF_COUNT_HW_CPU_CYCLES,
};
 
leader_fd = sys_perf_event_open(&leader, pid, cpu, 0, 0);
counter_fd = sys_perf_event_open(&counter, pid, cpu, leader_fd, 0);

Which gives you a group of 2 events, one software timer that samples,
one hardware counter that only counts.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters
  2010-08-30 14:12       ` Peter Zijlstra
@ 2010-08-30 19:35         ` Matt Fleming
  0 siblings, 0 replies; 15+ messages in thread
From: Matt Fleming @ 2010-08-30 19:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Rui, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, Aug 30, 2010 at 04:12:24PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-08-30 at 14:27 +0100, Matt Fleming wrote:
> 
> > I'm lost. Is it possible to do this patch entirely in userspace? How do
> > we periodically sample the counters if it's not being done in the
> > kernel?
> 
> perf_event_attr leader = {
>  .type = PERF_TYPE_SOFTWARE,
>  .config = PERF_COUNT_SW_TASK_CLOCK,
>  .sample_period = xxx,
>  .sample_type = PERF_SAMPLE_READ|...,
>  .read_format = PERF_FORMAT_GROUP,
> };
> 
> perf_event_attr counter = {
>  .type = PERF_TYPE_HARDWARE,
>  .config = PERF_COUNT_HW_CPU_CYCLES,
> };
>  
> leader_fd = sys_perf_event_open(&leader, pid, cpu, 0, 0);
> counter_fd = sys_perf_event_open(&counter, pid, cpu, leader_fd, 0);
> 
> Which gives you a group of 2 events, one software timer that samples,
> one hardware counter that only counts.

Aha. OK, thanks! I'll respin this.


* Re: [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event()
  2010-08-30 12:13 ` [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event() Matt Fleming
@ 2010-08-31 14:54   ` Frederic Weisbecker
  2010-08-31 15:20     ` Matt Fleming
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2010-08-31 14:54 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Zhang Rui, linux-kernel, Ingo Molnar,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Mon, Aug 30, 2010 at 01:13:43PM +0100, Matt Fleming wrote:
> Don't open code the event check for excluding the idle thread. Instead
> include the check in perf_exclude_event().
> 
> Signed-off-by: Matt Fleming <matt@console-pimps.org>
> ---
>  kernel/perf_event.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 0d38f27..16b0476 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -4310,6 +4310,9 @@ static int perf_exclude_event(struct perf_event *event,
>  
>  		if (event->attr.exclude_kernel && !user_mode(regs))
>  			return 1;
> +
> +		if (event->attr.exclude_idle && current->pid == 0)
> +			return 1;



Right.

But one of the problems people have reported is that they can miss
interrupt samples when those interrupts happen in idle. Hence we have
decided that exclude_idle should exclude events that happen in idle
process context, but not events in interrupts that interrupt idle.

So adding an in_interrupt() check would perhaps be better.

I plan to do this exclusion using per-context exclusion, a patchset I
have queued. But until then, having this patch is better.



>  	}
>  
>  	return 0;
> @@ -4512,9 +4515,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>  	regs = get_irq_regs();
>  
>  	if (regs && !perf_exclude_event(event, regs)) {
> -		if (!(event->attr.exclude_idle && current->pid == 0))
> -			if (perf_event_overflow(event, 0, &data, regs))
> -				ret = HRTIMER_NORESTART;
> +		if (perf_event_overflow(event, 0, &data, regs))
> +			ret = HRTIMER_NORESTART;



But yeah, if we add an in_interrupt() check in perf_exclude_event(), it
won't work here. This caller needs to check whether irqs are nesting :)

Bah, checking whether we interrupted a softirq is probably enough. I
guess we don't care about nested hardirqs.



* Re: [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event()
  2010-08-31 14:54   ` Frederic Weisbecker
@ 2010-08-31 15:20     ` Matt Fleming
  2010-08-31 15:21       ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Fleming @ 2010-08-31 15:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Zhang Rui, linux-kernel, Ingo Molnar,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Tue, Aug 31, 2010 at 04:54:07PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 30, 2010 at 01:13:43PM +0100, Matt Fleming wrote:
> > Don't open code the event check for excluding the idle thread. Instead
> > include the check in perf_exclude_event().
> > 
> > Signed-off-by: Matt Fleming <matt@console-pimps.org>
> > ---
> >  kernel/perf_event.c |    8 +++++---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > index 0d38f27..16b0476 100644
> > --- a/kernel/perf_event.c
> > +++ b/kernel/perf_event.c
> > @@ -4310,6 +4310,9 @@ static int perf_exclude_event(struct perf_event *event,
> >  
> >  		if (event->attr.exclude_kernel && !user_mode(regs))
> >  			return 1;
> > +
> > +		if (event->attr.exclude_idle && current->pid == 0)
> > +			return 1;
> 
> 
> 
> Right.
> 
> But one of the problems people have reported is that they can miss
> interrupt samples when those interrupts happen in idle. Hence we have
> decided that exclude_idle should exclude events that happen in idle
> process context, but not events in interrupts that interrupt idle.
> 
> So adding an in_interrupt() check would perhaps be better.
> 
> I plan to do this exclusion using per-context exclusion, a patchset I
> have queued. But until then, having this patch is better.
> 
> 
> 
> >  	}
> >  
> >  	return 0;
> > @@ -4512,9 +4515,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
> >  	regs = get_irq_regs();
> >  
> >  	if (regs && !perf_exclude_event(event, regs)) {
> > -		if (!(event->attr.exclude_idle && current->pid == 0))
> > -			if (perf_event_overflow(event, 0, &data, regs))
> > -				ret = HRTIMER_NORESTART;
> > +		if (perf_event_overflow(event, 0, &data, regs))
> > +			ret = HRTIMER_NORESTART;
> 
> 
> 
> But yeah, if we add an in_interrupt() check in perf_exclude_event(), it
> won't work here. This caller needs to check whether irqs are nesting :)
> 
> Bah, checking whether we interrupted a softirq is probably enough. I
> guess we don't care about nested hardirqs.

This patch isn't really worth it on its own; I only moved the idle
check into perf_exclude_event() because patch 3/5 introduced a new
caller. As you've said, the semantics at the various callsites are
quite complex. It's probably best to wait for your patchset :)


* Re: [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event()
  2010-08-31 15:20     ` Matt Fleming
@ 2010-08-31 15:21       ` Frederic Weisbecker
  0 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2010-08-31 15:21 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Peter Zijlstra, Zhang Rui, linux-kernel, Ingo Molnar,
	Robert Richter, Lin Ming, Paul Mackerras,
	Arnaldo Carvalho de Melo, Don Zickus, Cyrill Gorcunov, Len Brown,
	Matthew Garrett

On Tue, Aug 31, 2010 at 04:20:12PM +0100, Matt Fleming wrote:
> On Tue, Aug 31, 2010 at 04:54:07PM +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 30, 2010 at 01:13:43PM +0100, Matt Fleming wrote:
> > > Don't open code the event check for excluding the idle thread. Instead
> > > include the check in perf_exclude_event().
> > > 
> > > Signed-off-by: Matt Fleming <matt@console-pimps.org>
> > > ---
> > >  kernel/perf_event.c |    8 +++++---
> > >  1 files changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > > index 0d38f27..16b0476 100644
> > > --- a/kernel/perf_event.c
> > > +++ b/kernel/perf_event.c
> > > @@ -4310,6 +4310,9 @@ static int perf_exclude_event(struct perf_event *event,
> > >  
> > >  		if (event->attr.exclude_kernel && !user_mode(regs))
> > >  			return 1;
> > > +
> > > +		if (event->attr.exclude_idle && current->pid == 0)
> > > +			return 1;
> > 
> > 
> > 
> > Right.
> > 
> > But one of the problems people have reported is that they can miss
> > interrupt samples when those interrupts happen in idle. Hence we have
> > decided that exclude_idle should exclude events that happen in idle
> > process context, but not events in interrupts that interrupt idle.
> > 
> > So adding an in_interrupt() check would perhaps be better.
> > 
> > I plan to do this exclusion using per-context exclusion, a patchset I
> > have queued. But until then, having this patch is better.
> > 
> > 
> > 
> > >  	}
> > >  
> > >  	return 0;
> > > @@ -4512,9 +4515,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
> > >  	regs = get_irq_regs();
> > >  
> > >  	if (regs && !perf_exclude_event(event, regs)) {
> > > -		if (!(event->attr.exclude_idle && current->pid == 0))
> > > -			if (perf_event_overflow(event, 0, &data, regs))
> > > -				ret = HRTIMER_NORESTART;
> > > +		if (perf_event_overflow(event, 0, &data, regs))
> > > +			ret = HRTIMER_NORESTART;
> > 
> > 
> > 
> > But yeah, if we add an in_interrupt() check in perf_exclude_event(), it
> > won't work here. This caller needs to check whether irqs are nesting :)
> > 
> > Bah, checking whether we interrupted a softirq is probably enough. I
> > guess we don't care about nested hardirqs.
> 
> This patch isn't really worth it on its own; I only moved the idle
> check into perf_exclude_event() because patch 3/5 introduced a new
> caller. As you've said, the semantics at the various callsites are
> quite complex. It's probably best to wait for your patchset :)


Ok :)



end of thread, other threads:[~2010-08-31 15:21 UTC | newest]

Thread overview: 15+ messages
-- links below jump to the message on this page --
2010-08-30 12:13 [RFC][PATCH 0/5] hrtimer group events Matt Fleming
2010-08-30 12:13 ` [RFC][PATCH 1/5] perf: Check if we should exclude idle thread in perf_exclude_event() Matt Fleming
2010-08-31 14:54   ` Frederic Weisbecker
2010-08-31 15:20     ` Matt Fleming
2010-08-31 15:21       ` Frederic Weisbecker
2010-08-30 12:13 ` [RFC][PATCH 2/5] perf: Turn the group counter values into delta values Matt Fleming
2010-08-30 12:54   ` Peter Zijlstra
2010-08-30 13:21     ` Matt Fleming
2010-08-30 12:13 ` [RFC][PATCH 3/5] perf: Add hrtimer code for PMI-less hardware counters Matt Fleming
2010-08-30 12:55   ` Peter Zijlstra
2010-08-30 13:27     ` Matt Fleming
2010-08-30 14:12       ` Peter Zijlstra
2010-08-30 19:35         ` Matt Fleming
2010-08-30 12:13 ` [RFC][PATCH 4/5] sh: Add support for sampling counters Matt Fleming
2010-08-30 12:13 ` [RFC][PATCH 5/5] perf: Add support for PERF_SAMPLE_READ samples Matt Fleming

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox