public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: David Carrillo-Cisneros <davidcc@google.com>
To: linux-kernel@vger.kernel.org
Cc: "x86@kernel.org" <x86@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <ak@linux.intel.com>, Kan Liang <kan.liang@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@suse.de>,
	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Vince Weaver <vince@deater.net>, Paul Turner <pjt@google.com>,
	Stephane Eranian <eranian@google.com>,
	David Carrillo-Cisneros <davidcc@google.com>
Subject: [RFC 2/6] perf/core: add a rb-tree index to inactive_groups
Date: Tue, 10 Jan 2017 02:24:58 -0800	[thread overview]
Message-ID: <20170110102502.106187-3-davidcc@google.com> (raw)
In-Reply-To: <20170110102502.106187-1-davidcc@google.com>

Add a rb-tree that indexes inactive events by {CPU/cgroup,flexible,stamp}.

The original idea by Peter Z. was to sort task events in an rb-tree using
{pmu,cpu,timestamp} as key.

Having the PMU as part of the key gets complicated for contexts that
share pmus (i.e. software context) because all events in a context should
rotate together irrespective of their pmu. It's also unclear to me that
there is any case where a seach by pmu is useful.

Another complicatino is that using ctx->time (or timestamp) implies that
groups added during the same context switch may not have unique key.
This increases the complexity of that finds all events in the rb-tree
that are within a time interval.

Lastly, it is useful to query pinned and flexible events separately since
they are scheduled in at different times.

For the reasons above, I created a rb-tree per context with key
{CPU,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for
CPU contexts.

The "flexible" boolean allows to query pinned or flexible events
separately.
The stamp is given by a non-decreasing counter: ctx->nr_events_added.
It increases every time a new inactive event is inserted. That choice of
stamp guarantees unique keys for all events and that events of the same
type (same {CPU/cgroup,flexible}) have the same order in the rb-tree.

When events are scheduled in or rotated, all events in the context must be
iterated or rotated together, irrespective of the CPU/cgroup. To do that,
we add ctx->inactive_groups, a list that "threads" the rb-tree in total
ctx->nr_events_added order. Note that this order is the same as timestamp
order and ctx->inactive_groups is used for both scheduling and iteration.
The rb-tree can be seen as an index over ctx->inactive_groups.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h |  5 +++
 kernel/events/core.c       | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3fa18f05c9b0..fd32ecc37d33 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -564,6 +564,9 @@ struct perf_event {
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
 
+	u64				rbtree_key;
+	struct rb_node			rbtree_node;
+
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
 	 * cannot use the event_entry because of RCU and we want to keep the
@@ -736,6 +739,8 @@ struct perf_event_context {
 	struct list_head		pinned_groups;
 	struct list_head		flexible_groups;
 
+	struct rb_root			rbtree_root;
+	u32				nr_inactive_added;
 	struct list_head		active_pinned_groups;
 	struct list_head		active_flexible_groups;
 	struct list_head		inactive_groups;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b744b5a8dbd0..623d81c0ca93 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1462,19 +1462,98 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 		return &ctx->flexible_groups;
 }
 
+/*
+ * The bits perf_event::kbtree_key represent:
+ *   - 63:33 an unique identifier for CPU (if a task context) or a cgroup
+ *     (if a CPU context).
+ *   - 32    a boolean to indicate if eventt is flexible (vs  pinnned).
+ *   - 31:0  a unique "stamp" that follows the last time the event was
+ *   scheduled.
+ * The 64 bits value groups event of the same type (CPU/cgroup + flexible)
+ * together in the rb-tree.
+ */
+#define RBTREE_KEY_STAMP_WIDTH		32
+#define RBTREE_KEY_STAMP_MASK 		GENMASK_ULL(RBTREE_KEY_STAMP_WIDTH - 1, 0)
+#define RBTREE_KEY_FLEXIBLE_MASK	BIT_ULL(RBTREE_KEY_STAMP_WIDTH)
+
+static u64 taskctx_rbtree_key(int cpu, bool flexible)
+{
+	/*
+	 * Use CPU only. PMU is never used in schedule in/out and, since  some
+	 * contexts share PMU, iterate over them would make things complicated.
+	 * I could not find a case where an ordered iteration over all PMU
+	 * events in one context is useful.
+	 */
+	return ((u64)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
+		(flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
+}
+
+static u64 cpuctx_rbtree_key(struct perf_cgroup *cgrp, bool flexible)
+{
+	u64 k;
+
+	if (cgrp)
+		/* A cheap way to obtain an identifier for a cgroup. Suggestions appreciated. */
+		k = (u64)cgrp->css.id << (RBTREE_KEY_STAMP_WIDTH + 1);
+	else
+		k = GENMASK_ULL(63, RBTREE_KEY_STAMP_WIDTH + 1);
+	return k | (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
+}
+
+static void
+rbtree_add_inactive(struct perf_event *event,
+		    struct perf_event_context *ctx)
+{
+	struct rb_node **pos = &(ctx->rbtree_root.rb_node), *parent = NULL;
+	struct perf_event *pos_event;
+
+	event->rbtree_key &= ~RBTREE_KEY_STAMP_MASK;
+	/*
+	 * A unique key simplifies finding intervals of events. We could use
+	 * ctx time as timestamp, but it may no be unique. So use
+	 * nr_inactive_added, a counter that is guaranteed to be unique and that
+	 * has the same order as ctx->inactive_groups.
+	 */
+	event->rbtree_key |= ctx->nr_inactive_added;
+	while (*pos) {
+		pos_event = rb_entry(*pos, struct perf_event, rbtree_node);
+		parent = *pos;
+		if (event->rbtree_key < pos_event->rbtree_key)
+			pos = &((*pos)->rb_left);
+		else /* There cannot be repeated keys. */
+			pos = &((*pos)->rb_right);
+	}
+	/* Add new node and rebalance tree. */
+	rb_link_node(&event->rbtree_node, parent, pos);
+	rb_insert_color(&event->rbtree_node, &ctx->rbtree_root);
+}
+
 static void
 ctx_sched_groups_to_inactive(struct perf_event *event,
 			     struct perf_event_context *ctx)
 {
 	WARN_ON(event->state != PERF_EVENT_STATE_INACTIVE);
 	list_move_tail(&event->ctx_active_entry, &ctx->inactive_groups);
+	rbtree_add_inactive(event, ctx);
+	ctx->nr_inactive_added++;
 };
 
 static void
 ctx_sched_groups_add(struct perf_event *event, struct perf_event_context *ctx)
 {
+	u64 k;
+
+	WARN_ON(event->state != PERF_EVENT_STATE_INACTIVE);
+	if (event->attach_state & PERF_ATTACH_TASK)
+		k = taskctx_rbtree_key(event->cpu, !event->attr.pinned);
+	else
+		k = cpuctx_rbtree_key(event->cgrp, !event->attr.pinned);
+	event->rbtree_key = k;
+
 	WARN_ON(!list_empty(&event->ctx_active_entry));
 	list_add_tail(&event->ctx_active_entry, &ctx->inactive_groups);
+	rbtree_add_inactive(event, ctx);
+	ctx->nr_inactive_added++;
 }
 
 /*
@@ -1668,6 +1747,7 @@ static void ctx_sched_groups_del(struct perf_event *group,
 				 struct perf_event_context *ctx)
 {
 	WARN_ON(group->state != PERF_EVENT_STATE_INACTIVE);
+	rb_erase(&group->rbtree_node, &ctx->rbtree_root);
 	list_del_init(&group->ctx_active_entry);
 }
 
@@ -2055,6 +2135,7 @@ ctx_sched_groups_to_active(struct perf_event *event, struct perf_event_context *
 	WARN_ON(!event);
 	WARN_ON(list_empty(&event->ctx_active_entry));
 	WARN_ON(event->state != PERF_EVENT_STATE_ACTIVE);
+	rb_erase(&event->rbtree_node, &ctx->rbtree_root);
 	list_move_tail(&event->ctx_active_entry, h);
 }
 
@@ -3690,6 +3771,7 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->pinned_groups);
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
+	ctx->rbtree_root = RB_ROOT;
 	INIT_LIST_HEAD(&ctx->active_pinned_groups);
 	INIT_LIST_HEAD(&ctx->active_flexible_groups);
 	INIT_LIST_HEAD(&ctx->inactive_groups);
-- 
2.11.0.390.gc69c2f50cf-goog

  parent reply	other threads:[~2017-01-10 10:26 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-10 10:24 [RFC 0/6] optimize ctx switch with rb-tree David Carrillo-Cisneros
2017-01-10 10:24 ` [RFC 1/6] perf/core: create active and inactive event groups David Carrillo-Cisneros
2017-01-10 13:49   ` Mark Rutland
2017-01-10 20:45     ` David Carrillo-Cisneros
2017-01-12 11:05       ` Mark Rutland
     [not found]         ` <CALcN6mhPmpSqKhE3Ua+j-xROLzeAyrgdCk4AGGtfF9kExXRTJg@mail.gmail.com>
2017-01-13 11:01           ` Mark Rutland
2017-01-10 10:24 ` David Carrillo-Cisneros [this message]
2017-01-10 14:14   ` [RFC 2/6] perf/core: add a rb-tree index to inactive_groups Mark Rutland
2017-01-10 20:20     ` David Carrillo-Cisneros
2017-01-12 11:47       ` Mark Rutland
2017-01-13  7:34         ` David Carrillo-Cisneros
2017-01-16  2:03   ` [lkp-developer] [perf/core] 33da94bd89: BUG:unable_to_handle_kernel kernel test robot
2017-01-10 10:24 ` [RFC 3/6] perf/core: use rb-tree to sched in event groups David Carrillo-Cisneros
2017-01-10 16:38   ` Mark Rutland
2017-01-10 20:51     ` David Carrillo-Cisneros
2017-01-12 12:14       ` Mark Rutland
2017-01-13  8:01         ` David Carrillo-Cisneros
2017-01-13 10:24           ` Mark Rutland
2017-01-11 20:31     ` Liang, Kan
2017-01-12 10:11       ` Mark Rutland
2017-01-12 13:28         ` Liang, Kan
2017-01-13  8:05           ` David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 4/6] perf/core: avoid rb-tree traversal when no inactive events David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 5/6] perf/core: rotation no longer necessary. Behavior has changed. Beware David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 6/6] perf/core: use rb-tree index to optimize filtered perf_iterate_ctx David Carrillo-Cisneros
2017-01-16  2:05   ` [lkp-developer] [perf/core] 49c04ee1a7: WARNING:at_kernel/events/core.c:#perf_iterate_ctx_matching kernel test robot
2017-04-25 17:27 ` [RFC 0/6] optimize ctx switch with rb-tree Liang, Kan
2017-04-25 17:49   ` David Carrillo-Cisneros
2017-04-25 18:11     ` Budankov, Alexey
2017-04-25 18:54       ` David Carrillo-Cisneros
2017-04-26 10:34         ` Budankov, Alexey
2017-04-26 19:40           ` David Carrillo-Cisneros
2017-04-26 10:52         ` Mark Rutland

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170110102502.106187-3-davidcc@google.com \
    --to=davidcc@google.com \
    --cc=acme@kernel.org \
    --cc=ak@linux.intel.com \
    --cc=bp@suse.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=eranian@google.com \
    --cc=kan.liang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tglx@linutronix.de \
    --cc=vikas.shivappa@linux.intel.com \
    --cc=vince@deater.net \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox