All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Carrillo-Cisneros <davidcc@google.com>
To: linux-kernel@vger.kernel.org
Cc: "x86@kernel.org" <x86@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <ak@linux.intel.com>, Kan Liang <kan.liang@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@suse.de>,
	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Vince Weaver <vince@deater.net>, Paul Turner <pjt@google.com>,
	Stephane Eranian <eranian@google.com>,
	David Carrillo-Cisneros <davidcc@google.com>
Subject: [RFC 2/6] perf/core: add a rb-tree index to inactive_groups
Date: Tue, 10 Jan 2017 02:24:58 -0800	[thread overview]
Message-ID: <20170110102502.106187-3-davidcc@google.com> (raw)
In-Reply-To: <20170110102502.106187-1-davidcc@google.com>

Add a rb-tree that indexes inactive events by {CPU/cgroup,flexible,stamp}.

The original idea by Peter Z. was to sort task events in an rb-tree using
{pmu,cpu,timestamp} as key.

Having the PMU as part of the key gets complicated for contexts that
share pmus (i.e. software context) because all events in a context should
rotate together irrespective of their pmu. It's also unclear to me that
there is any case where a seach by pmu is useful.

Another complicatino is that using ctx->time (or timestamp) implies that
groups added during the same context switch may not have unique key.
This increases the complexity of that finds all events in the rb-tree
that are within a time interval.

Lastly, it is useful to query pinned and flexible events separately since
they are scheduled in at different times.

For the reasons above, I created a rb-tree per context with key
{CPU,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for
CPU contexts.

The "flexible" boolean allows to query pinned or flexible events
separately.
The stamp is given by a non-decreasing counter: ctx->nr_events_added.
It increases every time a new inactive event is inserted. That choice of
stamp guarantees unique keys for all events and that events of the same
type (same {CPU/cgroup,flexible}) have the same order in the rb-tree.

When events are scheduled in or rotated, all events in the context must be
iterated or rotated together, irrespective of the CPU/cgroup. To do that,
we add ctx->inactive_groups, a list that "threads" the rb-tree in total
ctx->nr_events_added order. Note that this order is the same as timestamp
order and ctx->inactive_groups is used for both scheduling and iteration.
The rb-tree can be seen as an index over ctx->inactive_groups.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h |  5 +++
 kernel/events/core.c       | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3fa18f05c9b0..fd32ecc37d33 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -564,6 +564,9 @@ struct perf_event {
 	struct list_head		group_entry;
 	struct list_head		sibling_list;
 
+	u64				rbtree_key;
+	struct rb_node			rbtree_node;
+
 	/*
 	 * We need storage to track the entries in perf_pmu_migrate_context; we
 	 * cannot use the event_entry because of RCU and we want to keep the
@@ -736,6 +739,8 @@ struct perf_event_context {
 	struct list_head		pinned_groups;
 	struct list_head		flexible_groups;
 
+	struct rb_root			rbtree_root;
+	u32				nr_inactive_added;
 	struct list_head		active_pinned_groups;
 	struct list_head		active_flexible_groups;
 	struct list_head		inactive_groups;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b744b5a8dbd0..623d81c0ca93 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1462,19 +1462,98 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 		return &ctx->flexible_groups;
 }
 
+/*
+ * The bits perf_event::kbtree_key represent:
+ *   - 63:33 an unique identifier for CPU (if a task context) or a cgroup
+ *     (if a CPU context).
+ *   - 32    a boolean to indicate if eventt is flexible (vs  pinnned).
+ *   - 31:0  a unique "stamp" that follows the last time the event was
+ *   scheduled.
+ * The 64 bits value groups event of the same type (CPU/cgroup + flexible)
+ * together in the rb-tree.
+ */
+#define RBTREE_KEY_STAMP_WIDTH		32
+#define RBTREE_KEY_STAMP_MASK 		GENMASK_ULL(RBTREE_KEY_STAMP_WIDTH - 1, 0)
+#define RBTREE_KEY_FLEXIBLE_MASK	BIT_ULL(RBTREE_KEY_STAMP_WIDTH)
+
+static u64 taskctx_rbtree_key(int cpu, bool flexible)
+{
+	/*
+	 * Use CPU only. PMU is never used in schedule in/out and, since  some
+	 * contexts share PMU, iterate over them would make things complicated.
+	 * I could not find a case where an ordered iteration over all PMU
+	 * events in one context is useful.
+	 */
+	return ((u64)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
+		(flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
+}
+
+static u64 cpuctx_rbtree_key(struct perf_cgroup *cgrp, bool flexible)
+{
+	u64 k;
+
+	if (cgrp)
+		/* A cheap way to obtain an identifier for a cgroup. Suggestions appreciated. */
+		k = (u64)cgrp->css.id << (RBTREE_KEY_STAMP_WIDTH + 1);
+	else
+		k = GENMASK_ULL(63, RBTREE_KEY_STAMP_WIDTH + 1);
+	return k | (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
+}
+
+static void
+rbtree_add_inactive(struct perf_event *event,
+		    struct perf_event_context *ctx)
+{
+	struct rb_node **pos = &(ctx->rbtree_root.rb_node), *parent = NULL;
+	struct perf_event *pos_event;
+
+	event->rbtree_key &= ~RBTREE_KEY_STAMP_MASK;
+	/*
+	 * A unique key simplifies finding intervals of events. We could use
+	 * ctx time as timestamp, but it may no be unique. So use
+	 * nr_inactive_added, a counter that is guaranteed to be unique and that
+	 * has the same order as ctx->inactive_groups.
+	 */
+	event->rbtree_key |= ctx->nr_inactive_added;
+	while (*pos) {
+		pos_event = rb_entry(*pos, struct perf_event, rbtree_node);
+		parent = *pos;
+		if (event->rbtree_key < pos_event->rbtree_key)
+			pos = &((*pos)->rb_left);
+		else /* There cannot be repeated keys. */
+			pos = &((*pos)->rb_right);
+	}
+	/* Add new node and rebalance tree. */
+	rb_link_node(&event->rbtree_node, parent, pos);
+	rb_insert_color(&event->rbtree_node, &ctx->rbtree_root);
+}
+
 static void
 ctx_sched_groups_to_inactive(struct perf_event *event,
 			     struct perf_event_context *ctx)
 {
 	WARN_ON(event->state != PERF_EVENT_STATE_INACTIVE);
 	list_move_tail(&event->ctx_active_entry, &ctx->inactive_groups);
+	rbtree_add_inactive(event, ctx);
+	ctx->nr_inactive_added++;
 };
 
 static void
 ctx_sched_groups_add(struct perf_event *event, struct perf_event_context *ctx)
 {
+	u64 k;
+
+	WARN_ON(event->state != PERF_EVENT_STATE_INACTIVE);
+	if (event->attach_state & PERF_ATTACH_TASK)
+		k = taskctx_rbtree_key(event->cpu, !event->attr.pinned);
+	else
+		k = cpuctx_rbtree_key(event->cgrp, !event->attr.pinned);
+	event->rbtree_key = k;
+
 	WARN_ON(!list_empty(&event->ctx_active_entry));
 	list_add_tail(&event->ctx_active_entry, &ctx->inactive_groups);
+	rbtree_add_inactive(event, ctx);
+	ctx->nr_inactive_added++;
 }
 
 /*
@@ -1668,6 +1747,7 @@ static void ctx_sched_groups_del(struct perf_event *group,
 				 struct perf_event_context *ctx)
 {
 	WARN_ON(group->state != PERF_EVENT_STATE_INACTIVE);
+	rb_erase(&group->rbtree_node, &ctx->rbtree_root);
 	list_del_init(&group->ctx_active_entry);
 }
 
@@ -2055,6 +2135,7 @@ ctx_sched_groups_to_active(struct perf_event *event, struct perf_event_context *
 	WARN_ON(!event);
 	WARN_ON(list_empty(&event->ctx_active_entry));
 	WARN_ON(event->state != PERF_EVENT_STATE_ACTIVE);
+	rb_erase(&event->rbtree_node, &ctx->rbtree_root);
 	list_move_tail(&event->ctx_active_entry, h);
 }
 
@@ -3690,6 +3771,7 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->pinned_groups);
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
+	ctx->rbtree_root = RB_ROOT;
 	INIT_LIST_HEAD(&ctx->active_pinned_groups);
 	INIT_LIST_HEAD(&ctx->active_flexible_groups);
 	INIT_LIST_HEAD(&ctx->inactive_groups);
-- 
2.11.0.390.gc69c2f50cf-goog

  parent reply	other threads:[~2017-01-10 10:26 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-10 10:24 [RFC 0/6] optimize ctx switch with rb-tree David Carrillo-Cisneros
2017-01-10 10:24 ` [RFC 1/6] perf/core: create active and inactive event groups David Carrillo-Cisneros
2017-01-10 13:49   ` Mark Rutland
2017-01-10 20:45     ` David Carrillo-Cisneros
2017-01-12 11:05       ` Mark Rutland
     [not found]         ` <CALcN6mhPmpSqKhE3Ua+j-xROLzeAyrgdCk4AGGtfF9kExXRTJg@mail.gmail.com>
2017-01-13 11:01           ` Mark Rutland
2017-01-10 10:24 ` David Carrillo-Cisneros [this message]
2017-01-10 14:14   ` [RFC 2/6] perf/core: add a rb-tree index to inactive_groups Mark Rutland
2017-01-10 20:20     ` David Carrillo-Cisneros
2017-01-12 11:47       ` Mark Rutland
2017-01-13  7:34         ` David Carrillo-Cisneros
2017-01-16  2:03   ` [lkp-developer] [perf/core] 33da94bd89: BUG:unable_to_handle_kernel kernel test robot
2017-01-16  2:03     ` kernel test robot
2017-01-10 10:24 ` [RFC 3/6] perf/core: use rb-tree to sched in event groups David Carrillo-Cisneros
2017-01-10 16:38   ` Mark Rutland
2017-01-10 20:51     ` David Carrillo-Cisneros
2017-01-12 12:14       ` Mark Rutland
2017-01-13  8:01         ` David Carrillo-Cisneros
2017-01-13 10:24           ` Mark Rutland
2017-01-11 20:31     ` Liang, Kan
2017-01-12 10:11       ` Mark Rutland
2017-01-12 13:28         ` Liang, Kan
2017-01-13  8:05           ` David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 4/6] perf/core: avoid rb-tree traversal when no inactive events David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 5/6] perf/core: rotation no longer necessary. Behavior has changed. Beware David Carrillo-Cisneros
2017-01-10 10:25 ` [RFC 6/6] perf/core: use rb-tree index to optimize filtered perf_iterate_ctx David Carrillo-Cisneros
2017-01-16  2:05   ` [lkp-developer] [perf/core] 49c04ee1a7: WARNING:at_kernel/events/core.c:#perf_iterate_ctx_matching kernel test robot
2017-01-16  2:05     ` kernel test robot
2017-04-25 17:27 ` [RFC 0/6] optimize ctx switch with rb-tree Liang, Kan
2017-04-25 17:49   ` David Carrillo-Cisneros
2017-04-25 18:11     ` Budankov, Alexey
2017-04-25 18:54       ` David Carrillo-Cisneros
2017-04-26 10:34         ` Budankov, Alexey
2017-04-26 19:40           ` David Carrillo-Cisneros
2017-04-26 10:52         ` Mark Rutland

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170110102502.106187-3-davidcc@google.com \
    --to=davidcc@google.com \
    --cc=acme@kernel.org \
    --cc=ak@linux.intel.com \
    --cc=bp@suse.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=eranian@google.com \
    --cc=kan.liang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tglx@linutronix.de \
    --cc=vikas.shivappa@linux.intel.com \
    --cc=vince@deater.net \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.