[PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alexey Budankov <alexey.budankov@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Andi Kleen <ak@linux.intel.com>, Kan Liang <kan.liang@intel.com>,
	Dmitri Prokhorov <Dmitry.Prohorov@intel.com>,
	Valery Cherepennikov <valery.cherepennikov@intel.com>,
	David Carrillo-Cisneros <davidcc@google.com>,
	Stephane Eranian <eranian@google.com>,
	Mark Rutland <mark.rutland@arm.com>,
	linux-kernel@vger.kernel.org
Subject: [PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
Date: Sat, 27 May 2017 01:13:25 +0300	[thread overview]
Message-ID: <1e962b59-3e39-e0d6-515d-c4fd3502edae@linux.intel.com> (raw)


perf/core: addressing 4x slowdown during per-process profiling of
            STREAM benchmark on Intel Xeon Phi

Motivation:

The issue manifests like 4x slowdown when profiling single thread
STREAM benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distro).
Perf profiling is done in per-process mode and involves about 30 core
events. In case the benchmark is OpenMP based and runs under profiling
in 272 threads the overhead becomes even more dramatic:
512.144s against 36.192s (with this patch).

Solution:

cpu indexed trees for perf_event_context::pinned_group and
perf_event_context::flexible_group lists are introduced. Every tree
node keeps a list of groups allocated for the same cpu. A tree
references only groups located at the appropriate group list. The tree
provides capability to iterate over groups allocated for a specific
cpu only, what is exactly required by multiplexing timer interrupt
handler. The handler runs per-cpu and enables/disables groups using
group_sched_in()/group_sched_out() that call event_filter_match()
function filtering out groups allocated for cpus different from the
one executing the handler. Additionally for every filtered out group
group_sched_out() updates tstamps values to the current interrupt
time. This updating work is now done only once by
update_context_time() called by ctx_sched_out() before cpu groups
iteration. For this trick to work it is required that tstamps of
filtered out events would point to perf_event_context::tstamp_data
object instead of perf_event::tstam_data ones, as it is initialized
from an event allocation. tstamps references are switched by
group_sched_in()/group_sched_out() every time a group is checked for
its suitability for currently running cpu. When a thread enters some
cpu on a context switch a long run through pinned and flexible groups
is performed by perf_event_sched_in(, mux=0) with new parameter mux
set to 0 and filtered out groups tstamps are switched to
perf_event_context::tstamp_data object. Then a series of multiplexing
interrupts happens and the handler rotates the flexible groups calling
ctx_sched_out(,mux=1)/perf_event_sched_in(,mux=1) iterating over the
cpu tree lists only and avoiding long runs through the complete group
lists. This is where speedup comes from. Eventually when the thread
leaves the cpu ctx_sched_out(,mux=0) is called restoring tstamps
pointers to the events' perf_event::tstamp_data objects.

Locking:

The new perf_event_context::pinned_tree and
perf_event_context::flexible_tree are protected by
perf_evet_context::lock and perf_event_context::mutex, the same
locking schema as for appropriate perf_event_context::pinned_group and
perf_event_context::flexible_group. Indirect perf_event::tstamp
reference and perf_event::tstam_data object are protected by
perf_event::lock as earlier for original perf_event::tstamp_* fields.

Added API:

Objects:
1. struct perf_event_tstamp:
    - enabled
    - running
    - stopped
2. struct perf_event:
    - group_node
    - group_list
    - group_list_entry
    - tstamp
    - tstamp_data
3. struct perf_event_context:
    - pinned_tree
    - flexible_tree
    - tstamp_data

Functions:
1. insert a group into a tree using event->cpu as a key:
         int perf_cpu_tree_insert(struct rb_root *tree,
                 struct perf_event *event);
2. delete a group from a tree, if the group is directly attached
    to a tree it also detaches all groups on the groups
    group_list list:
         int perf_cpu_tree_delete(struct rb_root *tree,
                 struct perf_event *event);
3. find group_list list by a cpu key:
         struct list_head * perf_cpu_tree_find(struct rb_root *tree,
                 int cpu);
4. enable a pinned group on a cpu:
         void ctx_sched_in_flexible_group(
                 struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct perf_event *group, int *can_add_hw);
5. enable a flexible group on a cpu; calls ctx_sched_in_flexible_group
    and updates group->state to ERROR in case of failure:
         void ctx_sched_in_pinned_group(struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct perf_event *group);
6. enable per-cpu pinned tree's groups on a cpu:
         void ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct list_head *groups);
7. enable per-cpu flexible tree's groups on a cpu:
         void ctx_flexible_sched_in_groups(
                 struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct list_head *groups);
8. disable per-cpu tree's groups on a cpu:
         void ctx_sched_out_groups(struct perf_event_context *ctx,
                 struct perf_cpu_context *cpuctx,
                 struct list_head *groups);
9. get a group tree based on event->attr.pinned attribute value:
         struct rb_root * ctx_cpu_tree(struct perf_event *event,
                 struct perf_event_context *ctx);

Modified API:

Objects:
1. struct perf_event
2. struct perf_event_context

Functions:
1. perf_event_alloc()
2. add_event_to_ctx()
3. perf_event_enable_on_exec()
4. __perf_install_in_context()
5. __perf_event_init_context()
6. __perf_event_mark_enabled()
7. __perf_event_enable()
8. __perf_event_task_sched_out()
9. ctx_group_list()
10. list_add_event()
11. list_del_event()
12. perf_event_context_sched_in()
13. cpu_ctx_sched_in()
14. cpu_ctx_sched_out()
15. ctx_sched_in()
16. ctx_sched_out()
17. ctx_resched()
18. ctx_pinned_sched_in()
19. ctx_flexible_sched_in()
20. group_sched_in()
21. event_sched_in()
22. event_sched_out()
23. rotate_ctx()
24. perf_rotate_context()
25. update_context_time()
26. update_event_times()
27. calc_timer_values()
28. perf_cgroup_switch()
29. perf_cgroup_mark_enabled()

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
  include/linux/perf_event.h |  63 +++++--
  kernel/events/core.c       | 442 
++++++++++++++++++++++++++++++++++-----------
  2 files changed, 384 insertions(+), 121 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 24a6358..6bd380f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -550,6 +550,22 @@ struct pmu_event_list {
  	struct list_head	list;
  };

+struct perf_event_tstamp {
+	/*
+	 * These are timestamps used for computing total_time_enabled
+	 * and total_time_running when the event is in INACTIVE or
+	 * ACTIVE state, measured in nanoseconds from an arbitrary point
+	 * in time.
+	 * enabled: the notional time when the event was enabled
+	 * running: the notional time when the event was scheduled on
+	 * stopped: in INACTIVE state, the notional time when the
+	 *	event was scheduled off.
+	 */
+	u64 enabled;
+	u64 running;
+	u64 stopped;
+};
+
  /**
   * struct perf_event - performance event kernel representation:
   */
@@ -571,6 +587,27 @@ struct perf_event {
  	 * either sufficies for read.
  	 */
  	struct list_head		group_entry;
+	/*
+	 * Node on the pinned or flexible tree located at the event context;
+	 * the node may be empty in case its event is not directly attached
+	 * to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct rb_node			group_node;
+	/*
+	 * List keeps groups allocated for the same cpu;
+	 * the list may be empty in case its event is not directly
+	 * attached to the tree but to group_list list of the event directly
+	 * attached to the tree;
+	 */
+	struct list_head		group_list;
+	/*
+	 * Entry into the group_list list above;
+	 * the entry may be attached to the self group_list list above
+	 * in case the event is directly attached to the pinned or
+	 * flexible tree;
+	 */
+	struct list_head		group_list_entry;
  	struct list_head		sibling_list;

  	/*
@@ -611,18 +648,11 @@ struct perf_event {
  	u64				total_time_running;

  	/*
-	 * These are timestamps used for computing total_time_enabled
-	 * and total_time_running when the event is in INACTIVE or
-	 * ACTIVE state, measured in nanoseconds from an arbitrary point
-	 * in time.
-	 * tstamp_enabled: the notional time when the event was enabled
-	 * tstamp_running: the notional time when the event was scheduled on
-	 * tstamp_stopped: in INACTIVE state, the notional time when the
-	 *	event was scheduled off.
+	 * tstamp points to the tstamp_data object below or to the object
+	 * located at the event context;
  	 */
-	u64				tstamp_enabled;
-	u64				tstamp_running;
-	u64				tstamp_stopped;
+	struct perf_event_tstamp	*tstamp;
+	struct perf_event_tstamp	tstamp_data;

  	/*
  	 * timestamp shadows the actual context timing but it can
@@ -742,7 +772,17 @@ struct perf_event_context {

  	struct list_head		active_ctx_list;
  	struct list_head		pinned_groups;
+	/*
+	 * Cpu tree for pinned groups; keeps event's group_node nodes
+	 * of attached flexible groups;
+	 */
+	struct rb_root			pinned_tree;
  	struct list_head		flexible_groups;
+	/*
+	 * Cpu tree for flexible groups; keeps event's group_node nodes
+	 * of attached flexible groups;
+	 */
+	struct rb_root			flexible_tree;
  	struct list_head		event_list;
  	int				nr_events;
  	int				nr_active;
@@ -758,6 +798,7 @@ struct perf_event_context {
  	 */
  	u64				time;
  	u64				timestamp;
+	struct perf_event_tstamp	tstamp_data;

  	/*
  	 * These fields let us detect when two contexts have both
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f8c27d3..9168406 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -555,11 +555,11 @@ void perf_sample_event_took(u64 sample_len_ns)
  static atomic64_t perf_event_id;

  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
+			      enum event_type_t event_type, int mux);

  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
  			     enum event_type_t event_type,
-			     struct task_struct *task);
+			     struct task_struct *task, int mux);

  static void update_context_time(struct perf_event_context *ctx);
  static u64 perf_event_time(struct perf_event *event);
@@ -701,6 +701,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  	struct perf_cpu_context *cpuctx;
  	struct list_head *list;
  	unsigned long flags;
+	int mux = 0;

  	/*
  	 * Disable interrupts and preemption to avoid this CPU's
@@ -716,7 +717,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  		perf_pmu_disable(cpuctx->ctx.pmu);

  		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			cpu_ctx_sched_out(cpuctx, EVENT_ALL, mux);
  			/*
  			 * must not be done before ctxswout due
  			 * to event_filter_match() in event_sched_out()
@@ -735,7 +736,7 @@ static void perf_cgroup_switch(struct task_struct 
*task, int mode)
  			 */
  			cpuctx->cgrp = perf_cgroup_from_task(task,
  							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task, mux);
  		}
  		perf_pmu_enable(cpuctx->ctx.pmu);
  		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -865,10 +866,10 @@ perf_cgroup_mark_enabled(struct perf_event *event,

  	event->cgrp_defer_enabled = 0;

-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
  		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
  			sub->cgrp_defer_enabled = 0;
  		}
  	}
@@ -1383,6 +1384,9 @@ static void update_context_time(struct 
perf_event_context *ctx)

  	ctx->time += now - ctx->timestamp;
  	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
  }

  static u64 perf_event_time(struct perf_event *event)
@@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event 
*event)
  	else if (ctx->is_active)
  		run_end = ctx->time;
  	else
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;

-	event->total_time_enabled = run_end - event->tstamp_enabled;
+	event->total_time_enabled = run_end - event->tstamp->enabled;

  	if (event->state == PERF_EVENT_STATE_INACTIVE)
-		run_end = event->tstamp_stopped;
+		run_end = event->tstamp->stopped;
  	else
  		run_end = perf_event_time(event);

-	event->total_time_running = run_end - event->tstamp_running;
+	event->total_time_running = run_end - event->tstamp->running;

  }

@@ -1472,6 +1476,186 @@ ctx_group_list(struct perf_event *event, struct 
perf_event_context *ctx)
  		return &ctx->flexible_groups;
  }

+static int
+perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
+{
+	struct rb_node **node;
+	struct rb_node *parent;
+
+	if (!tree || !event)
+		return 0;
+
+	node = &tree->rb_node;
+	parent = *node;
+
+	while (*node) {
+		struct perf_event *node_event =	container_of(*node,
+				struct perf_event, group_node);
+
+		parent = *node;
+
+		if (event->cpu < node_event->cpu) {
+			node = &((*node)->rb_left);
+		} else if (event->cpu > node_event->cpu) {
+			node = &((*node)->rb_right);
+		} else {
+			list_add_tail(&event->group_list_entry,
+					&node_event->group_list);
+			return 2;
+		}
+	}
+
+	list_add_tail(&event->group_list_entry, &event->group_list);
+
+	rb_link_node(&event->group_node, parent, node);
+	rb_insert_color(&event->group_node, tree);
+
+	return 1;
+}
+
+static int
+perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
+{
+	if (!tree || !event)
+		return 0;
+
+	if (RB_EMPTY_NODE(&event->group_node)) {
+		list_del_init(&event->group_list_entry);
+	} else {
+		struct perf_event *list_event, *list_next;
+
+		rb_erase(&event->group_node, tree);
+		list_for_each_entry_safe(list_event, list_next,
+				&event->group_list, group_list_entry)
+			list_del_init(&list_event->group_list_entry);
+	}
+
+	return 1;
+}
+
+static struct list_head *
+perf_cpu_tree_find(struct rb_root *tree, int cpu)
+{
+	struct rb_node *node;
+
+	if (!tree)
+		return NULL;
+
+	node = tree->rb_node;
+
+	while (node) {
+		struct perf_event *node_event = container_of(node,
+				struct perf_event, group_node);
+
+		if (cpu < node_event->cpu)
+			node = node->rb_left;
+		else if (cpu > node_event->cpu)
+			node = node->rb_right;
+		else
+			return &node_event->group_list;
+	}
+
+	return NULL;
+}
+
+static struct rb_root *
+ctx_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
+{
+	if (event->attr.pinned)
+		return &ctx->pinned_tree;
+	else
+		return &ctx->flexible_tree;
+}
+
+static inline int
+event_filter_match(struct perf_event *event);
+
+static int
+group_can_go_on(struct perf_event *event,
+		struct perf_cpu_context *cpuctx,
+		int can_add_hw);
+
+static int
+group_sched_in(struct perf_event *group_event,
+	       struct perf_cpu_context *cpuctx,
+	       struct perf_event_context *ctx);
+
+static void
+ctx_sched_in_flexible_group(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct perf_event *event, int *can_add_hw)
+{
+	if (event->state <= PERF_EVENT_STATE_OFF)
+		return;
+	if (!event_filter_match(event)) {
+		if (event->tstamp != &ctx->tstamp_data)
+			event->tstamp = &ctx->tstamp_data;
+		return;
+	}
+
+	/* may need to reset tstamp_enabled */
+	if (is_cgroup_event(event))
+		perf_cgroup_mark_enabled(event, ctx);
+
+	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
+		if (group_sched_in(event, cpuctx, ctx))
+			*can_add_hw = 0;
+	}
+}
+
+static void
+ctx_sched_in_pinned_group(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct perf_event *event)
+{
+	int can_add_hw = 1;
+
+	ctx_sched_in_flexible_group(ctx, cpuctx, event, &can_add_hw);
+	if (!can_add_hw) {
+		update_group_times(event);
+		event->state = PERF_EVENT_STATE_ERROR;
+	}
+}
+
+static void
+ctx_pinned_sched_in_groups(struct perf_event_context *ctx,
+		    struct perf_cpu_context *cpuctx,
+		    struct list_head *groups)
+{
+	struct perf_event *group;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		ctx_sched_in_pinned_group(ctx, cpuctx, group);
+}
+
+static void
+ctx_flexible_sched_in_groups(struct perf_event_context *ctx,
+		      struct perf_cpu_context *cpuctx,
+		      struct list_head *groups)
+{
+	struct perf_event *group;
+	int can_add_hw = 1;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		ctx_sched_in_flexible_group(ctx, cpuctx, group, &can_add_hw);
+}
+
+static void
+group_sched_out(struct perf_event *group_event,
+		struct perf_cpu_context *cpuctx,
+		struct perf_event_context *ctx);
+
+static void
+ctx_sched_out_groups(struct perf_event_context *ctx,
+		struct perf_cpu_context *cpuctx,
+		struct list_head *groups)
+{
+	struct perf_event *group;
+
+	list_for_each_entry(group, groups, group_list_entry)
+		group_sched_out(group, cpuctx, ctx);
+}
+
  /*
   * Add a event from the lists for its context.
   * Must be called with ctx->mutex and ctx->lock held.
@@ -1491,11 +1675,15 @@ list_add_event(struct perf_event *event, struct 
perf_event_context *ctx)
  	 */
  	if (event->group_leader == event) {
  		struct list_head *list;
+		struct rb_root *tree;

  		event->group_caps = event->event_caps;

  		list = ctx_group_list(event, ctx);
  		list_add_tail(&event->group_entry, list);
+
+		tree = ctx_cpu_tree(event, ctx);
+		perf_cpu_tree_insert(tree, event);
  	}

  	list_update_cgroup_event(event, ctx, true);
@@ -1685,8 +1873,14 @@ list_del_event(struct perf_event *event, struct 
perf_event_context *ctx)

  	list_del_rcu(&event->event_entry);

-	if (event->group_leader == event)
+	if (event->group_leader == event) {
+		struct rb_root *tree;
+
+		tree = ctx_cpu_tree(event, ctx);
+		perf_cpu_tree_delete(tree, event);
+
  		list_del_init(&event->group_entry);
+	}

  	update_group_times(event);

@@ -1811,9 +2005,9 @@ event_sched_out(struct perf_event *event,
  	 */
  	if (event->state == PERF_EVENT_STATE_INACTIVE &&
  	    !event_filter_match(event)) {
-		delta = tstamp - event->tstamp_stopped;
-		event->tstamp_running += delta;
-		event->tstamp_stopped = tstamp;
+		delta = tstamp - event->tstamp->stopped;
+		event->tstamp->running += delta;
+		event->tstamp->stopped = tstamp;
  	}

  	if (event->state != PERF_EVENT_STATE_ACTIVE)
@@ -1821,7 +2015,7 @@ event_sched_out(struct perf_event *event,

  	perf_pmu_disable(event->pmu);

-	event->tstamp_stopped = tstamp;
+	event->tstamp->stopped = tstamp;
  	event->pmu->del(event, 0);
  	event->oncpu = -1;
  	event->state = PERF_EVENT_STATE_INACTIVE;
@@ -2096,7 +2290,7 @@ event_sched_in(struct perf_event *event,
  		goto out;
  	}

-	event->tstamp_running += tstamp - event->tstamp_stopped;
+	event->tstamp->running += tstamp - event->tstamp->stopped;

  	if (!is_software_event(event))
  		cpuctx->active_oncpu++;
@@ -2168,8 +2362,8 @@ group_sched_in(struct perf_event *group_event,
  			simulate = true;

  		if (simulate) {
-			event->tstamp_running += now - event->tstamp_stopped;
-			event->tstamp_stopped = now;
+			event->tstamp->running += now - event->tstamp->stopped;
+			event->tstamp->stopped = now;
  		} else {
  			event_sched_out(event, cpuctx, ctx);
  		}
@@ -2221,43 +2415,45 @@ static void add_event_to_ctx(struct perf_event 
*event,

  	list_add_event(event, ctx);
  	perf_group_attach(event);
-	event->tstamp_enabled = tstamp;
-	event->tstamp_running = tstamp;
-	event->tstamp_stopped = tstamp;
+	event->tstamp->enabled = tstamp;
+	event->tstamp->running = tstamp;
+	event->tstamp->stopped = tstamp;
  }

  static void ctx_sched_out(struct perf_event_context *ctx,
  			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
+			  enum event_type_t event_type, int mux);
  static void
  ctx_sched_in(struct perf_event_context *ctx,
  	     struct perf_cpu_context *cpuctx,
  	     enum event_type_t event_type,
-	     struct task_struct *task);
+	     struct task_struct *task, int mux);

  static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
  			       struct perf_event_context *ctx,
  			       enum event_type_t event_type)
  {
+	int mux = 0;
+
  	if (!cpuctx->task_ctx)
  		return;

  	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
  		return;

-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, event_type, mux);
  }

  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
  				struct perf_event_context *ctx,
-				struct task_struct *task)
+				struct task_struct *task, int mux)
  {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task, mux);
  	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task, mux);
+	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task, mux);
  	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task, mux);
  }

  /*
@@ -2281,6 +2477,7 @@ static void ctx_resched(struct perf_cpu_context 
*cpuctx,
  {
  	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
  	bool cpu_event = !!(event_type & EVENT_CPU);
+	int mux = 0;

  	/*
  	 * If pinned groups are involved, flexible groups also need to be
@@ -2301,11 +2498,11 @@ static void ctx_resched(struct perf_cpu_context 
*cpuctx,
  	 *  - otherwise, do nothing more.
  	 */
  	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
+		cpu_ctx_sched_out(cpuctx, ctx_event_type, mux);
  	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);

-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx, current, mux);
  	perf_pmu_enable(cpuctx->ctx.pmu);
  }

@@ -2323,6 +2520,7 @@ static int  __perf_install_in_context(void *info)
  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
  	bool reprogram = true;
  	int ret = 0;
+	int mux = 0;

  	raw_spin_lock(&cpuctx->ctx.lock);
  	if (ctx->task) {
@@ -2349,7 +2547,7 @@ static int  __perf_install_in_context(void *info)
  	}

  	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
  		add_event_to_ctx(event, ctx);
  		ctx_resched(cpuctx, task_ctx, get_event_type(event));
  	} else {
@@ -2468,10 +2666,10 @@ static void __perf_event_mark_enabled(struct 
perf_event *event)
  	u64 tstamp = perf_event_time(event);

  	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->tstamp_enabled = tstamp - event->total_time_enabled;
+	event->tstamp->enabled = tstamp - event->total_time_enabled;
  	list_for_each_entry(sub, &event->sibling_list, group_entry) {
  		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
-			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+			sub->tstamp->enabled = tstamp - sub->total_time_enabled;
  	}
  }

@@ -2485,13 +2683,14 @@ static void __perf_event_enable(struct 
perf_event *event,
  {
  	struct perf_event *leader = event->group_leader;
  	struct perf_event_context *task_ctx;
+	int mux = 0;

  	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
  	    event->state <= PERF_EVENT_STATE_ERROR)
  		return;

  	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);

  	__perf_event_mark_enabled(event);

@@ -2501,7 +2700,7 @@ static void __perf_event_enable(struct perf_event 
*event,
  	if (!event_filter_match(event)) {
  		if (is_cgroup_event(event))
  			perf_cgroup_defer_enabled(event);
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  		return;
  	}

@@ -2510,7 +2709,7 @@ static void __perf_event_enable(struct perf_event 
*event,
  	 * then don't put it on unless the group is on.
  	 */
  	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  		return;
  	}

@@ -2706,10 +2905,12 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);

  static void ctx_sched_out(struct perf_event_context *ctx,
  			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+			  enum event_type_t event_type, int mux)
  {
  	int is_active = ctx->is_active;
  	struct perf_event *event;
+	struct list_head *groups;
+	int cpu = smp_processor_id();

  	lockdep_assert_held(&ctx->lock);

@@ -2756,13 +2957,34 @@ static void ctx_sched_out(struct 
perf_event_context *ctx,

  	perf_pmu_disable(ctx->pmu);
  	if (is_active & EVENT_PINNED) {
-		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, -1);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, cpu);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+		} else {
+			list_for_each_entry(event, &ctx->pinned_groups,
+					group_entry)
+				group_sched_out(event, cpuctx, ctx);
+		}
  	}

  	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, cpu);
+			if (groups)
+				ctx_sched_out_groups(ctx, cpuctx, groups);
+
+		} else {
+			list_for_each_entry(event, &ctx->flexible_groups,
+					group_entry)
+				group_sched_out(event, cpuctx, ctx);
+		}
  	}
  	perf_pmu_enable(ctx->pmu);
  }
@@ -3051,9 +3273,9 @@ void __perf_event_task_sched_out(struct 
task_struct *task,
   * Called with IRQs disabled
   */
  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
+			      enum event_type_t event_type, int mux)
  {
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
  }

  static void
@@ -3061,29 +3283,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
  		    struct perf_cpu_context *cpuctx)
  {
  	struct perf_event *event;
-
-	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, 1))
-			group_sched_in(event, cpuctx, ctx);
-
-		/*
-		 * If this pinned group hasn't been scheduled,
-		 * put it in error state.
-		 */
-		if (event->state == PERF_EVENT_STATE_INACTIVE) {
-			update_group_times(event);
-			event->state = PERF_EVENT_STATE_ERROR;
-		}
-	}
+	list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+		ctx_sched_in_pinned_group(ctx, cpuctx, event);
  }

  static void
@@ -3092,37 +3293,19 @@ ctx_flexible_sched_in(struct perf_event_context 
*ctx,
  {
  	struct perf_event *event;
  	int can_add_hw = 1;
-
-	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
-		/* Ignore events in OFF or ERROR state */
-		if (event->state <= PERF_EVENT_STATE_OFF)
-			continue;
-		/*
-		 * Listen to the 'cpu' scheduling filter constraint
-		 * of events:
-		 */
-		if (!event_filter_match(event))
-			continue;
-
-		/* may need to reset tstamp_enabled */
-		if (is_cgroup_event(event))
-			perf_cgroup_mark_enabled(event, ctx);
-
-		if (group_can_go_on(event, cpuctx, can_add_hw)) {
-			if (group_sched_in(event, cpuctx, ctx))
-				can_add_hw = 0;
-		}
-	}
+	list_for_each_entry(event, &ctx->flexible_groups, group_entry)
+		ctx_sched_in_flexible_group(ctx, cpuctx, event, &can_add_hw);
  }

  static void
  ctx_sched_in(struct perf_event_context *ctx,
  	     struct perf_cpu_context *cpuctx,
  	     enum event_type_t event_type,
-	     struct task_struct *task)
+	     struct task_struct *task, int mux)
  {
  	int is_active = ctx->is_active;
-	u64 now;
+	struct list_head *groups;
+	int cpu = smp_processor_id();

  	lockdep_assert_held(&ctx->lock);

@@ -3141,8 +3324,7 @@ ctx_sched_in(struct perf_event_context *ctx,

  	if (is_active & EVENT_TIME) {
  		/* start ctx time */
-		now = perf_clock();
-		ctx->timestamp = now;
+		ctx->timestamp = perf_clock();
  		perf_cgroup_set_timestamp(task, ctx);
  	}

@@ -3150,27 +3332,50 @@ ctx_sched_in(struct perf_event_context *ctx,
  	 * First go through the list and put on any pinned groups
  	 * in order to give them the best chance of going on.
  	 */
-	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_PINNED) {
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, -1);
+			if (groups)
+				ctx_pinned_sched_in_groups(ctx, cpuctx, groups);
+			groups = perf_cpu_tree_find(&ctx->pinned_tree, cpu);
+			if (groups)
+				ctx_pinned_sched_in_groups(ctx, cpuctx, groups);
+		} else {
+			ctx_pinned_sched_in(ctx, cpuctx);
+		}
+	}

  	/* Then walk through the lower prio flexible groups */
-	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+	if (is_active & EVENT_FLEXIBLE) {
+		if (mux) {
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+			if (groups)
+				ctx_flexible_sched_in_groups(ctx, cpuctx,
+						groups);
+			groups = perf_cpu_tree_find(&ctx->flexible_tree, cpu);
+			if (groups)
+				ctx_flexible_sched_in_groups(ctx, cpuctx,
+						groups);
+		} else {
+			ctx_flexible_sched_in(ctx, cpuctx);
+		}
+	}
  }

  static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
  			     enum event_type_t event_type,
-			     struct task_struct *task)
+			     struct task_struct *task, int mux)
  {
  	struct perf_event_context *ctx = &cpuctx->ctx;

-	ctx_sched_in(ctx, cpuctx, event_type, task);
+	ctx_sched_in(ctx, cpuctx, event_type, task, mux);
  }

  static void perf_event_context_sched_in(struct perf_event_context *ctx,
  					struct task_struct *task)
  {
  	struct perf_cpu_context *cpuctx;
+	int mux = 0;

  	cpuctx = __get_cpu_context(ctx);
  	if (cpuctx->task_ctx == ctx)
@@ -3187,8 +3392,8 @@ static void perf_event_context_sched_in(struct 
perf_event_context *ctx,
  	 * events, no need to flip the cpuctx's events around.
  	 */
  	if (!list_empty(&ctx->pinned_groups))
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-	perf_event_sched_in(cpuctx, ctx, task);
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
+	perf_event_sched_in(cpuctx, ctx, task, mux);
  	perf_pmu_enable(ctx->pmu);
  	perf_ctx_unlock(cpuctx, ctx);
  }
@@ -3421,14 +3626,24 @@ static void rotate_ctx(struct perf_event_context 
*ctx)
  	 * Rotate the first entry last of non-pinned groups. Rotation might be
  	 * disabled by the inheritance code.
  	 */
-	if (!ctx->rotate_disable)
-		list_rotate_left(&ctx->flexible_groups);
+	if (!ctx->rotate_disable) {
+		struct list_head *groups;
+
+		groups = perf_cpu_tree_find(&ctx->flexible_tree, -1);
+		if (groups)
+			list_rotate_left(groups);
+		groups = perf_cpu_tree_find(&ctx->flexible_tree,
+				smp_processor_id());
+		if (groups)
+			list_rotate_left(groups);
+	}
  }

  static int perf_rotate_context(struct perf_cpu_context *cpuctx)
  {
  	struct perf_event_context *ctx = NULL;
  	int rotate = 0;
+	int mux = 1;

  	if (cpuctx->ctx.nr_events) {
  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
@@ -3447,15 +3662,15 @@ static int perf_rotate_context(struct 
perf_cpu_context *cpuctx)
  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
  	perf_pmu_disable(cpuctx->ctx.pmu);

-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE, mux);
  	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, mux);

  	rotate_ctx(&cpuctx->ctx);
  	if (ctx)
  		rotate_ctx(ctx);

-	perf_event_sched_in(cpuctx, ctx, current);
+	perf_event_sched_in(cpuctx, ctx, current, mux);

  	perf_pmu_enable(cpuctx->ctx.pmu);
  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3507,6 +3722,7 @@ static void perf_event_enable_on_exec(int ctxn)
  	struct perf_event *event;
  	unsigned long flags;
  	int enabled = 0;
+	int mux = 0;

  	local_irq_save(flags);
  	ctx = current->perf_event_ctxp[ctxn];
@@ -3515,7 +3731,7 @@ static void perf_event_enable_on_exec(int ctxn)

  	cpuctx = __get_cpu_context(ctx);
  	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME, mux);
  	list_for_each_entry(event, &ctx->event_list, event_entry) {
  		enabled |= event_enable_on_exec(event, ctx);
  		event_type |= get_event_type(event);
@@ -3528,7 +3744,7 @@ static void perf_event_enable_on_exec(int ctxn)
  		clone_ctx = unclone_ctx(ctx);
  		ctx_resched(cpuctx, ctx, event_type);
  	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current, mux);
  	}
  	perf_ctx_unlock(cpuctx, ctx);

@@ -3749,7 +3965,9 @@ static void __perf_event_init_context(struct 
perf_event_context *ctx)
  	mutex_init(&ctx->mutex);
  	INIT_LIST_HEAD(&ctx->active_ctx_list);
  	INIT_LIST_HEAD(&ctx->pinned_groups);
+	ctx->pinned_tree = RB_ROOT;
  	INIT_LIST_HEAD(&ctx->flexible_groups);
+	ctx->flexible_tree = RB_ROOT;
  	INIT_LIST_HEAD(&ctx->event_list);
  	atomic_set(&ctx->refcount, 1);
  }
@@ -4848,8 +5066,8 @@ static void calc_timer_values(struct perf_event 
*event,

  	*now = perf_clock();
  	ctx_time = event->shadow_ctx_time + *now;
-	*enabled = ctx_time - event->tstamp_enabled;
-	*running = ctx_time - event->tstamp_running;
+	*enabled = ctx_time - event->tstamp->enabled;
+	*running = ctx_time - event->tstamp->running;
  }

  static void perf_event_init_userpage(struct perf_event *event)
@@ -9357,6 +9575,9 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
  	INIT_LIST_HEAD(&event->child_list);

  	INIT_LIST_HEAD(&event->group_entry);
+	RB_CLEAR_NODE(&event->group_node);
+	INIT_LIST_HEAD(&event->group_list);
+	INIT_LIST_HEAD(&event->group_list_entry);
  	INIT_LIST_HEAD(&event->event_entry);
  	INIT_LIST_HEAD(&event->sibling_list);
  	INIT_LIST_HEAD(&event->rb_entry);
@@ -9372,6 +9593,7 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
  	raw_spin_lock_init(&event->addr_filters.lock);

  	atomic_long_set(&event->refcount, 1);
+	event->tstamp		= &event->tstamp_data;
  	event->cpu		= cpu;
  	event->attr		= *attr;
  	event->group_leader	= group_leader;

next             reply	other threads:[~2017-05-27  1:18 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-26 22:13 Alexey Budankov [this message]
2017-05-27 11:19 ` [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi Alexey Budankov
2017-05-29  7:45   ` Peter Zijlstra
2017-05-29  9:24     ` Alexey Budankov
2017-05-29 10:33       ` Peter Zijlstra
2017-05-29 10:46         ` Alexey Budankov
2017-05-29  7:46   ` Peter Zijlstra
2017-05-29  9:15     ` Alexey Budankov
2017-05-29 10:43       ` Peter Zijlstra
2017-05-29 10:56         ` Alexey Budankov
2017-05-29 11:23           ` Peter Zijlstra
2017-05-29 11:45             ` Alexey Budankov
2017-06-15 17:42               ` Alexey Budankov
2017-06-21 15:39                 ` Alexey Budankov
2017-06-30 10:22                   ` Alexey Budankov
2017-05-31 21:33   ` David Carrillo-Cisneros
2017-06-14 11:27     ` Alexey Budankov
2017-05-29 12:03 ` [PATCH]: perf/core: addressing 4x slowdown during per-process " Alexander Shishkin
2017-05-29 13:43   ` Alexey Budankov
2017-05-29 15:22     ` Peter Zijlstra
2017-05-29 15:29       ` Peter Zijlstra
2017-05-29 16:41         ` Alexey Budankov
2017-05-30  8:29     ` Alexander Shishkin
2017-06-14 10:07       ` Alexey Budankov
2017-06-15 17:44         ` Alexey Budankov

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:24a6358 dfblob:6bd380f dfblob:f8c27d3 dfblob:9168406 )
 OR (
bs:"[PATCH]: perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1e962b59-3e39-e0d6-515d-c4fd3502edae@linux.intel.com \
    --to=alexey.budankov@linux.intel.com \
    --cc=Dmitry.Prohorov@intel.com \
    --cc=acme@kernel.org \
    --cc=ak@linux.intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=davidcc@google.com \
    --cc=eranian@google.com \
    --cc=kan.liang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=valery.cherepennikov@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.