* [RFC PATCH 0/4] perf/core: Small event scheduling changes
@ 2009-11-08 20:13 Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 1/4] perf/core: split context's event group list into pinned and non-pinned lists Frederic Weisbecker
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-08 20:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mike Galbraith, Paul Mackerras,
Thomas Gleixner
Hi,
This is an RFC patchset, only compile-tested, just to make sure I'm taking
a good direction before going further.
It is intended to rework the perf event scheduling a bit to
guarantee real priority of the pinned events over the volatile
(non-pinned) ones. This patchset handles such priority at task tick
time only. But if the idea is agreed on, I could expand that to every
task event sched-in call to guarantee the priority at every
event rescheduling point.
Thanks.
Frederic Weisbecker (4):
perf/core: split context's event group list into pinned and
non-pinned lists
perf/core: Optimize a bit rotate_ctx()
perf/core: Split up pinned and non pinned processing
perf/core: Schedule all pinned events before the non-pinned
include/linux/perf_event.h | 3 +-
kernel/perf_event.c | 283 ++++++++++++++++++++++++++++++++-----------
2 files changed, 212 insertions(+), 74 deletions(-)
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH 1/4] perf/core: split context's event group list into pinned and non-pinned lists
2009-11-08 20:13 [RFC PATCH 0/4] perf/core: Small event scheduling changes Frederic Weisbecker
@ 2009-11-08 20:13 ` Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 2/4] perf/core: Optimize a bit rotate_ctx() Frederic Weisbecker
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-08 20:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mike Galbraith, Paul Mackerras,
Thomas Gleixner
Split up struct perf_event_context::group_list into pinned_grp_list
and volatile_grp_list (non-pinned).
At first this appears useless, as it duplicates various loops around
the group list handling.
But it scales better in the fast path in perf_sched_in(). We no
longer iterate twice through the entire list to separate pinned and
non-pinned scheduling. Instead we iterate through two distinct lists.
The other desired effect is that it makes it easier to apply distinct
scheduling rules to each.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/perf_event.h | 3 +-
kernel/perf_event.c | 177 +++++++++++++++++++++++++++++++-------------
2 files changed, 127 insertions(+), 53 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6ff7c3b..659351c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -662,7 +662,8 @@ struct perf_event_context {
*/
struct mutex mutex;
- struct list_head group_list;
+ struct list_head pinned_grp_list;
+ struct list_head volatile_grp_list;
struct list_head event_list;
int nr_events;
int nr_active;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 6f4ed3b..b3a31c8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -259,9 +259,15 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
* add it straight to the context's event list, or to the group
* leader's sibling list:
*/
- if (group_leader == event)
- list_add_tail(&event->group_entry, &ctx->group_list);
- else {
+ if (group_leader == event) {
+ struct list_head *list;
+
+ if (event->attr.pinned)
+ list = &ctx->pinned_grp_list;
+ else
+ list = &ctx->volatile_grp_list;
+ list_add_tail(&event->group_entry, list);
+ } else {
list_add_tail(&event->group_entry, &group_leader->sibling_list);
group_leader->nr_siblings++;
}
@@ -299,8 +305,14 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
* to the context list directly:
*/
list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
+ struct list_head *list;
+
+ if (sibling->attr.pinned)
+ list = &ctx->pinned_grp_list;
+ else
+ list = &ctx->volatile_grp_list;
- list_move_tail(&sibling->group_entry, &ctx->group_list);
+ list_move_tail(&sibling->group_entry, list);
sibling->group_leader = sibling;
}
}
@@ -1032,10 +1044,14 @@ void __perf_event_sched_out(struct perf_event_context *ctx,
update_context_time(ctx);
perf_disable();
- if (ctx->nr_active)
- list_for_each_entry(event, &ctx->group_list, group_entry)
+ if (ctx->nr_active) {
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry)
group_sched_out(event, cpuctx, ctx);
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry)
+ group_sched_out(event, cpuctx, ctx);
+ }
+
perf_enable();
out:
spin_unlock(&ctx->lock);
@@ -1249,9 +1265,8 @@ __perf_event_sched_in(struct perf_event_context *ctx,
* First go through the list and put on any pinned groups
* in order to give them the best chance of going on.
*/
- list_for_each_entry(event, &ctx->group_list, group_entry) {
- if (event->state <= PERF_EVENT_STATE_OFF ||
- !event->attr.pinned)
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
+ if (event->state <= PERF_EVENT_STATE_OFF)
continue;
if (event->cpu != -1 && event->cpu != cpu)
continue;
@@ -1269,13 +1284,12 @@ __perf_event_sched_in(struct perf_event_context *ctx,
}
}
- list_for_each_entry(event, &ctx->group_list, group_entry) {
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
/*
* Ignore events in OFF or ERROR state, and
* ignore pinned events since we did them already.
*/
- if (event->state <= PERF_EVENT_STATE_OFF ||
- event->attr.pinned)
+ if (event->state <= PERF_EVENT_STATE_OFF)
continue;
/*
@@ -1428,8 +1442,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
* Rotate the first entry last (works just fine for group events too):
*/
perf_disable();
- list_for_each_entry(event, &ctx->group_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->group_list);
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
+ list_move_tail(&event->group_entry, &ctx->pinned_grp_list);
+ break;
+ }
+
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
+ list_move_tail(&event->group_entry, &ctx->volatile_grp_list);
break;
}
perf_enable();
@@ -1465,6 +1484,22 @@ void perf_event_task_tick(struct task_struct *curr, int cpu)
perf_event_task_sched_in(curr, cpu);
}
+static void __perf_event_enable_on_exec(struct perf_event *event,
+ struct perf_event_context *ctx,
+ int *enabled)
+{
+ if (!event->attr.enable_on_exec)
+ return;
+
+ event->attr.enable_on_exec = 0;
+ if (event->state >= PERF_EVENT_STATE_INACTIVE)
+ return;
+
+ __perf_event_mark_enabled(event, ctx);
+
+ *enabled = 1;
+}
+
/*
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
@@ -1485,15 +1520,11 @@ static void perf_event_enable_on_exec(struct task_struct *task)
spin_lock(&ctx->lock);
- list_for_each_entry(event, &ctx->group_list, group_entry) {
- if (!event->attr.enable_on_exec)
- continue;
- event->attr.enable_on_exec = 0;
- if (event->state >= PERF_EVENT_STATE_INACTIVE)
- continue;
- __perf_event_mark_enabled(event, ctx);
- enabled = 1;
- }
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry)
+ __perf_event_enable_on_exec(event, ctx, &enabled);
+
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry)
+ __perf_event_enable_on_exec(event, ctx, &enabled);
/*
* Unclone this context if we enabled any event.
@@ -1562,7 +1593,8 @@ __perf_event_init_context(struct perf_event_context *ctx,
memset(ctx, 0, sizeof(*ctx));
spin_lock_init(&ctx->lock);
mutex_init(&ctx->mutex);
- INIT_LIST_HEAD(&ctx->group_list);
+ INIT_LIST_HEAD(&ctx->pinned_grp_list);
+ INIT_LIST_HEAD(&ctx->volatile_grp_list);
INIT_LIST_HEAD(&ctx->event_list);
atomic_set(&ctx->refcount, 1);
ctx->task = task;
@@ -4869,7 +4901,11 @@ void perf_event_exit_task(struct task_struct *child)
mutex_lock_nested(&child_ctx->mutex, SINGLE_DEPTH_NESTING);
again:
- list_for_each_entry_safe(child_event, tmp, &child_ctx->group_list,
+ list_for_each_entry_safe(child_event, tmp, &child_ctx->pinned_grp_list,
+ group_entry)
+ __perf_event_exit_task(child_event, child_ctx, child);
+
+ list_for_each_entry_safe(child_event, tmp, &child_ctx->volatile_grp_list,
group_entry)
__perf_event_exit_task(child_event, child_ctx, child);
@@ -4878,7 +4914,8 @@ again:
* its siblings to the list, but we obtained 'tmp' before that which
* will still point to the list head terminating the iteration.
*/
- if (!list_empty(&child_ctx->group_list))
+ if (!list_empty(&child_ctx->pinned_grp_list) ||
+ !list_empty(&child_ctx->volatile_grp_list))
goto again;
mutex_unlock(&child_ctx->mutex);
@@ -4886,6 +4923,24 @@ again:
put_ctx(child_ctx);
}
+static void perf_event_free_event(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *parent = event->parent;
+
+ if (WARN_ON_ONCE(!parent))
+ return;
+
+ mutex_lock(&parent->child_mutex);
+ list_del_init(&event->child_list);
+ mutex_unlock(&parent->child_mutex);
+
+ fput(parent->filp);
+
+ list_del_event(event, ctx);
+ free_event(event);
+}
+
/*
* free an unexposed, unused context as created by inheritance by
* init_task below, used by fork() in case of fail.
@@ -4900,23 +4955,15 @@ void perf_event_free_task(struct task_struct *task)
mutex_lock(&ctx->mutex);
again:
- list_for_each_entry_safe(event, tmp, &ctx->group_list, group_entry) {
- struct perf_event *parent = event->parent;
-
- if (WARN_ON_ONCE(!parent))
- continue;
-
- mutex_lock(&parent->child_mutex);
- list_del_init(&event->child_list);
- mutex_unlock(&parent->child_mutex);
+ list_for_each_entry_safe(event, tmp, &ctx->pinned_grp_list, group_entry)
+ perf_event_free_event(event, ctx);
- fput(parent->filp);
-
- list_del_event(event, ctx);
- free_event(event);
- }
+ list_for_each_entry_safe(event, tmp, &ctx->volatile_grp_list,
+ group_entry)
+ perf_event_free_event(event, ctx);
- if (!list_empty(&ctx->group_list))
+ if (!list_empty(&ctx->pinned_grp_list) ||
+ !list_empty(&ctx->volatile_grp_list))
goto again;
mutex_unlock(&ctx->mutex);
@@ -4924,6 +4971,29 @@ again:
put_ctx(ctx);
}
+static int
+perf_event_inherit(struct perf_event *event, struct task_struct *parent,
+ struct perf_event_context *parent_ctx,
+ struct task_struct *child,
+ struct perf_event_context *child_ctx,
+ int *inherited_all)
+{
+ int ret;
+
+ if (!event->attr.inherit) {
+ *inherited_all = 0;
+ return 0;
+ }
+
+ ret = inherit_group(event, parent, parent_ctx,
+ child, child_ctx);
+ if (ret)
+ *inherited_all = 0;
+
+ return ret;
+}
+
+
/*
* Initialize the perf_event context in task_struct
*/
@@ -4981,19 +5051,20 @@ int perf_event_init_task(struct task_struct *child)
* We dont have to disable NMIs - we are only looking at
* the list, not manipulating it:
*/
- list_for_each_entry(event, &parent_ctx->group_list, group_entry) {
+ list_for_each_entry(event, &parent_ctx->pinned_grp_list, group_entry) {
- if (!event->attr.inherit) {
- inherited_all = 0;
- continue;
- }
+ ret = perf_event_inherit(event, parent, parent_ctx, child,
+ child_ctx, &inherited_all);
+ if (ret)
+ break;
+ }
+
+ list_for_each_entry(event, &parent_ctx->volatile_grp_list, group_entry) {
- ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
- if (ret) {
- inherited_all = 0;
+ ret = perf_event_inherit(event, parent, parent_ctx, child,
+ child_ctx, &inherited_all);
+ if (ret)
break;
- }
}
if (inherited_all) {
@@ -5044,7 +5115,9 @@ static void __perf_event_exit_cpu(void *info)
struct perf_event_context *ctx = &cpuctx->ctx;
struct perf_event *event, *tmp;
- list_for_each_entry_safe(event, tmp, &ctx->group_list, group_entry)
+ list_for_each_entry_safe(event, tmp, &ctx->pinned_grp_list, group_entry)
+ __perf_event_remove_from_context(event);
+ list_for_each_entry_safe(event, tmp, &ctx->volatile_grp_list, group_entry)
__perf_event_remove_from_context(event);
}
static void perf_event_exit_cpu(int cpu)
--
1.6.2.3
* [RFC PATCH 2/4] perf/core: Optimize a bit rotate_ctx()
2009-11-08 20:13 [RFC PATCH 0/4] perf/core: Small event scheduling changes Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 1/4] perf/core: split context's event group list into pinned and non-pinned lists Frederic Weisbecker
@ 2009-11-08 20:13 ` Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing Frederic Weisbecker
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-08 20:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mike Galbraith, Paul Mackerras,
Thomas Gleixner
Don't use list_for_each_entry() just to rotate the first entry of the
list to the tail.
Instead, dereference list->next directly. This saves a useless
prefetch.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
kernel/perf_event.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index b3a31c8..0432c1c 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1432,7 +1432,7 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx)
*/
static void rotate_ctx(struct perf_event_context *ctx)
{
- struct perf_event *event;
+ struct list_head *list;
if (!ctx->nr_events)
return;
@@ -1442,15 +1442,17 @@ static void rotate_ctx(struct perf_event_context *ctx)
* Rotate the first entry last (works just fine for group events too):
*/
perf_disable();
- list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->pinned_grp_list);
- break;
+
+ if (!list_empty(&ctx->pinned_grp_list)) {
+ list = ctx->pinned_grp_list.next;
+ list_move_tail(list, &ctx->pinned_grp_list);
}
- list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->volatile_grp_list);
- break;
+ if (!list_empty(&ctx->volatile_grp_list)) {
+ list = ctx->volatile_grp_list.next;
+ list_move_tail(list, &ctx->volatile_grp_list);
}
+
perf_enable();
spin_unlock(&ctx->lock);
--
1.6.2.3
* [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing
2009-11-08 20:13 [RFC PATCH 0/4] perf/core: Small event scheduling changes Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 1/4] perf/core: split context's event group list into pinned and non-pinned lists Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 2/4] perf/core: Optimize a bit rotate_ctx() Frederic Weisbecker
@ 2009-11-08 20:13 ` Frederic Weisbecker
2009-11-10 5:11 ` Ingo Molnar
2009-11-08 20:13 ` [RFC PATCH 4/4] perf/core: Schedule all pinned events before the non-pinned Frederic Weisbecker
2009-11-10 5:13 ` [RFC PATCH 0/4] perf/core: Small event scheduling changes Ingo Molnar
4 siblings, 1 reply; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-08 20:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mike Galbraith, Paul Mackerras,
Thomas Gleixner
Split up pinned and non-pinned event processing into two helpers
so that it's more flexible to handle them separately.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
kernel/perf_event.c | 51 +++++++++++++++++++++++++++++++++++----------------
1 files changed, 35 insertions(+), 16 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 0432c1c..50f2997 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1246,25 +1246,11 @@ static void perf_event_cpu_sched_out(struct perf_cpu_context *cpuctx)
}
static void
-__perf_event_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx, int cpu)
+__perf_event_sched_in_pinned(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
{
struct perf_event *event;
- int can_add_hw = 1;
-
- spin_lock(&ctx->lock);
- ctx->is_active = 1;
- if (likely(!ctx->nr_events))
- goto out;
-
- ctx->timestamp = perf_clock();
-
- perf_disable();
- /*
- * First go through the list and put on any pinned groups
- * in order to give them the best chance of going on.
- */
list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
if (event->state <= PERF_EVENT_STATE_OFF)
continue;
@@ -1283,6 +1269,14 @@ __perf_event_sched_in(struct perf_event_context *ctx,
event->state = PERF_EVENT_STATE_ERROR;
}
}
+}
+
+static void
+__perf_event_sched_in_volatile(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ int can_add_hw = 1;
+ struct perf_event *event;
list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
/*
@@ -1303,6 +1297,31 @@ __perf_event_sched_in(struct perf_event_context *ctx,
if (group_sched_in(event, cpuctx, ctx, cpu))
can_add_hw = 0;
}
+}
+
+static void
+__perf_event_sched_in(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ spin_lock(&ctx->lock);
+ ctx->is_active = 1;
+ if (likely(!ctx->nr_events))
+ goto out;
+
+ ctx->timestamp = perf_clock();
+
+ perf_disable();
+
+ /*
+ * First go through the list and put on any pinned groups
+ * in order to give them the best chance of going on.
+ */
+ __perf_event_sched_in_pinned(ctx, cpuctx, cpu);
+
+ /* Then handle the non-pinned groups */
+ __perf_event_sched_in_volatile(ctx, cpuctx, cpu);
+
+
perf_enable();
out:
spin_unlock(&ctx->lock);
--
1.6.2.3
* [RFC PATCH 4/4] perf/core: Schedule all pinned events before the non-pinned
2009-11-08 20:13 [RFC PATCH 0/4] perf/core: Small event scheduling changes Frederic Weisbecker
` (2 preceding siblings ...)
2009-11-08 20:13 ` [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing Frederic Weisbecker
@ 2009-11-08 20:13 ` Frederic Weisbecker
2009-11-10 10:10 ` Peter Zijlstra
2009-11-10 5:13 ` [RFC PATCH 0/4] perf/core: Small event scheduling changes Ingo Molnar
4 siblings, 1 reply; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-08 20:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
Arnaldo Carvalho de Melo, Mike Galbraith, Paul Mackerras,
Thomas Gleixner
Currently, the order in which events are scheduled is as follows:
- cpu context pinned events
- cpu context non-pinned events
- task context pinned events
- task context non-pinned events
What we want instead is to schedule all pinned events first, because
those have a higher priority.
This is what this patch does on each task tick. If the approach is
agreed on, we may want to expand this to task-only sched-in (where the
cpu events are not scheduled out), fork, exec, etc., so that we
guarantee the pinned priority every time the task is scheduled in.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
kernel/perf_event.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 50f2997..f32aec4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1327,6 +1327,41 @@ __perf_event_sched_in(struct perf_event_context *ctx,
spin_unlock(&ctx->lock);
}
+static void
+__perf_event_sched_in_all(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ struct perf_event_context *cpu_ctx = &cpuctx->ctx;
+
+ /* May require different classes between cpu and task lock */
+ spin_lock(&cpu_ctx->lock);
+ spin_lock(&ctx->lock);
+ cpu_ctx->is_active = ctx->is_active = 1;
+
+ ctx->timestamp = cpu_ctx->timestamp = perf_clock();
+
+ perf_disable();
+
+ if (cpu_ctx->nr_events)
+ __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
+
+ if (ctx->nr_events)
+ __perf_event_sched_in_pinned(ctx, cpuctx, cpu);
+
+ if (cpu_ctx->nr_events)
+ __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
+
+ if (ctx->nr_events)
+ __perf_event_sched_in_volatile(ctx, cpuctx, cpu);
+
+ cpuctx->task_ctx = ctx;
+
+ perf_enable();
+
+ spin_unlock(&ctx->lock);
+ spin_lock(&cpu_ctx->lock);
+}
+
/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
@@ -1477,6 +1512,16 @@ static void rotate_ctx(struct perf_event_context *ctx)
spin_unlock(&ctx->lock);
}
+static void
+perf_event_sched_in_all(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ if (!ctx || ctx == cpuctx->task_ctx)
+ perf_event_cpu_sched_in(cpuctx, cpu);
+ else
+ __perf_event_sched_in_all(ctx, cpuctx, cpu);
+}
+
void perf_event_task_tick(struct task_struct *curr, int cpu)
{
struct perf_cpu_context *cpuctx;
@@ -1500,9 +1545,7 @@ void perf_event_task_tick(struct task_struct *curr, int cpu)
if (ctx)
rotate_ctx(ctx);
- perf_event_cpu_sched_in(cpuctx, cpu);
- if (ctx)
- perf_event_task_sched_in(curr, cpu);
+ perf_event_sched_in_all(ctx, cpuctx, cpu);
}
static void __perf_event_enable_on_exec(struct perf_event *event,
--
1.6.2.3
* Re: [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing
2009-11-08 20:13 ` [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing Frederic Weisbecker
@ 2009-11-10 5:11 ` Ingo Molnar
2009-11-10 9:32 ` Frederic Weisbecker
0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2009-11-10 5:11 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
* Frederic Weisbecker <fweisbec@gmail.com> wrote:
> Split up pinned and non-pinned event processing into two helpers
> so that it's more flexible to handle them separately.
> +static void
> +__perf_event_sched_in_volatile(struct perf_event_context *ctx,
> + struct perf_cpu_context *cpuctx, int cpu)
Small naming suggestion: 'volatile' is a C keyword and rarely used
outside of that context in the kernel, which makes this function name a
bit confusing.
So instead of pinned/volatile, a pinned/flexible naming would be more
readable, i.e. __perf_event_sched_in_flexible() or so.
Also, most of the static functions in kernel/perf_event.c could lose
their perf_event_ prefix - we already know it's a perf thing, right?
That will shorten quite a few function names there.
These functions would turn into __sched_in_pinned()/__sched_in_flexible().
Agreed?
Ingo
* Re: [RFC PATCH 0/4] perf/core: Small event scheduling changes
2009-11-08 20:13 [RFC PATCH 0/4] perf/core: Small event scheduling changes Frederic Weisbecker
` (3 preceding siblings ...)
2009-11-08 20:13 ` [RFC PATCH 4/4] perf/core: Schedule all pinned events before the non-pinned Frederic Weisbecker
@ 2009-11-10 5:13 ` Ingo Molnar
2009-11-10 9:41 ` Frederic Weisbecker
4 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2009-11-10 5:13 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
* Frederic Weisbecker <fweisbec@gmail.com> wrote:
> Hi,
>
> This is an RFC patchset, only compile-tested, just to make sure I'm taking
> a good direction before going further.
>
> It is intended to rework the perf event scheduling a bit to
> guarantee real priority of the pinned events over the volatile ones.
> This patchset handles such priority at task tick time only. But if the
> idea is agreed on, I could expand that to every task event sched-in call
> to guarantee the priority at every event rescheduling point.
>
> Thanks.
>
> Frederic Weisbecker (4):
> perf/core: split context's event group list into pinned and
> non-pinned lists
> perf/core: Optimize a bit rotate_ctx()
> perf/core: Split up pinned and non pinned processing
> perf/core: Schedule all pinned events before the non-pinned
>
> include/linux/perf_event.h | 3 +-
> kernel/perf_event.c | 283 ++++++++++++++++++++++++++++++++-----------
> 2 files changed, 212 insertions(+), 74 deletions(-)
Sans the small naming suggestions i had, the general principle looks
good to me - it's a nice restructuring of the various scheduling rules
we have for events.
Ingo
* Re: [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing
2009-11-10 5:11 ` Ingo Molnar
@ 2009-11-10 9:32 ` Frederic Weisbecker
0 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-10 9:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
On Tue, Nov 10, 2009 at 06:11:41AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> > Split up pinned and non-pinned event processing into two helpers
> > so that it's more flexible to handle them separately.
>
> > +static void
> > +__perf_event_sched_in_volatile(struct perf_event_context *ctx,
> > + struct perf_cpu_context *cpuctx, int cpu)
>
> Small naming suggestion: 'volatile' is a C keyword and rarely used
> outside of that context in the kernel, which makes this function name a
> bit confusing.
>
> So instead of pinned/volatile, a pinned/flexible naming would be more
> readable, i.e. __perf_event_sched_in_flexible() or so.
Right, also that makes it consistent with the hw-breakpoint constraints
naming.
> Also, most of the static functions in kernel/perf_event.c could lose
> their perf_event_ prefix - we already know it's a perf thing, right?
> That will shorten quite a few function names there.
>
> These functions would turn into __sched_in_pinned()/__sched_in_flexible().
>
> Agreed?
Totally.
I'll prepare a new iteration, thanks!
* Re: [RFC PATCH 0/4] perf/core: Small event scheduling changes
2009-11-10 5:13 ` [RFC PATCH 0/4] perf/core: Small event scheduling changes Ingo Molnar
@ 2009-11-10 9:41 ` Frederic Weisbecker
2009-11-10 9:47 ` Frederic Weisbecker
0 siblings, 1 reply; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-10 9:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
On Tue, Nov 10, 2009 at 06:13:21AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> > Hi,
> >
> > This is an RFC patchset, only compile-tested, just to make sure I'm taking
> > a good direction before going further.
> >
> > It is intended to rework the perf event scheduling a bit to
> > guarantee real priority of the pinned events over the volatile ones.
> > This patchset handles such priority at task tick time only. But if the
> > idea is agreed on, I could expand that to every task event sched-in call
> > to guarantee the priority at every event rescheduling point.
> >
> > Thanks.
> >
> > Frederic Weisbecker (4):
> > perf/core: split context's event group list into pinned and
> > non-pinned lists
> > perf/core: Optimize a bit rotate_ctx()
> > perf/core: Split up pinned and non pinned processing
> > perf/core: Schedule all pinned events before the non-pinned
> >
> > include/linux/perf_event.h | 3 +-
> > kernel/perf_event.c | 283 ++++++++++++++++++++++++++++++++-----------
> > 2 files changed, 212 insertions(+), 74 deletions(-)
>
> Sans the small naming suggestions i had, the general principle looks
> good to me - it's a nice restructuring of the various scheduling rules
> we have for events.
>
> Ingo
Thanks.
With this draft, the pinned priority becomes more consistent
with its purpose, but it doesn't yet bring the full pinned over
flexible priority determinism.
It applies the priority at tick time, while we round-robin.
I did it there first because it covers most of the event
rescheduling actions, and (in theory) it doesn't bring much more
overhead than the previous layout; it just changes the order.
I'll also try to expand the priority constraint each time we
sched in a task: when we schedule in a new task that belongs to
a new context, we don't schedule out/in the cpu context, but
that will be needed if we want the full priority determinism.
Anyway, I'll do that progressively.
Frederic.
* Re: [RFC PATCH 0/4] perf/core: Small event scheduling changes
2009-11-10 9:41 ` Frederic Weisbecker
@ 2009-11-10 9:47 ` Frederic Weisbecker
0 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-10 9:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
On Tue, Nov 10, 2009 at 10:41:45AM +0100, Frederic Weisbecker wrote:
> Thanks.
>
> With this draft, the pinned priority becomes more consistent
> with its purpose, but it doesn't yet bring the full pinned over
> flexible priority determinism.
>
> It applies the priority at tick time, while we round-robin.
> I did it there first because it covers most of the event
> rescheduling actions, and (in theory) it doesn't bring much more
> overhead than the previous layout; it just changes the order.
>
> I'll also try to expand the priority constraint each time we
> sched in a task: when we schedule in a new task that belongs to
> a new context, we don't schedule out/in the cpu context, but
> that will be needed if we want the full priority determinism.
To lower the overhead at non-tick time, we could even just reschedule
the cpu flexible events. Anyway...
>
> Anyway, I'll do that progressively.
>
> Frederic.
>
* Re: [RFC PATCH 4/4] perf/core: Schedule all pinned events before the non-pinned
2009-11-08 20:13 ` [RFC PATCH 4/4] perf/core: Schedule every pinned events before the the non-pinned Frederic Weisbecker
@ 2009-11-10 10:10 ` Peter Zijlstra
2009-11-10 10:34 ` Frederic Weisbecker
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2009-11-10 10:10 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
On Sun, 2009-11-08 at 21:13 +0100, Frederic Weisbecker wrote:
> +static void
> +__perf_event_sched_in_all(struct perf_event_context *ctx,
> + struct perf_cpu_context *cpuctx, int cpu)
> +{
> + struct perf_event_context *cpu_ctx = &cpuctx->ctx;
> +
> + /* May require different classes between cpu and task lock */
> + spin_lock(&cpu_ctx->lock);
> + spin_lock(&ctx->lock);
Would be good to know for sure, running with lockdep enabled ought to
tell you that pretty quick ;-)
> + cpu_ctx->is_active = ctx->is_active = 1;
> +
> + ctx->timestamp = cpu_ctx->timestamp = perf_clock();
> +
> + perf_disable();
> +
> + if (cpu_ctx->nr_events)
> + __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
> +
> + if (ctx->nr_events)
> > + __perf_event_sched_in_pinned(ctx, cpuctx, cpu);
> +
> + if (cpu_ctx->nr_events)
> + __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
> +
> + if (ctx->nr_events)
> > + __perf_event_sched_in_volatile(ctx, cpuctx, cpu);
> +
> + cpuctx->task_ctx = ctx;
> +
> + perf_enable();
> +
> + spin_unlock(&ctx->lock);
> + spin_lock(&cpu_ctx->lock);
I'm pretty sure that ought to be spin_unlock() ;-)
> +}
Like Ingo I don't really like the volatile name.
Can't we simply have 2 lists per cpu, a pinned and a normal list, and first
schedule all the pinned and RR the normal events?
I guess one could implement that by adding the task context events to
the cpu context events on sched_in and removing them on sched_out. That
would clear up a lot of funny scheduling details.
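A minimal userspace sketch of that two-list idea (all names here — toy_event, toy_ctx, toy_sched_in, toy_rotate — are made up for illustration, not the kernel API): pinned events are granted counters first, flexible ("normal") events fill whatever is left, and the flexible list is rotated so starved events get a turn on the next tick:

```c
#include <assert.h>
#include <stddef.h>

#define NR_COUNTERS 3

struct toy_event {
	int scheduled;
	struct toy_event *next;
};

struct toy_ctx {
	struct toy_event *pinned;	/* always-on events, scheduled first */
	struct toy_event *flexible;	/* best-effort events, round-robined */
};

/* Schedule as many events as counters allow; the pinned list has priority. */
static int toy_sched_in(struct toy_ctx *ctx)
{
	int free = NR_COUNTERS;
	struct toy_event *e;

	for (e = ctx->pinned; e && free; e = e->next, free--)
		e->scheduled = 1;
	for (e = ctx->flexible; e && free; e = e->next, free--)
		e->scheduled = 1;
	return NR_COUNTERS - free;	/* events that got a counter */
}

/* Rotate the flexible list so starved events come first on the next tick. */
static void toy_rotate(struct toy_ctx *ctx)
{
	struct toy_event *head = ctx->flexible, **tail;

	if (!head || !head->next)
		return;
	for (tail = &head->next; *tail; tail = &(*tail)->next)
		;
	ctx->flexible = head->next;
	head->next = NULL;
	*tail = head;
}
```

With the task context's lists glued onto the cpu context's as suggested above, a single pass of this kind would cover both contexts.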
* Re: [RFC PATCH 4/4] perf/core: Schedule every pinned events before the the non-pinned
2009-11-10 10:10 ` Peter Zijlstra
@ 2009-11-10 10:34 ` Frederic Weisbecker
0 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2009-11-10 10:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Arnaldo Carvalho de Melo, Mike Galbraith,
Paul Mackerras, Thomas Gleixner
On Tue, Nov 10, 2009 at 11:10:13AM +0100, Peter Zijlstra wrote:
> On Sun, 2009-11-08 at 21:13 +0100, Frederic Weisbecker wrote:
>
> > +static void
> > +__perf_event_sched_in_all(struct perf_event_context *ctx,
> > + struct perf_cpu_context *cpuctx, int cpu)
> > +{
> > + struct perf_event_context *cpu_ctx = &cpuctx->ctx;
> > +
> > + /* May require different classes between cpu and task lock */
> > + spin_lock(&cpu_ctx->lock);
> > + spin_lock(&ctx->lock);
>
> Would be good to know for sure, running with lockdep enabled ought to
> tell you that pretty quick ;-)
That's almost certain, I guess :)
I just wanted to take care of that after your comments.
> > + cpu_ctx->is_active = ctx->is_active = 1;
> > +
> > + ctx->timestamp = cpu_ctx->timestamp = perf_clock();
> > +
> > + perf_disable();
> > +
> > + if (cpu_ctx->nr_events)
> > + __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
> > +
> > + if (ctx->nr_events)
> > + __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
> > +
> > + if (cpu_ctx->nr_events)
> > + __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
> > +
> > + if (ctx->nr_events)
> > + __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
> > +
> > + cpuctx->task_ctx = ctx;
> > +
> > + perf_enable();
> > +
> > + spin_unlock(&ctx->lock);
> > + spin_lock(&cpu_ctx->lock);
>
> I'm pretty sure that ought to be spin_unlock() ;-)
Indeed :)
> > +}
>
>
> Like Ingo I don't really like the volatile name.
>
> Can't we simply have 2 lists per cpu, a pinned and a normal list, and first
> schedule all the pinned and RR the normal events?
>
> I guess one could implement that by adding the task context events to
> the cpu context events on sched_in and removing them on sched_out. That
> would clear up a lot of funny scheduling details.
I thought about doing that, but didn't expand the idea much,
because of the list manipulation it induces.
But you're right, that would indeed be more proper.
I can just save the "real" cpu event group tail in the
struct perf_cpu_context so that I can keep track of the real
state and (un)glue the queues easily.
Yeah, I'll try that, thanks!
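A minimal userspace sketch of that (un)glue idea (the names node, cpu_list, glue, unglue are made up for illustration — nothing here is the real perf_cpu_context layout): remember the cpu list's "real" tail when splicing the task events in, so sched-out can cut the list back at exactly that point:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

struct cpu_list {
	struct node *head;
	struct node *saved_tail;	/* last "real" cpu event, recorded at glue time */
};

/* Sched-in: append the task events after the cpu list's current tail. */
static void glue(struct cpu_list *cpu, struct node *task_head)
{
	struct node *t = cpu->head;

	if (!t) {
		/* cpu list empty: task events become the whole list */
		cpu->head = task_head;
		cpu->saved_tail = NULL;
		return;
	}
	while (t->next)
		t = t->next;
	cpu->saved_tail = t;
	t->next = task_head;
}

/* Sched-out: cut the list at the saved tail, restoring the cpu-only state. */
static void unglue(struct cpu_list *cpu)
{
	if (cpu->saved_tail) {
		cpu->saved_tail->next = NULL;
		cpu->saved_tail = NULL;
	} else {
		cpu->head = NULL;	/* assumes everything present was glued task events */
	}
}
```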
Thread overview: 12+ messages
2009-11-08 20:13 [RFC PATCH 0/4] perf/core: Small event scheduling changes Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 1/4] perf/core: split context's event group list into pinned and non-pinned lists Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 2/4] perf/core: Optimize a bit rotate_ctx() Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 3/4] perf/core: Split up pinned and non pinned processing Frederic Weisbecker
2009-11-10 5:11 ` Ingo Molnar
2009-11-10 9:32 ` Frederic Weisbecker
2009-11-08 20:13 ` [RFC PATCH 4/4] perf/core: Schedule every pinned events before the the non-pinned Frederic Weisbecker
2009-11-10 10:10 ` Peter Zijlstra
2009-11-10 10:34 ` Frederic Weisbecker
2009-11-10 5:13 ` [RFC PATCH 0/4] perf/core: Small event scheduling changes Ingo Molnar
2009-11-10 9:41 ` Frederic Weisbecker
2009-11-10 9:47 ` Frederic Weisbecker