All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, alexander.shishkin@linux.intel.com, eranian@google.com
Cc: linux-kernel@vger.kernel.org, vince@deater.net,
	dvyukov@google.com, andi@firstfloor.org, jolsa@redhat.com,
	peterz@infradead.org
Subject: [RFC][PATCH 10/12] perf: Fix task context scheduling
Date: Mon, 11 Jan 2016 17:25:08 +0100	[thread overview]
Message-ID: <20160111163229.293220067@infradead.org> (raw)
In-Reply-To: 20160111162458.427203780@infradead.org

[-- Attachment #1: peterz-perf-fixes-12.patch --]
[-- Type: text/plain, Size: 8154 bytes --]

There is a very nasty problem wrt disabling the perf task scheduling
hooks.

Currently we {set,clear} ctx->is_active on every
__perf_event_task_sched_{in,out}, _however_ this means that if we
disable these calls we'll have task contexts with ->is_active set that
are not active and 'active' task contexts without ->is_active set.

This can result in event_function_call() looping on the ctx->is_active
condition basically indefinitely.

Resolve this by changing things such that contexts without events do
not set ->is_active like we used to. From this invariant it trivially
follows that if there are no (task) events, every task ctx is inactive
and disabling the context switch hooks is harmless.

This leaves two places that need attention (and already had
accumulated weird and wonderful hacks to work around, without
recognising this actual problem).

Namely:

 - perf_install_in_context() will need to deal with installing events
   in an inactive context, meaning it cannot rely on ctx-is_active for
   its IPIs.

 - perf_remove_from_context() will have to mark a context as inactive
   when it removes the last event.

For specific detail, see the patch/comments.

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |  155 +++++++++++++++++++++++++++++----------------------
 1 file changed, 91 insertions(+), 64 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -126,6 +126,38 @@ static int cpu_function_call(int cpu, re
 	return data.ret;
 }
 
+/*
+ * On task ctx scheduling...
+ *
+ * When !ctx->nr_events a task context will not be scheduled. This means
+ * we can disable the scheduler hooks (for performance) without leaving
+ * pending task ctx state.
+ *
+ * This however results in two special cases:
+ *
+ *  - removing the last event from a task ctx; this is relatively straight
+ *    forward and is done in __perf_remove_from_context.
+ *
+ *  - adding the first event to a task ctx; this is tricky because we cannot
+ *    rely on ctx->is_active and therefore cannot use event_function_call().
+ *    See perf_install_in_context().
+ *
+ * This is because we need a ctx->lock serialized variable (ctx->is_active)
+ * to reliably determine if a particular task/context is scheduled in. The
+ * task_curr() use in task_function_call() is racy in that a remote context
+ * switch is not a single atomic operation.
+ *
+ * As is, the situation is 'safe' because we set rq->curr before we do the
+ * actual context switch. This means that task_curr() will fail early, but
+ * we'll continue spinning on ctx->is_active until we've passed
+ * perf_event_task_sched_out().
+ *
+ * Without this ctx->lock serialized variable we could have race where we find
+ * the task (and hence the context) would not be active while in fact they are.
+ *
+ * If ctx->nr_events, then ctx->is_active and cpuctx->task_ctx are set.
+ */
+
 static void event_function_call(struct perf_event *event,
 				int (*active)(void *),
 				void (*inactive)(void *),
@@ -1686,9 +1718,13 @@ static int __perf_remove_from_context(vo
 	if (re->detach_group)
 		perf_group_detach(event);
 	list_del_event(event, ctx);
-	if (!ctx->nr_events && cpuctx->task_ctx == ctx) {
+
+	if (!ctx->nr_events && ctx->is_active) {
 		ctx->is_active = 0;
-		cpuctx->task_ctx = NULL;
+		if (ctx->task) {
+			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
+			cpuctx->task_ctx = NULL;
+		}
 	}
 	raw_spin_unlock(&ctx->lock);
 
@@ -2056,18 +2092,6 @@ static void perf_event_sched_in(struct p
 		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
 }
 
-static void ___perf_install_in_context(void *info)
-{
-	struct perf_event *event = info;
-	struct perf_event_context *ctx = event->ctx;
-
-	/*
-	 * Since the task isn't running, its safe to add the event, us holding
-	 * the ctx->lock ensures the task won't get scheduled in.
-	 */
-	add_event_to_ctx(event, ctx);
-}
-
 static void ctx_resched(struct perf_cpu_context *cpuctx,
 			struct perf_event_context *task_ctx)
 {
@@ -2086,55 +2110,27 @@ static void ctx_resched(struct perf_cpu_
  */
 static int  __perf_install_in_context(void *info)
 {
-	struct perf_event *event = info;
-	struct perf_event_context *ctx = event->ctx;
+	struct perf_event_context *ctx = info;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
-	struct task_struct *task = current;
 
-	perf_ctx_lock(cpuctx, task_ctx);
-	perf_pmu_disable(cpuctx->ctx.pmu);
-
-	/*
-	 * If there was an active task_ctx schedule it out.
-	 */
-	if (task_ctx)
-		task_ctx_sched_out(cpuctx, task_ctx);
+	if (ctx->task) {
+		/*
+		 * If we hit the 'wrong' task, we've since scheduled and
+		 * everything should be sorted, nothing to do!
+		 */
+		if (ctx->task != current)
+			return 0;
 
-	/*
-	 * If the context we're installing events in is not the
-	 * active task_ctx, flip them.
-	 */
-	if (ctx->task && task_ctx != ctx) {
-		if (task_ctx)
-			raw_spin_unlock(&task_ctx->lock);
-		raw_spin_lock(&ctx->lock);
+		/*
+		 * If task_ctx is set, it had better be to us.
+		 */
+		WARN_ON_ONCE(cpuctx->task_ctx != ctx && cpuctx->task_ctx);
 		task_ctx = ctx;
 	}
 
-	if (task_ctx) {
-		cpuctx->task_ctx = task_ctx;
-		task = task_ctx->task;
-	}
-
-	cpu_ctx_sched_out(cpuctx, EVENT_ALL);
-
-	update_context_time(ctx);
-	/*
-	 * update cgrp time only if current cgrp
-	 * matches event->cgrp. Must be done before
-	 * calling add_event_to_ctx()
-	 */
-	update_cgrp_time_from_event(event);
-
-	add_event_to_ctx(event, ctx);
-
-	/*
-	 * Schedule everything back in
-	 */
-	perf_event_sched_in(cpuctx, task_ctx, task);
-
-	perf_pmu_enable(cpuctx->ctx.pmu);
+	perf_ctx_lock(cpuctx, task_ctx);
+	ctx_resched(cpuctx, task_ctx);
 	perf_ctx_unlock(cpuctx, task_ctx);
 
 	return 0;
@@ -2148,14 +2144,38 @@ perf_install_in_context(struct perf_even
 			struct perf_event *event,
 			int cpu)
 {
+	struct task_struct *task = NULL;
+
 	lockdep_assert_held(&ctx->mutex);
 
 	event->ctx = ctx;
 	if (event->cpu != -1)
 		event->cpu = cpu;
 
-	event_function_call(event, __perf_install_in_context,
-			    ___perf_install_in_context, event);
+	/*
+	 * Installing events is tricky because we cannot rely on ctx->is_active
+	 * to be set in case this is the nr_events 0 -> 1 transition.
+	 *
+	 * So what we do is we add the event to the list here, which will allow
+	 * a future context switch to DTRT and then send a racy IPI. If the IPI
+	 * fails to hit the right task, this means a context switch must have
+	 * happened and that will have taken care of business.
+	 */
+	raw_spin_lock_irq(&ctx->lock);
+	update_context_time(ctx);
+	/*
+	 * Update cgrp time only if current cgrp matches event->cgrp.
+	 * Must be done before calling add_event_to_ctx().
+	 */
+	update_cgrp_time_from_event(event);
+	add_event_to_ctx(event, ctx);
+	task = ctx->task;
+	raw_spin_unlock_irq(&ctx->lock);
+
+	if (task)
+		task_function_call(task, __perf_install_in_context, ctx);
+	else
+		cpu_function_call(cpu, __perf_install_in_context, ctx);
 }
 
 /*
@@ -2328,6 +2348,16 @@ static void ctx_sched_out(struct perf_ev
 
 	lockdep_assert_held(&ctx->lock);
 
+	if (likely(!ctx->nr_events)) {
+		/*
+		 * See __perf_remove_from_context().
+		 */
+		WARN_ON_ONCE(ctx->is_active);
+		if (ctx->task)
+			WARN_ON_ONCE(cpuctx->task_ctx);
+		return;
+	}
+
 	ctx->is_active &= ~event_type;
 	if (ctx->task) {
 		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
@@ -2335,9 +2365,6 @@ static void ctx_sched_out(struct perf_ev
 			cpuctx->task_ctx = NULL;
 	}
 
-	if (likely(!ctx->nr_events))
-		return;
-
 	update_context_time(ctx);
 	update_cgrp_time_from_cpuctx(cpuctx);
 	if (!ctx->nr_active)
@@ -2716,6 +2743,9 @@ ctx_sched_in(struct perf_event_context *
 
 	lockdep_assert_held(&ctx->lock);
 
+	if (likely(!ctx->nr_events))
+		return;
+
 	ctx->is_active |= event_type;
 	if (ctx->task) {
 		if (!is_active)
@@ -2724,9 +2754,6 @@ ctx_sched_in(struct perf_event_context *
 			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
 	}
 
-	if (likely(!ctx->nr_events))
-		return;
-
 	now = perf_clock();
 	ctx->timestamp = now;
 	perf_cgroup_set_timestamp(task, ctx);

  parent reply	other threads:[~2016-01-11 16:36 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-11 16:24 [RFC][PATCH 00/12] various perf fixes Peter Zijlstra
2016-01-11 16:24 ` [RFC][PATCH 01/12] perf: Add lockdep assertions Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 02/12] perf: Fix cgroup event scheduling Peter Zijlstra
2016-01-11 19:43   ` Stephane Eranian
2016-01-11 22:03     ` Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 03/12] perf: Fix cgroup scheduling in enable_on_exec Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 04/12] perf: Remove stale comment Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 05/12] perf: Fix enable_on_exec event scheduling Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 06/12] perf: Use task_ctx_sched_out() Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 07/12] perf: Simplify/fix perf_event_enable() event scheduling Peter Zijlstra
2016-03-08 10:04   ` James Morse
2016-03-08 10:26     ` Peter Zijlstra
2016-03-08 10:42       ` James Morse
2016-03-08 11:29         ` Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 08/12] perf: Optimize perf_sched_events usage Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 09/12] perf: Make ctx->is_active and cpuctx->task_ctx consistent Peter Zijlstra
2016-01-11 16:25 ` Peter Zijlstra [this message]
2016-01-11 16:25 ` [RFC][PATCH 11/12] perf: Specialize perf_event_exit_task() Peter Zijlstra
2016-01-11 16:25 ` [RFC][PATCH 12/12] perf: Collapse and fix event_function_call() users Peter Zijlstra
2016-01-13 10:50   ` Peter Zijlstra
2016-01-13 13:46     ` Alexander Shishkin
2016-01-13 17:30       ` Peter Zijlstra
2016-01-13 15:00   ` Alexander Shishkin
2016-01-13 15:38     ` Alexander Shishkin
2016-01-13 18:10     ` Peter Zijlstra
2016-01-13 20:44       ` Peter Zijlstra
2016-01-14 10:44     ` Peter Zijlstra
2016-01-14 16:30       ` Peter Zijlstra
2016-02-17 22:38   ` Sasha Levin
2016-02-18  7:46     ` Peter Zijlstra
2016-02-18 12:37       ` Peter Zijlstra
2016-01-11 18:44 ` [RFC][PATCH 00/12] various perf fixes Dmitry Vyukov
2016-01-11 19:56   ` Andi Kleen
2016-01-11 22:00   ` Peter Zijlstra
2016-01-12  9:59     ` Ingo Molnar
2016-01-12 10:11 ` Ingo Molnar
2016-01-12 10:57   ` Dmitry Vyukov
2016-01-12 11:00     ` Dmitry Vyukov
2016-01-12 11:01       ` Dmitry Vyukov
2016-01-12 11:26         ` Dmitry Vyukov
2016-01-12 11:35           ` Dmitry Vyukov
2016-01-12 12:01           ` Peter Zijlstra
2016-01-13 15:18           ` Alexander Shishkin
2016-01-13 15:22             ` Dmitry Vyukov
2016-01-13 15:35               ` Alexander Shishkin
2016-01-14  9:35           ` Peter Zijlstra
2016-01-14 10:05             ` Dmitry Vyukov
2016-02-15 15:04               ` Dmitry Vyukov
2016-02-15 15:38                 ` Peter Zijlstra
2016-02-15 15:47                   ` Dmitry Vyukov
2016-02-15 16:01                   ` Vince Weaver
2016-02-15 16:17               ` Peter Zijlstra
2016-02-15 16:29                 ` Dmitry Vyukov
2016-02-15 16:29               ` Peter Zijlstra
2016-02-15 16:35                 ` Dmitry Vyukov
2016-02-15 16:38                   ` Dmitry Vyukov
2016-02-15 16:52                     ` Peter Zijlstra
2016-02-15 17:04                       ` Dmitry Vyukov
2016-02-15 17:07                         ` Peter Zijlstra
2016-02-15 17:45                           ` Dmitry Vyukov
2016-02-15 18:01                             ` Peter Zijlstra
2016-02-15 18:33                               ` Dmitry Vyukov
2016-02-15 16:41                   ` Peter Zijlstra
2016-02-15 16:54                     ` Dmitry Vyukov
2016-02-15 16:59                       ` Peter Zijlstra
2016-01-12 13:13   ` Peter Zijlstra
2016-01-12 13:31     ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160111163229.293220067@infradead.org \
    --to=peterz@infradead.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=andi@firstfloor.org \
    --cc=dvyukov@google.com \
    --cc=eranian@google.com \
    --cc=jolsa@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=vince@deater.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.