Date: Wed, 18 Oct 2017 03:07:48 +0200
From: Andrea Parri
To: "Paul E. McKenney"
Cc: j.alglave@ucl.ac.uk, luc.maranget@inria.fr, stern@rowland.harvard.edu,
	dhowells@redhat.com, peterz@infradead.org, will.deacon@arm.com,
	boqun.feng@gmail.com, npiggin@gmail.com, linux-kernel@vger.kernel.org
Subject: Re: Memory-ordering recipes
Message-ID: <20171018010748.GA4017@andrea>
References: <20170917230509.GA21394@linux.vnet.ibm.com>
 <20171017210137.GA12700@linux.vnet.ibm.com>
In-Reply-To: <20171017210137.GA12700@linux.vnet.ibm.com>

On Tue, Oct 17, 2017 at 02:01:37PM -0700, Paul E. McKenney wrote:
> On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > The topic of memory-ordering recipes came up at the Linux Plumbers
> > Conference microconference on Friday, so I thought that I should summarize
> > what is currently "out there":
>
> And here is an updated list of potential Linux-kernel examples for a
> "recipes" document, and thank you for the feedback.  Please feel free
> to counterpropose better examples.  In addition, if there is some other
> pattern that is commonplace and important enough to be included in a
> recipes document, please point it out.
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
>
> 0.	Simple special cases
>
> 	If there is only one CPU on the one hand or only one variable
> 	on the other, the code will execute in order.  There are (as
> 	usual) some things to be careful of:
>
> 	a.	There are some aspects of the C language that are
> 		unordered.  For example, in the expression "f(x) + g(y)",
> 		the order in which f and g are called is not defined;
> 		the object code is allowed to use either order or even
> 		to interleave the computations.
>
> 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> 		a compiler can emit whatever code it likes, as long as
> 		the results of a single-threaded execution appear just
> 		as if the compiler had followed all the relevant rules.
> 		To see this, compile with a high level of optimization
> 		and run the debugger on the resulting binary.
>
> 	c.	If there is only one variable but multiple CPUs, that
> 		variable must be properly aligned and all accesses to
> 		it must be full sized.  Variables that straddle
> 		cachelines or pages void your full-ordering warranty,
> 		as do undersized accesses that load from or store to
> 		only part of the variable.
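
As a quick user-space illustration of (a) -- deliberately not a kernel
example -- the following sketch prints 1 or 3 depending on which call
the compiler chooses to evaluate first:

	#include <stdio.h>

	static int a;

	static int f(int x) { a = x; return 0; }
	static int g(int y) { return a + y; }

	int main(void)
	{
		/*
		 * Unspecified evaluation order: if f(2) runs first,
		 * g(1) returns 3 and the sum is 3; if g(1) runs
		 * first, it returns 1 and the sum is 1.
		 */
		printf("%d\n", f(2) + g(1));
		return 0;
	}

Compiling at different optimization levels (or with different
compilers) can flip the choice, which also illustrates (b).
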
> 1.	Another simple case: Locking.  [ Assuming you don't think too
> 	hard about it, that is! ]
>
> 	Any CPU that has acquired a given lock sees any changes made
> 	by any CPU prior to that CPU's release of that same lock.
>
> 	[ Should we discuss chaining back through different locks,
> 	  sort of like release-acquire chains? ]
>
> 2.	MP (see test6.pdf for nickname translation)
>
> 	a.	smp_store_release() / smp_load_acquire()
>
> 		init_stack_slab() in lib/stackdepot.c uses release-acquire
> 		to handle initialization of a slab of the stack.  Working
> 		out the mutual-exclusion design is left as an exercise for
> 		the reader.
>
> 	b.	rcu_assign_pointer() / rcu_dereference()
>
> 		expand_to_next_prime() does the rcu_assign_pointer(),
> 		and next_prime_number() does the rcu_dereference().
> 		This mediates access to a bit vector that is expanded
> 		as additional primes are needed.  These two functions
> 		are in lib/prime_numbers.c.
>
> 	c.	smp_wmb() / smp_rmb()
>
> 		xlog_state_switch_iclogs() contains the following:
>
> 			log->l_curr_block -= log->l_logBBsize;
> 			ASSERT(log->l_curr_block >= 0);
> 			smp_wmb();
> 			log->l_curr_cycle++;
>
> 		And xlog_valid_lsn() contains the following:
>
> 			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> 			smp_rmb();
> 			cur_block = ACCESS_ONCE(log->l_curr_block);
>
> 		Alternatively, from the comment in perf_output_put_handle()
> 		in kernel/events/ring_buffer.c:
>
> 		 *   kernel				user
> 		 *
> 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> 		 *			(A)		smp_rmb()	(C)
> 		 *	STORE $data			LOAD $data
> 		 *	smp_wmb()	(B)		smp_mb()	(D)
> 		 *	STORE ->data_head		STORE ->data_tail
> 		 *   }
> 		 *
> 		 * Where A pairs with D, and B pairs with C.
>
> 		The B/C pairing is MP with smp_wmb() and smp_rmb().
>
> 	d.	Replacing either of the above with smp_mb()
>
> 		Holding off on this one for the moment...
>
> 3.	LB
>
> 	a.	LB+ctrl+mb
>
> 		Again, from the comment in perf_output_put_handle()
> 		in kernel/events/ring_buffer.c:
>
> 		 *   kernel				user
> 		 *
> 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> 		 *			(A)		smp_rmb()	(C)
> 		 *	STORE $data			LOAD $data
> 		 *	smp_wmb()	(B)		smp_mb()	(D)
> 		 *	STORE ->data_head		STORE ->data_tail
> 		 *   }
> 		 *
> 		 * Where A pairs with D, and B pairs with C.
>
> 		The A/D pairing covers this one.
>
> 4.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
>
> 	Lots of variety here; one can in some cases substitute:
>
> 	a.	READ_ONCE() for smp_load_acquire()
> 	b.	WRITE_ONCE() for smp_store_release()
> 	c.	Dependencies for both smp_load_acquire() and
> 		smp_store_release().
> 	d.	smp_wmb() for smp_store_release() in first thread
> 		of ISA2 and Z6.2.
> 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
>
> 	The canonical illustration of LB involves the various memory
> 	allocators, where you don't want a load from about-to-be-freed
> 	memory to see a store initializing a later incarnation of that
> 	same memory area.  But the per-CPU caches make this a very
> 	long and complicated example.
>
> 	I am not aware of any three-CPU release-acquire chains in the
> 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> 	but these are not at all simple, either.
>
> 	Thoughts?
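
One thought: even in the absence of a good kernel example, the recipes
document could show the three-CPU shape itself.  Here is a minimal ISA2
sketch in the litmus syntax of the C-peterz test later in this message
(the test and its name are mine); the "exists" clause should not be
reachable:

	C ISA2

	{
	}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		smp_store_release(y, 1);
	}

	P1(int *y, int *z)
	{
		int r0;

		r0 = smp_load_acquire(y);
		smp_store_release(z, 1);
	}

	P2(int *x, int *z)
	{
		int r0;
		int r1;

		r0 = smp_load_acquire(z);
		r1 = READ_ONCE(*x);
	}

	exists
	(1:r0=1 /\ 2:r0=1 /\ 2:r1=0)

Some of the substitutions (a)-(e) above can then be applied directly
to this skeleton.
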
> 5.	SB
>
> 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> 		And as in sys_membarrier()-scheduler coordination,
> 		for that matter.
>
> 		Examples seem to be lacking.  Most cases use locking.
> 		Here is one rather strange one from RCU:
>
> 		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> 		{
> 			unsigned long flags;
> 			bool needwake;
> 			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
>
> 			rhp->next = NULL;
> 			rhp->func = func;
> 			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> 			needwake = !rcu_tasks_cbs_head;
> 			*rcu_tasks_cbs_tail = rhp;
> 			rcu_tasks_cbs_tail = &rhp->next;
> 			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> 			/* We can't create the thread unless interrupts are enabled. */
> 			if ((needwake && havetask) ||
> 			    (!havetask && !irqs_disabled_flags(flags))) {
> 				rcu_spawn_tasks_kthread();
> 				wake_up(&rcu_tasks_cbs_wq);
> 			}
> 		}
>
> 		And for the wait side, using synchronize_sched() to supply
> 		the barrier for both ends, with the preemption disabling
> 		due to raw_spin_lock_irqsave() serving as the read-side
> 		critical section:
>
> 			if (!list) {
> 				wait_event_interruptible(rcu_tasks_cbs_wq,
> 							 rcu_tasks_cbs_head);
> 				if (!rcu_tasks_cbs_head) {
> 					WARN_ON(signal_pending(current));
> 					schedule_timeout_interruptible(HZ/10);
> 				}
> 				continue;
> 			}
> 			synchronize_sched();
>
> 		-----------------
>
> 		Here is another one that uses atomic_cmpxchg() as a
> 		full memory barrier:
>
> 			if (!wait_event_timeout(*wait, !atomic_read(stopping),
> 						msecs_to_jiffies(1000))) {
> 				atomic_set(stopping, 0);
> 				smp_mb();
> 				return -ETIMEDOUT;
> 			}
>
> 			int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> 							     atomic_t *stopping)
> 			{
> 				if (atomic_cmpxchg(stopping, 1, 0)) {
> 					wake_up(wait);
> 					return 1;
> 				}
>
> 				return 0;
> 			}
>
> 		-----------------
>
> 		And here is the generic pattern for the above two examples
> 		taken from waitqueue_active() in include/linux/wait.h:
>
> 		 *      CPU0 - waker                    CPU1 - waiter
> 		 *
> 		 *                                      for (;;) {
> 		 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
> 		 *      smp_mb();                         // smp_mb() from set_current_state()
> 		 *      if (waitqueue_active(wq_head))    if (@cond)
> 		 *        wake_up(wq_head);                 break;
> 		 *                                        schedule();
> 		 *                                      }
> 		 *                                      finish_wait(&wq_head, &wait);
>
> 		Note that prepare_to_wait() does both the write and the
> 		set_current_state() that contains the smp_mb().  The read
> 		is the waitqueue_active() on the one hand and the
> 		"if (@cond)" on the other.
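
The generic pattern above is SB.  For the recipes document, a minimal
litmus rendering (mine, with @cond as "x" and the wait-queue occupancy
as "y") could be the following; its "exists" clause -- the missed
wakeup -- should not be reachable:

	C SB+mb+mb

	{
	}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);	/* @cond = true */
		smp_mb();
		r0 = READ_ONCE(*y);	/* waitqueue_active() */
	}

	P1(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*y, 1);	/* enqueue the waiter */
		smp_mb();		/* from set_current_state() */
		r0 = READ_ONCE(*x);	/* if (@cond) */
	}

	exists
	(0:r0=0 /\ 1:r0=0)
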
> 6.	W+RWC+porel+mb+mb
>
> 	See recipes-LKcode-63cae12bce986.txt.
>
> 	Mostly of historical interest -- as far as I know, this commit
> 	was the first to contain a litmus test.
>
> 7.	Context switch and migration.  A bit specialized, so might leave
> 	this one out.
>
> 	When a thread moves from one CPU to another, the scheduler is
> 	required to do whatever is necessary for the thread to see any
> 	prior accesses that it executed on other CPUs.  This includes
> 	"interesting" interactions with wake_up() shown in the following
> 	comment from try_to_wake_up() in kernel/sched/core.c:
>
>  * Notes on Program-Order guarantees on SMP systems.
>  *
>  *  MIGRATION
>  *
>  * The basic program-order guarantee on SMP systems is that when a task [t]
>  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
>  * execution on its new CPU [c1].
>  *
>  * For migration (of runnable tasks) this is provided by the following means:
>  *
>  *  A) UNLOCK of the rq(c0)->lock scheduling out task t
>  *  B) migration for t is required to synchronize *both* rq(c0)->lock and
>  *     rq(c1)->lock (if not at the same time, then in that order).
>  *  C) LOCK of the rq(c1)->lock scheduling in task
>  *
>  * Transitivity guarantees that B happens after A and C after B.
>  * Note: we only require RCpc transitivity.
>  * Note: the CPU doing B need not be c0 or c1
>  *
>  * Example:
>  *
>  *   CPU0            CPU1            CPU2
>  *
>  *   LOCK rq(0)->lock
>  *   sched-out X
>  *   sched-in Y
>  *   UNLOCK rq(0)->lock
>  *
>  *                                   LOCK rq(0)->lock // orders against CPU0
>  *                                   dequeue X
>  *                                   UNLOCK rq(0)->lock
>  *
>  *                                   LOCK rq(1)->lock
>  *                                   enqueue X
>  *                                   UNLOCK rq(1)->lock
>  *
>  *                   LOCK rq(1)->lock // orders against CPU2
>  *                   sched-out Z
>  *                   sched-in X
>  *                   UNLOCK rq(1)->lock

We might want to "specialize" this snippet to emphasize (or illustrate)
a particular _cycle_; one possible way of "closing the chain" would be:

   CPU0

	smp_store_release(&X->on_cpu, 0);	/* from sched-out X */
	UNLOCK rq(0)->lock

   CPU2

	LOCK rq(0)->lock
	UNLOCK rq(1)->lock

   CPU1

	LOCK rq(1)->lock
	X->on_cpu = 1;				/* from sched-in X */

   BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(cf. Z6.2 in test6.pdf)

>  *
>  *
>  *  BLOCKING -- aka. SLEEP + WAKEUP
>  *
>  * For blocking we (obviously) need to provide the same guarantee as for
>  * migration. However the means are completely different as there is no lock
>  * chain to provide order. Instead we do:
>  *
>  *   1) smp_store_release(X->on_cpu, 0)
>  *   2) smp_cond_load_acquire(!X->on_cpu)
>  *
>  * Example:
>  *
>  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
>  *
>  *   LOCK rq(0)->lock LOCK X->pi_lock
>  *   dequeue X
>  *   sched-out X
>  *   smp_store_release(X->on_cpu, 0);
>  *
>  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
>  *                    X->state = WAKING
>  *                    set_task_cpu(X,2)
>  *
>  *                    LOCK rq(2)->lock
>  *                    enqueue X
>  *                    X->state = RUNNING
>  *                    UNLOCK rq(2)->lock
>  *
>  *                                          LOCK rq(2)->lock // orders against CPU1
>  *                                          sched-out Z
>  *                                          sched-in X
>  *                                          UNLOCK rq(2)->lock
>  *
>  *                    UNLOCK X->pi_lock
>  *   UNLOCK rq(0)->lock

Similarly:

   CPU0

	smp_store_release(&X->on_cpu, 0);	/* from sched-out X */

   CPU1

	smp_cond_load_acquire(&X->on_cpu, !VAL);
	UNLOCK rq(2)->lock

   CPU2

	LOCK rq(2)->lock
	X->on_cpu = 1;				// from sched-in X

   BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(cf. WWC in test6.pdf)
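
As a litmus test -- approximating smp_cond_load_acquire() with
smp_load_acquire() plus the 1:r0=0 condition, and modeling the RCpc
lock handoff with release-acquire, as the commit message below also
does (the test and its names are mine) -- this could read as follows,
the "exists" clause being the BUG_ON above:

	C WWC+sched

	{
		on_cpu=1;
	}

	P0(int *on_cpu)
	{
		smp_store_release(on_cpu, 0);	/* sched-out X */
	}

	P1(int *on_cpu, int *rq2_lock)
	{
		int r0;

		r0 = smp_load_acquire(on_cpu);	/* smp_cond_load_acquire() */
		smp_store_release(rq2_lock, 1);	/* UNLOCK rq(2)->lock */
	}

	P2(int *on_cpu, int *rq2_lock)
	{
		int r0;

		r0 = smp_load_acquire(rq2_lock);	/* LOCK rq(2)->lock */
		WRITE_ONCE(*on_cpu, 1);			/* sched-in X */
	}

	exists
	(1:r0=0 /\ 2:r0=1 /\ on_cpu=0)
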
Andrea

>  *
>  *
>  * However; for wakeups there is a second guarantee we must provide, namely we
>  * must observe the state that lead to our wakeup. That is, not only must our
>  * task observe its own prior state, it must also observe the stores prior to
>  * its wakeup.
>  *
>  * This means that any means of doing remote wakeups must order the CPU doing
>  * the wakeup against the CPU the task is going to end up running on. This,
>  * however, is already required for the regular Program-Order guarantee above,
>  * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
>
> commit 63cae12bce9861cec309798d34701cf3da20bc71
> Author: Peter Zijlstra
> Date:   Fri Dec 9 14:59:00 2016 +0100
>
>     perf/core: Fix sys_perf_event_open() vs. hotplug
>
>     There is problem with installing an event in a task that is 'stuck' on
>     an offline CPU.
>
>     Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
>     blocked task doesn't run and doesn't require a CPU etc.. Only on
>     wakeup do we ammend the situation and place the task on a available
>     CPU.
>
>     If we hit such a task with perf_install_in_context() we'll loop until
>     either that task wakes up or the CPU comes back online, if the task
>     waking depends on the event being installed, we're stuck.
>
>     While looking into this issue, I also spotted another problem, if we
>     hit a task with perf_install_in_context() that is in the middle of
>     being migrated, that is we observe the old CPU before sending the IPI,
>     but run the IPI (on the old CPU) while the task is already running on
>     the new CPU, things also go sideways.
>
>     Rework things to rely on task_curr() -- outside of rq->lock -- which
>     is rather tricky. Imagine the following scenario where we're trying to
>     install the first event into our task 't':
>
>     CPU0            CPU1            CPU2
>
>                     (current == t)
>
>                     t->perf_event_ctxp[] = ctx;
>                     smp_mb();
>                     cpu = task_cpu(t);
>
>                     switch(t, n);
>                                     migrate(t, 2);
>                     switch(p, t);
>
>                                     ctx = t->perf_event_ctxp[]; // must not be NULL
>
>     smp_function_call(cpu, ..);
>
>                     generic_exec_single()
>                       func();
>                         spin_lock(ctx->lock);
>                         if (task_curr(t)) // false
>
>                         add_event_to_ctx();
>                         spin_unlock(ctx->lock);
>
>                                     perf_event_context_sched_in();
>                                       spin_lock(ctx->lock);
>                                       // sees event
>
>     So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
>     Because if CPU2's load of that variable were to observe NULL, it would
>     not try to schedule the ctx and we'd have a task running without its
>     counter, which would be 'bad'.
>
>     As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
>     first and not see the event yet, then CPU0 must observe task_curr()
>     and retry. If the install happens first, then we must see the event on
>     sched-in and all is well.
>
>     I think we can translate the first part (until the 'must not be NULL')
>     of the scenario to a litmus test like:
>
>       C C-peterz
>
>       {
>       }
>
>       P0(int *x, int *y)
>       {
>               int r1;
>
>               WRITE_ONCE(*x, 1);
>               smp_mb();
>               r1 = READ_ONCE(*y);
>       }
>
>       P1(int *y, int *z)
>       {
>               WRITE_ONCE(*y, 1);
>               smp_store_release(z, 1);
>       }
>
>       P2(int *x, int *z)
>       {
>               int r1;
>               int r2;
>
>               r1 = smp_load_acquire(z);
>               smp_mb();
>               r2 = READ_ONCE(*x);
>       }
>
>       exists
>       (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>
>     Where:
>       x is perf_event_ctxp[],
>       y is our tasks's CPU, and
>       z is our task being placed on the rq of CPU2.
>
>     The P0 smp_mb() is the one added by this patch, ordering the store to
>     perf_event_ctxp[] from find_get_context() and the load of task_cpu()
>     in task_function_call().
>
>     The smp_store_release/smp_load_acquire model the RCpc locking of the
>     rq->lock and the smp_mb() of P2 is the context switch switching from
>     whatever CPU2 was running to our task 't'.
>
>     This litmus test evaluates into:
>
>       Test C-peterz Allowed
>       States 7
>       0:r1=0; 2:r1=0; 2:r2=0;
>       0:r1=0; 2:r1=0; 2:r2=1;
>       0:r1=0; 2:r1=1; 2:r2=1;
>       0:r1=1; 2:r1=0; 2:r2=0;
>       0:r1=1; 2:r1=0; 2:r2=1;
>       0:r1=1; 2:r1=1; 2:r2=0;
>       0:r1=1; 2:r1=1; 2:r2=1;
>       No
>       Witnesses
>       Positive: 0 Negative: 7
>       Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>       Observation C-peterz Never 0 7
>       Hash=e427f41d9146b2a5445101d3e2fcaa34
>
>     And the strong and weak model agree.
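
A quick way to see that the P0 smp_mb() is what breaks this cycle:
drop it,

	P0(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*x, 1);
		/* smp_mb() removed */
		r1 = READ_ONCE(*y);
	}

and the "exists" clause should become reachable, the store to x and
the load from y being no longer ordered on P0.
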
>     Reported-by: Mark Rutland
>     Tested-by: Mark Rutland
>     Signed-off-by: Peter Zijlstra (Intel)
>     Cc: Alexander Shishkin
>     Cc: Arnaldo Carvalho de Melo
>     Cc: Arnaldo Carvalho de Melo
>     Cc: Jiri Olsa
>     Cc: Linus Torvalds
>     Cc: Peter Zijlstra
>     Cc: Sebastian Andrzej Siewior
>     Cc: Stephane Eranian
>     Cc: Thomas Gleixner
>     Cc: Vince Weaver
>     Cc: Will Deacon
>     Cc: jeremy.linton@arm.com
>     Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
>     Signed-off-by: Ingo Molnar
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ab15509fab8c..72ce7d63e561 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2249,7 +2249,7 @@ static int __perf_install_in_context(void *info)
>  	struct perf_event_context *ctx = event->ctx;
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -	bool activate = true;
> +	bool reprogram = true;
>  	int ret = 0;
>
>  	raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2257,27 +2257,26 @@ static int __perf_install_in_context(void *info)
>  		raw_spin_lock(&ctx->lock);
>  		task_ctx = ctx;
>
> -		/* If we're on the wrong CPU, try again */
> -		if (task_cpu(ctx->task) != smp_processor_id()) {
> -			ret = -ESRCH;
> -			goto unlock;
> -		}
> +		reprogram = (ctx->task == current);
>
>  		/*
> -		 * If we're on the right CPU, see if the task we target is
> -		 * current, if not we don't have to activate the ctx, a future
> -		 * context switch will do that for us.
> +		 * If the task is running, it must be running on this CPU,
> +		 * otherwise we cannot reprogram things.
> +		 *
> +		 * If its not running, we don't care, ctx->lock will
> +		 * serialize against it becoming runnable.
>  		 */
> -		if (ctx->task != current)
> -			activate = false;
> -		else
> -			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +		if (task_curr(ctx->task) && !reprogram) {
> +			ret = -ESRCH;
> +			goto unlock;
> +		}
>
> +		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>  	} else if (task_ctx) {
>  		raw_spin_lock(&task_ctx->lock);
>  	}
>
> -	if (activate) {
> +	if (reprogram) {
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  		add_event_to_ctx(event, ctx);
>  		ctx_resched(cpuctx, task_ctx);
> @@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	/*
>  	 * Installing events is tricky because we cannot rely on ctx->is_active
>  	 * to be set in case this is the nr_events 0 -> 1 transition.
> +	 *
> +	 * Instead we use task_curr(), which tells us if the task is running.
> +	 * However, since we use task_curr() outside of rq::lock, we can race
> +	 * against the actual state. This means the result can be wrong.
> +	 *
> +	 * If we get a false positive, we retry, this is harmless.
> +	 *
> +	 * If we get a false negative, things are complicated. If we are after
> +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +	 * value must be correct. If we're before, it doesn't matter since
> +	 * perf_event_context_sched_in() will program the counter.
> +	 *
> +	 * However, this hinges on the remote context switch having observed
> +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> +	 * ctx::lock in perf_event_context_sched_in().
> +	 *
> +	 * We do this by task_function_call(), if the IPI fails to hit the task
> +	 * we know any future context switch of task must see the
> +	 * perf_event_ctpx[] store.
>  	 */
> -again:
> +
>  	/*
> -	 * Cannot use task_function_call() because we need to run on the task's
> -	 * CPU regardless of whether its current or not.
> +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +	 * task_cpu() load, such that if the IPI then does not find the task
> +	 * running, a future context switch of that task must observe the
> +	 * store.
>  	 */
> -	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +	smp_mb();
> +again:
> +	if (!task_function_call(task, __perf_install_in_context, event))
>  		return;
>
>  	raw_spin_lock_irq(&ctx->lock);
> @@ -2348,12 +2370,16 @@ again:
>  		raw_spin_unlock_irq(&ctx->lock);
>  		return;
>  	}
> -	raw_spin_unlock_irq(&ctx->lock);
>  	/*
> -	 * Since !ctx->is_active doesn't mean anything, we must IPI
> -	 * unconditionally.
> +	 * If the task is not running, ctx->lock will avoid it becoming so,
> +	 * thus we can safely install the event.
>  	 */
> -	goto again;
> +	if (task_curr(task)) {
> +		raw_spin_unlock_irq(&ctx->lock);
> +		goto again;
> +	}
> +	add_event_to_ctx(event, ctx);
> +	raw_spin_unlock_irq(&ctx->lock);
>  }
>
>  /*