Date: Wed, 18 Oct 2017 03:07:48 +0200
From: Andrea Parri
To: "Paul E. McKenney"
Cc: j.alglave@ucl.ac.uk, luc.maranget@inria.fr, stern@rowland.harvard.edu,
	dhowells@redhat.com, peterz@infradead.org, will.deacon@arm.com,
	boqun.feng@gmail.com, npiggin@gmail.com, linux-kernel@vger.kernel.org
Subject: Re: Memory-ordering recipes
Message-ID: <20171018010748.GA4017@andrea>
References: <20170917230509.GA21394@linux.vnet.ibm.com>
 <20171017210137.GA12700@linux.vnet.ibm.com>
In-Reply-To: <20171017210137.GA12700@linux.vnet.ibm.com>

On Tue, Oct 17, 2017 at 02:01:37PM -0700, Paul E. McKenney wrote:
> On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > The topic of memory-ordering recipes came up at the Linux Plumbers
> > Conference microconference on Friday, so I thought that I should summarize
> > what is currently "out there":
>
> And here is an updated list of potential Linux-kernel examples for a
> "recipes" document, and thank you for the feedback.  Please feel free
> to counterpropose better examples.  In addition, if there is some other
> pattern that is commonplace and important enough to be included in a
> recipes document, please point it out.
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
>
> 0.	Simple special cases
>
> 	If there is only one CPU on the one hand or only one variable
> 	on the other, the code will execute in order.  There are (as
> 	usual) some things to be careful of:
>
> 	a.	There are some aspects of the C language that are
> 		unordered.  For example, in the expression "f(x) + g(y)",
> 		the order in which f and g are called is not defined;
> 		the object code is allowed to use either order or even
> 		to interleave the computations.
>
> 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> 		a compiler can emit whatever code it likes, as long as
> 		the results of a single-threaded execution appear just
> 		as if the compiler had followed all the relevant rules.
> 		To see this, compile with a high level of optimization
> 		and run the debugger on the resulting binary.
>
> 	c.	If there is only one variable but multiple CPUs, that
> 		variable must be properly aligned and all accesses to
> 		it must be full sized.  Variables that straddle
> 		cachelines or pages void your full-ordering warranty,
> 		as do undersized accesses that load from or store to
> 		only part of the variable.
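
As a quick user-space illustration of (a) -- deliberately not a kernel
example -- the following sketch prints 1 or 3 depending on which call
the compiler chooses to evaluate first:

	#include <stdio.h>

	static int a;

	static int f(int x) { a = x; return 0; }
	static int g(int y) { return a + y; }

	int main(void)
	{
		/*
		 * Unspecified evaluation order: if f(2) runs first,
		 * g(1) returns 3 and the sum is 3; if g(1) runs
		 * first, it returns 1 and the sum is 1.
		 */
		printf("%d\n", f(2) + g(1));
		return 0;
	}

Compiling at different optimization levels (or with different
compilers) can flip the choice, which also illustrates (b).
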
> 1.	Another simple case: Locking.  [ Assuming you don't think too
> 	hard about it, that is! ]
>
> 	Any CPU that has acquired a given lock sees any changes made
> 	by any CPU prior to that CPU's release of that same lock.
>
> 	[ Should we discuss chaining back through different locks,
> 	  sort of like release-acquire chains? ]
>
> 2.	MP (see test6.pdf for nickname translation)
>
> 	a.	smp_store_release() / smp_load_acquire()
>
> 		init_stack_slab() in lib/stackdepot.c uses release-acquire
> 		to handle initialization of a slab of the stack.  Working
> 		out the mutual-exclusion design is left as an exercise for
> 		the reader.
>
> 	b.	rcu_assign_pointer() / rcu_dereference()
>
> 		expand_to_next_prime() does the rcu_assign_pointer(),
> 		and next_prime_number() does the rcu_dereference().
> 		This mediates access to a bit vector that is expanded
> 		as additional primes are needed.  These two functions
> 		are in lib/prime_numbers.c.
>
> 	c.	smp_wmb() / smp_rmb()
>
> 		xlog_state_switch_iclogs() contains the following:
>
> 			log->l_curr_block -= log->l_logBBsize;
> 			ASSERT(log->l_curr_block >= 0);
> 			smp_wmb();
> 			log->l_curr_cycle++;
>
> 		And xlog_valid_lsn() contains the following:
>
> 			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> 			smp_rmb();
> 			cur_block = ACCESS_ONCE(log->l_curr_block);
>
> 		Alternatively, from the comment in perf_output_put_handle()
> 		in kernel/events/ring_buffer.c:
>
> 		 *   kernel				user
> 		 *
> 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> 		 *			(A)		smp_rmb()	(C)
> 		 *	STORE $data			LOAD $data
> 		 *	smp_wmb()	(B)		smp_mb()	(D)
> 		 *	STORE ->data_head		STORE ->data_tail
> 		 *   }
> 		 *
> 		 * Where A pairs with D, and B pairs with C.
>
> 		The B/C pairing is MP with smp_wmb() and smp_rmb().
>
> 	d.	Replacing either of the above with smp_mb()
>
> 		Holding off on this one for the moment...
>
> 3.	LB
>
> 	a.	LB+ctrl+mb
>
> 		Again, from the comment in perf_output_put_handle()
> 		in kernel/events/ring_buffer.c:
>
> 		 *   kernel				user
> 		 *
> 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> 		 *			(A)		smp_rmb()	(C)
> 		 *	STORE $data			LOAD $data
> 		 *	smp_wmb()	(B)		smp_mb()	(D)
> 		 *	STORE ->data_head		STORE ->data_tail
> 		 *   }
> 		 *
> 		 * Where A pairs with D, and B pairs with C.
>
> 		The A/D pairing covers this one.
>
> 4.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
>
> 	Lots of variety here; one can in some cases substitute:
>
> 	a.	READ_ONCE() for smp_load_acquire()
> 	b.	WRITE_ONCE() for smp_store_release()
> 	c.	Dependencies for both smp_load_acquire() and
> 		smp_store_release().
> 	d.	smp_wmb() for smp_store_release() in first thread
> 		of ISA2 and Z6.2.
> 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
>
> 	The canonical illustration of LB involves the various memory
> 	allocators, where you don't want a load from about-to-be-freed
> 	memory to see a store initializing a later incarnation of that
> 	same memory area.  But the per-CPU caches make this a very
> 	long and complicated example.
>
> 	I am not aware of any three-CPU release-acquire chains in the
> 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> 	but these are not at all simple, either.
>
> 	Thoughts?
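
One thought: even in the absence of a good kernel example, the recipes
document could show the three-CPU shape itself.  Here is a minimal ISA2
sketch in the litmus syntax of the C-peterz test later in this message
(the test and its name are mine); the "exists" clause should not be
reachable:

	C ISA2

	{
	}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		smp_store_release(y, 1);
	}

	P1(int *y, int *z)
	{
		int r0;

		r0 = smp_load_acquire(y);
		smp_store_release(z, 1);
	}

	P2(int *x, int *z)
	{
		int r0;
		int r1;

		r0 = smp_load_acquire(z);
		r1 = READ_ONCE(*x);
	}

	exists
	(1:r0=1 /\ 2:r0=1 /\ 2:r1=0)

Some of the substitutions (a)-(e) above can then be applied directly
to this skeleton.
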
> 5.	SB
>
> 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> 		And as in sys_membarrier()-scheduler coordination,
> 		for that matter.
>
> 		Examples seem to be lacking.  Most cases use locking.
> 		Here is one rather strange one from RCU:
>
> 		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> 		{
> 			unsigned long flags;
> 			bool needwake;
> 			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
>
> 			rhp->next = NULL;
> 			rhp->func = func;
> 			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> 			needwake = !rcu_tasks_cbs_head;
> 			*rcu_tasks_cbs_tail = rhp;
> 			rcu_tasks_cbs_tail = &rhp->next;
> 			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> 			/* We can't create the thread unless interrupts are enabled. */
> 			if ((needwake && havetask) ||
> 			    (!havetask && !irqs_disabled_flags(flags))) {
> 				rcu_spawn_tasks_kthread();
> 				wake_up(&rcu_tasks_cbs_wq);
> 			}
> 		}
>
> 		And for the wait side, using synchronize_sched() to supply
> 		the barrier for both ends, with the preemption disabling
> 		due to raw_spin_lock_irqsave() serving as the read-side
> 		critical section:
>
> 			if (!list) {
> 				wait_event_interruptible(rcu_tasks_cbs_wq,
> 							 rcu_tasks_cbs_head);
> 				if (!rcu_tasks_cbs_head) {
> 					WARN_ON(signal_pending(current));
> 					schedule_timeout_interruptible(HZ/10);
> 				}
> 				continue;
> 			}
> 			synchronize_sched();
>
> 		-----------------
>
> 		Here is another one that uses atomic_cmpxchg() as a
> 		full memory barrier:
>
> 			if (!wait_event_timeout(*wait, !atomic_read(stopping),
> 						msecs_to_jiffies(1000))) {
> 				atomic_set(stopping, 0);
> 				smp_mb();
> 				return -ETIMEDOUT;
> 			}
>
> 			int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> 							     atomic_t *stopping)
> 			{
> 				if (atomic_cmpxchg(stopping, 1, 0)) {
> 					wake_up(wait);
> 					return 1;
> 				}
>
> 				return 0;
> 			}
>
> 		-----------------
>
> 		And here is the generic pattern for the above two examples
> 		taken from waitqueue_active() in include/linux/wait.h:
>
> 		 *      CPU0 - waker                    CPU1 - waiter
> 		 *
> 		 *                                      for (;;) {
> 		 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
> 		 *      smp_mb();                         // smp_mb() from set_current_state()
> 		 *      if (waitqueue_active(wq_head))    if (@cond)
> 		 *        wake_up(wq_head);                 break;
> 		 *                                        schedule();
> 		 *                                      }
> 		 *                                      finish_wait(&wq_head, &wait);
>
> 		Note that prepare_to_wait() does both the write and the
> 		set_current_state() that contains the smp_mb().  The read
> 		is the waitqueue_active() on the one hand and the
> 		"if (@cond)" on the other.
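
The generic pattern above is SB.  For the recipes document, a minimal
litmus rendering (mine, with @cond as "x" and the wait-queue occupancy
as "y") could be the following; its "exists" clause -- the missed
wakeup -- should not be reachable:

	C SB+mb+mb

	{
	}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);	/* @cond = true */
		smp_mb();
		r0 = READ_ONCE(*y);	/* waitqueue_active() */
	}

	P1(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*y, 1);	/* enqueue the waiter */
		smp_mb();		/* from set_current_state() */
		r0 = READ_ONCE(*x);	/* if (@cond) */
	}

	exists
	(0:r0=0 /\ 1:r0=0)
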
> 6.	W+RWC+porel+mb+mb
>
> 	See recipes-LKcode-63cae12bce986.txt.
>
> 	Mostly of historical interest -- as far as I know, this commit
> 	was the first to contain a litmus test.
>
> 7.	Context switch and migration.  A bit specialized, so might leave
> 	this one out.
>
> 	When a thread moves from one CPU to another, the scheduler is
> 	required to do whatever is necessary for the thread to see any
> 	prior accesses that it executed on other CPUs.  This includes
> 	"interesting" interactions with wake_up() shown in the following
> 	comment from try_to_wake_up() in kernel/sched/core.c:
>
>  * Notes on Program-Order guarantees on SMP systems.
>  *
>  *  MIGRATION
>  *
>  * The basic program-order guarantee on SMP systems is that when a task [t]
>  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
>  * execution on its new CPU [c1].
>  *
>  * For migration (of runnable tasks) this is provided by the following means:
>  *
>  *  A) UNLOCK of the rq(c0)->lock scheduling out task t
>  *  B) migration for t is required to synchronize *both* rq(c0)->lock and
>  *     rq(c1)->lock (if not at the same time, then in that order).
>  *  C) LOCK of the rq(c1)->lock scheduling in task
>  *
>  * Transitivity guarantees that B happens after A and C after B.
>  * Note: we only require RCpc transitivity.
>  * Note: the CPU doing B need not be c0 or c1
>  *
>  * Example:
>  *
>  *   CPU0            CPU1            CPU2
>  *
>  *   LOCK rq(0)->lock
>  *   sched-out X
>  *   sched-in Y
>  *   UNLOCK rq(0)->lock
>  *
>  *                                   LOCK rq(0)->lock // orders against CPU0
>  *                                   dequeue X
>  *                                   UNLOCK rq(0)->lock
>  *
>  *                                   LOCK rq(1)->lock
>  *                                   enqueue X
>  *                                   UNLOCK rq(1)->lock
>  *
>  *                   LOCK rq(1)->lock // orders against CPU2
>  *                   sched-out Z
>  *                   sched-in X
>  *                   UNLOCK rq(1)->lock

We might want to "specialize" this snippet to emphasize (or illustrate)
a particular _cycle_; one possible way of "closing the chain" would be:

   CPU0

	smp_store_release(&X->on_cpu, 0);	/* from sched-out X */
	UNLOCK rq(0)->lock

   CPU2

	LOCK rq(0)->lock
	UNLOCK rq(1)->lock

   CPU1

	LOCK rq(1)->lock
	X->on_cpu = 1;				/* from sched-in X */

   BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(cf. Z6.2 in test6.pdf)

>  *
>  *
>  *  BLOCKING -- aka. SLEEP + WAKEUP
>  *
>  * For blocking we (obviously) need to provide the same guarantee as for
>  * migration. However the means are completely different as there is no lock
>  * chain to provide order. Instead we do:
>  *
>  *   1) smp_store_release(X->on_cpu, 0)
>  *   2) smp_cond_load_acquire(!X->on_cpu)
>  *
>  * Example:
>  *
>  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
>  *
>  *   LOCK rq(0)->lock LOCK X->pi_lock
>  *   dequeue X
>  *   sched-out X
>  *   smp_store_release(X->on_cpu, 0);
>  *
>  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
>  *                    X->state = WAKING
>  *                    set_task_cpu(X,2)
>  *
>  *                    LOCK rq(2)->lock
>  *                    enqueue X
>  *                    X->state = RUNNING
>  *                    UNLOCK rq(2)->lock
>  *
>  *                                          LOCK rq(2)->lock // orders against CPU1
>  *                                          sched-out Z
>  *                                          sched-in X
>  *                                          UNLOCK rq(2)->lock
>  *
>  *                    UNLOCK X->pi_lock
>  *   UNLOCK rq(0)->lock

Similarly:

   CPU0

	smp_store_release(&X->on_cpu, 0);	/* from sched-out X */

   CPU1

	smp_cond_load_acquire(&X->on_cpu, !VAL);
	UNLOCK rq(2)->lock

   CPU2

	LOCK rq(2)->lock
	X->on_cpu = 1;				// from sched-in X

   BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(cf. WWC in test6.pdf)
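
As a litmus test -- approximating smp_cond_load_acquire() with
smp_load_acquire() plus the 1:r0=0 condition, and modeling the RCpc
lock handoff with release-acquire, as the commit message below also
does (the test and its names are mine) -- this could read as follows,
the "exists" clause being the BUG_ON above:

	C WWC+sched

	{
		on_cpu=1;
	}

	P0(int *on_cpu)
	{
		smp_store_release(on_cpu, 0);	/* sched-out X */
	}

	P1(int *on_cpu, int *rq2_lock)
	{
		int r0;

		r0 = smp_load_acquire(on_cpu);	/* smp_cond_load_acquire() */
		smp_store_release(rq2_lock, 1);	/* UNLOCK rq(2)->lock */
	}

	P2(int *on_cpu, int *rq2_lock)
	{
		int r0;

		r0 = smp_load_acquire(rq2_lock);	/* LOCK rq(2)->lock */
		WRITE_ONCE(*on_cpu, 1);			/* sched-in X */
	}

	exists
	(1:r0=0 /\ 2:r0=1 /\ on_cpu=0)
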
Andrea

>  *
>  *
>  * However; for wakeups there is a second guarantee we must provide, namely we
>  * must observe the state that lead to our wakeup. That is, not only must our
>  * task observe its own prior state, it must also observe the stores prior to
>  * its wakeup.
>  *
>  * This means that any means of doing remote wakeups must order the CPU doing
>  * the wakeup against the CPU the task is going to end up running on. This,
>  * however, is already required for the regular Program-Order guarantee above,
>  * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
>
> commit 63cae12bce9861cec309798d34701cf3da20bc71
> Author: Peter Zijlstra
> Date:   Fri Dec 9 14:59:00 2016 +0100
>
>     perf/core: Fix sys_perf_event_open() vs. hotplug
>
>     There is problem with installing an event in a task that is 'stuck' on
>     an offline CPU.
>
>     Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
>     blocked task doesn't run and doesn't require a CPU etc.. Only on
>     wakeup do we ammend the situation and place the task on a available
>     CPU.
>
>     If we hit such a task with perf_install_in_context() we'll loop until
>     either that task wakes up or the CPU comes back online, if the task
>     waking depends on the event being installed, we're stuck.
>
>     While looking into this issue, I also spotted another problem, if we
>     hit a task with perf_install_in_context() that is in the middle of
>     being migrated, that is we observe the old CPU before sending the IPI,
>     but run the IPI (on the old CPU) while the task is already running on
>     the new CPU, things also go sideways.
>
>     Rework things to rely on task_curr() -- outside of rq->lock -- which
>     is rather tricky. Imagine the following scenario where we're trying to
>     install the first event into our task 't':
>
>     CPU0            CPU1            CPU2
>
>                     (current == t)
>
>                     t->perf_event_ctxp[] = ctx;
>                     smp_mb();
>                     cpu = task_cpu(t);
>
>                     switch(t, n);
>                                     migrate(t, 2);
>                     switch(p, t);
>
>                                     ctx = t->perf_event_ctxp[]; // must not be NULL
>
>     smp_function_call(cpu, ..);
>
>                     generic_exec_single()
>                       func();
>                         spin_lock(ctx->lock);
>                         if (task_curr(t)) // false
>
>                         add_event_to_ctx();
>                         spin_unlock(ctx->lock);
>
>                                     perf_event_context_sched_in();
>                                       spin_lock(ctx->lock);
>                                       // sees event
>
>     So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
>     Because if CPU2's load of that variable were to observe NULL, it would
>     not try to schedule the ctx and we'd have a task running without its
>     counter, which would be 'bad'.
>
>     As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
>     first and not see the event yet, then CPU0 must observe task_curr()
>     and retry. If the install happens first, then we must see the event on
>     sched-in and all is well.
>
>     I think we can translate the first part (until the 'must not be NULL')
>     of the scenario to a litmus test like:
>
>       C C-peterz
>
>       {
>       }
>
>       P0(int *x, int *y)
>       {
>               int r1;
>
>               WRITE_ONCE(*x, 1);
>               smp_mb();
>               r1 = READ_ONCE(*y);
>       }
>
>       P1(int *y, int *z)
>       {
>               WRITE_ONCE(*y, 1);
>               smp_store_release(z, 1);
>       }
>
>       P2(int *x, int *z)
>       {
>               int r1;
>               int r2;
>
>               r1 = smp_load_acquire(z);
>               smp_mb();
>               r2 = READ_ONCE(*x);
>       }
>
>       exists
>       (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>
>     Where:
>       x is perf_event_ctxp[],
>       y is our tasks's CPU, and
>       z is our task being placed on the rq of CPU2.
>
>     The P0 smp_mb() is the one added by this patch, ordering the store to
>     perf_event_ctxp[] from find_get_context() and the load of task_cpu()
>     in task_function_call().
>
>     The smp_store_release/smp_load_acquire model the RCpc locking of the
>     rq->lock and the smp_mb() of P2 is the context switch switching from
>     whatever CPU2 was running to our task 't'.
>
>     This litmus test evaluates into:
>
>       Test C-peterz Allowed
>       States 7
>       0:r1=0; 2:r1=0; 2:r2=0;
>       0:r1=0; 2:r1=0; 2:r2=1;
>       0:r1=0; 2:r1=1; 2:r2=1;
>       0:r1=1; 2:r1=0; 2:r2=0;
>       0:r1=1; 2:r1=0; 2:r2=1;
>       0:r1=1; 2:r1=1; 2:r2=0;
>       0:r1=1; 2:r1=1; 2:r2=1;
>       No
>       Witnesses
>       Positive: 0 Negative: 7
>       Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>       Observation C-peterz Never 0 7
>       Hash=e427f41d9146b2a5445101d3e2fcaa34
>
>     And the strong and weak model agree.
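
A quick way to see that the P0 smp_mb() is what breaks this cycle:
drop it,

	P0(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*x, 1);
		/* smp_mb() removed */
		r1 = READ_ONCE(*y);
	}

and the "exists" clause should become reachable, the store to x and
the load from y being no longer ordered on P0.
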
>     Reported-by: Mark Rutland
>     Tested-by: Mark Rutland
>     Signed-off-by: Peter Zijlstra (Intel)
>     Cc: Alexander Shishkin
>     Cc: Arnaldo Carvalho de Melo
>     Cc: Arnaldo Carvalho de Melo
>     Cc: Jiri Olsa
>     Cc: Linus Torvalds
>     Cc: Peter Zijlstra
>     Cc: Sebastian Andrzej Siewior
>     Cc: Stephane Eranian
>     Cc: Thomas Gleixner
>     Cc: Vince Weaver
>     Cc: Will Deacon
>     Cc: jeremy.linton@arm.com
>     Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
>     Signed-off-by: Ingo Molnar
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ab15509fab8c..72ce7d63e561 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2249,7 +2249,7 @@ static int __perf_install_in_context(void *info)
>  	struct perf_event_context *ctx = event->ctx;
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -	bool activate = true;
> +	bool reprogram = true;
>  	int ret = 0;
>
>  	raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2257,27 +2257,26 @@ static int __perf_install_in_context(void *info)
>  		raw_spin_lock(&ctx->lock);
>  		task_ctx = ctx;
>
> -		/* If we're on the wrong CPU, try again */
> -		if (task_cpu(ctx->task) != smp_processor_id()) {
> -			ret = -ESRCH;
> -			goto unlock;
> -		}
> +		reprogram = (ctx->task == current);
>
>  		/*
> -		 * If we're on the right CPU, see if the task we target is
> -		 * current, if not we don't have to activate the ctx, a future
> -		 * context switch will do that for us.
> +		 * If the task is running, it must be running on this CPU,
> +		 * otherwise we cannot reprogram things.
> +		 *
> +		 * If its not running, we don't care, ctx->lock will
> +		 * serialize against it becoming runnable.
>  		 */
> -		if (ctx->task != current)
> -			activate = false;
> -		else
> -			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +		if (task_curr(ctx->task) && !reprogram) {
> +			ret = -ESRCH;
> +			goto unlock;
> +		}
>
> +		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>  	} else if (task_ctx) {
>  		raw_spin_lock(&task_ctx->lock);
>  	}
>
> -	if (activate) {
> +	if (reprogram) {
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  		add_event_to_ctx(event, ctx);
>  		ctx_resched(cpuctx, task_ctx);
> @@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	/*
>  	 * Installing events is tricky because we cannot rely on ctx->is_active
>  	 * to be set in case this is the nr_events 0 -> 1 transition.
> +	 *
> +	 * Instead we use task_curr(), which tells us if the task is running.
> +	 * However, since we use task_curr() outside of rq::lock, we can race
> +	 * against the actual state. This means the result can be wrong.
> +	 *
> +	 * If we get a false positive, we retry, this is harmless.
> +	 *
> +	 * If we get a false negative, things are complicated. If we are after
> +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +	 * value must be correct. If we're before, it doesn't matter since
> +	 * perf_event_context_sched_in() will program the counter.
> +	 *
> +	 * However, this hinges on the remote context switch having observed
> +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> +	 * ctx::lock in perf_event_context_sched_in().
> +	 *
> +	 * We do this by task_function_call(), if the IPI fails to hit the task
> +	 * we know any future context switch of task must see the
> +	 * perf_event_ctpx[] store.
>  	 */
> -again:
> +
>  	/*
> -	 * Cannot use task_function_call() because we need to run on the task's
> -	 * CPU regardless of whether its current or not.
> +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +	 * task_cpu() load, such that if the IPI then does not find the task
> +	 * running, a future context switch of that task must observe the
> +	 * store.
>  	 */
> -	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +	smp_mb();
> +again:
> +	if (!task_function_call(task, __perf_install_in_context, event))
>  		return;
>
>  	raw_spin_lock_irq(&ctx->lock);
> @@ -2348,12 +2370,16 @@ again:
>  		raw_spin_unlock_irq(&ctx->lock);
>  		return;
>  	}
> -	raw_spin_unlock_irq(&ctx->lock);
>  	/*
> -	 * Since !ctx->is_active doesn't mean anything, we must IPI
> -	 * unconditionally.
> +	 * If the task is not running, ctx->lock will avoid it becoming so,
> +	 * thus we can safely install the event.
>  	 */
> -	goto again;
> +	if (task_curr(task)) {
> +		raw_spin_unlock_irq(&ctx->lock);
> +		goto again;
> +	}
> +	add_event_to_ctx(event, ctx);
> +	raw_spin_unlock_irq(&ctx->lock);
>  }
>
>  /*