From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 27 Jun 2018 08:43:17 -0700
From: "Paul E. McKenney"
McKenney" To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com, joel@joelfernandes.org Subject: Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline Reply-To: paulmck@linux.vnet.ibm.com References: <20180626002052.GA24146@linux.vnet.ibm.com> <20180626171048.2181-13-paulmck@linux.vnet.ibm.com> <20180626175119.GL2494@hirez.programming.kicks-ass.net> <20180626182950.GH3593@linux.vnet.ibm.com> <20180626202615.GA32162@linux.vnet.ibm.com> <20180626203225.GT2494@hirez.programming.kicks-ass.net> <20180626234004.GQ3593@linux.vnet.ibm.com> <20180627083335.GA7184@worktop.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180627083335.GA7184@worktop.programming.kicks-ass.net> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 18062715-2213-0000-0000-000002C0C583 X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00009264; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000266; SDB=6.01053119; UDB=6.00539939; IPR=6.00831043; MB=3.00021883; MTD=3.00000008; XFM=3.00000015; UTC=2018-06-27 15:41:16 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18062715-2214-0000-0000-00005AA0959B Message-Id: <20180627154317.GX3593@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2018-06-27_03:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1806210000 definitions=main-1806270169 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 27, 2018 at 10:33:35AM +0200, Peter Zijlstra wrote: > On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote: > > The options I have considered are as follows: > > > > 1. Stick with the no-failsafe approach, adding the lock as shown > > in this patch. (I obviously prefer this approach.) > > > > 2. Stick with the no-failsafe approach, but rely on RCU's grace-period > > kthread to wake up later due to its timed wait during the > > force-quiescent-state process. This would be a bit obnoxious, > > as it requires passing a don't-wake flag (or some such) up the > > quiescent-state reporting mechanism. It would also needlessly > > delay grace-period ends, especially on large systems (RCU scales > > up the FQS delay on larger systems to maintain limited CPU > > consumption per unit time). > > > > 3. Stick with the no-failsafe approach, but have the quiescent-state > > reporting code hand back a value indicating that a wakeup is needed. > > Also a bit obnoxious, as this value would need to be threaded up > > the reporting code's return path. Simple in theory, but a bit > > of an ugly change, especially for the many places in the code that > > currently expect quiescent-state reporting to be an unconditional > > fire-and-forget operation. > > You can combine 2 and 3. Use a skip wakeup flag and ignore the return > value most times. Let me do that just to see how horrible it is. 
> > 4.  Re-introduce the old fail-safe code, and don't report the
> >     quiescent state at CPU-offline time, relying on the fail-safe
> >     code to do so.  This also potentially introduces delays and can
> >     add extra FQS scans, which in turn increases lock contention.
> >
> > So what did you have in mind?
>
> The thing I talked about last night before crashing is the patch below;
> it does however suffer from a little false-negative, much like the one
> you explained earlier. It allows @qsmaskinit to retain the bit set after
> offline.
>
> I had hoped to be able to clear @qsmaskinit unconditionally, but that
> doesn't quite work.

Yes, unless you are insanely careful (and possess an unusual tolerance
for complexity), you will end up with inconsistent ->qsmask fields,
which will get you too-short grace periods, grace-period hangs, or maybe
even both.  For one thing, whatever code sets/clears a leaf rcu_node
structure's ->qsmaskinit must propagate that change up the tree.  If
that code is not grace-period initialization, then that code must
somehow synchronize correctly with grace-period initialization.  For
example, by introducing a lock.  ;-)

> The other approach is yet another mask @qsmaskofflinenext which the
> kthread will use to clear bits on @qsmaskinitnext.

And here I thought that my current use of only three such masks was
getting a bit ornate.  ;-)

> In any case, aside from the above, the below contains a bunch of missing
> WRITE_ONCE()s. Since you read the various @qsmask variables using
> READ_ONCE() you must also consistently update them using WRITE_ONCE(),
> otherwise it's all still buggered.

And I introduced those READ_ONCE() calls an embarrassingly long time
ago, didn't I?  But yes, any update needs to use WRITE_ONCE().

I will put together a patch with your Reported-by.  No, wait, I guess I
start with pieces of your patch below.  To that end, may I have your
Signed-off-by for the WRITE_ONCE() pieces?
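The rule being applied here, shown on a made-up field rather than on the
RCU internals: once a field is read locklessly via READ_ONCE(), every
update to it, including the lock-protected ones, needs WRITE_ONCE() so
the compiler cannot tear, fuse, or invent stores that the lockless
readers might observe.  A minimal sketch of the pattern (struct foo and
its helpers are hypothetical):

#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct foo {
        spinlock_t lock;
        unsigned long state;    /* Read locklessly, updated under ->lock. */
};

/* Lockless reader: must use READ_ONCE(). */
static bool foo_is_idle(struct foo *f)
{
        return !(READ_ONCE(f->state) & 0x1);
}

/*
 * Updater holds ->lock, but still needs WRITE_ONCE() because of the
 * lockless readers; a plain "f->state &= ~mask" would let the compiler
 * tear or refetch the store in ways those readers could observe.
 */
static void foo_clear_bits(struct foo *f, unsigned long mask)
{
        lockdep_assert_held(&f->lock);
        WRITE_ONCE(f->state, f->state & ~mask);
}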
> ---
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7832dd556490..8713048d5103 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
>  	.abbr = sabbr, \
>  	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
> -	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
>  }
>
>  RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
> @@ -209,7 +208,12 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio);
>   */
>  unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
>  {
> -	return READ_ONCE(rnp->qsmaskinitnext);
> +	/*
> +	 * For both online and offline we first set/clear @qsmaskinitnext,
> +	 * and complete by propagating into @qsmaskinit. As long as the bit
> +	 * remains in either mask, RCU is still online.
> +	 */
> +	return READ_ONCE(rnp->qsmaskinit) | READ_ONCE(rnp->qsmaskinitnext);
>  }
>
>  /*
> @@ -1928,19 +1932,17 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  	 */
>  	rsp->gp_state = RCU_GP_ONOFF;
>  	rcu_for_each_leaf_node(rsp, rnp) {
> -		spin_lock(&rsp->ofl_lock);
>  		raw_spin_lock_irq_rcu_node(rnp);
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
>  			raw_spin_unlock_irq_rcu_node(rnp);
> -			spin_unlock(&rsp->ofl_lock);
>  			continue;
>  		}
>
>  		/* Record old state, apply changes to ->qsmaskinit field. */
>  		oldmask = rnp->qsmaskinit;
> -		rnp->qsmaskinit = rnp->qsmaskinitnext;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinitnext);
>
>  		/* If zero-ness of ->qsmaskinit changed, propagate up tree. */
>  		if (!oldmask != !rnp->qsmaskinit) {
> @@ -1970,7 +1972,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		}
>
>  		raw_spin_unlock_irq_rcu_node(rnp);
> -		spin_unlock(&rsp->ofl_lock);
>  	}
>  	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
>
> @@ -1992,7 +1993,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		rdp = this_cpu_ptr(rsp->rda);
>  		rcu_preempt_check_blocked_tasks(rsp, rnp);
> -		rnp->qsmask = rnp->qsmaskinit;
> +		WRITE_ONCE(rnp->qsmask, rnp->qsmaskinit);
>  		WRITE_ONCE(rnp->gp_seq, rsp->gp_seq);
>  		if (rnp == rdp->mynode)
>  			(void)__note_gp_changes(rsp, rnp, rdp);
> @@ -2295,7 +2296,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>  		WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */
>  		WARN_ON_ONCE(!rcu_is_leaf_node(rnp) &&
>  			     rcu_preempt_blocked_readers_cgp(rnp));
> -		rnp->qsmask &= ~mask;
> +		WRITE_ONCE(rnp->qsmask, rnp->qsmask & ~mask);
>  		trace_rcu_quiescent_state_report(rsp->name, rnp->gp_seq,
>  						 mask, rnp->qsmask, rnp->level,
>  						 rnp->grplo, rnp->grphi,
> @@ -2503,7 +2504,7 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
>  		if (!rnp)
>  			break;
>  		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
> -		rnp->qsmaskinit &= ~mask;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
>  		/* Between grace periods, so better already be zero! */
>  		WARN_ON_ONCE(rnp->qsmask);
>  		if (rnp->qsmaskinit) {
> @@ -3522,7 +3523,7 @@ static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
>  			return;
>  		raw_spin_lock_rcu_node(rnp); /* Interrupts already disabled. */
>  		oldmask = rnp->qsmaskinit;
> -		rnp->qsmaskinit |= mask;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit | mask);
>  		raw_spin_unlock_rcu_node(rnp); /* Interrupts remain disabled. */
>  		if (oldmask)
>  			return;
> @@ -3733,7 +3734,7 @@ void rcu_cpu_starting(unsigned int cpu)
>  	rnp = rdp->mynode;
>  	mask = rdp->grpmask;
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> -	rnp->qsmaskinitnext |= mask;
> +	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>  	oldmask = rnp->expmaskinitnext;
>  	rnp->expmaskinitnext |= mask;
>  	oldmask ^= rnp->expmaskinitnext;
> @@ -3768,18 +3769,36 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
>
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
> -	spin_lock(&rsp->ofl_lock);
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
>  	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
>  	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
> +
> +	/*
> +	 * First clear @qsmaskinitnext such that we'll not start a new GP
> +	 * on this outgoing CPU.
> +	 */
> +	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
>  	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
> -		/* Report quiescent state -before- changing ->qsmaskinitnext! */
> +		/*
> +		 * Report the QS on the outgoing CPU. This will propagate the
> +		 * cleared bit into @qsmaskinit and @qsmask. We rely on
> +		 * @qsmaskinit still containing this CPU such that
> +		 * rcu_rnp_online_cpus() will still consider RCU online.
> +		 *
> +		 * This allows us to wake the GP kthread, since wakeups rely on
> +		 * RCU.
> +		 */
> +		WARN_ON_ONCE(!(rnp->qsmaskinit & mask));
>  		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	} else {
> +		/*
> +		 * If there was no QS required, clear @qsmaskinit now to
> +		 * finalize the offline.
> +		 */
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
>  	}
> -	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> -	spin_unlock(&rsp->ofl_lock);
>  }
>
>  /*
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4e74df768c57..a1528b970257 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -84,19 +84,24 @@ struct rcu_node {
>  	unsigned long gp_seq;	/* Track rsp->rcu_gp_seq. */
>  	unsigned long gp_seq_needed; /* Track rsp->rcu_gp_seq_needed. */
>  	unsigned long completedqs; /* All QSes done for this node. */
> -	unsigned long qsmask;	/* CPUs or groups that need to switch in */
> -				/* order for current grace period to proceed.*/
> -				/* In leaf rcu_node, each bit corresponds to */
> -				/* an rcu_data structure, otherwise, each */
> -				/* bit corresponds to a child rcu_node */
> -				/* structure. */
> -	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
> +
> +	/*
> +	 * @qsmask - CPUs pending in this GP

Huh.  I wasn't aware that docbook/sphinx/whatever knew about this style
of documentation.  I will have to check that out in the fulness of
time...

> +	 * @qsmaskinit - CPUs we started this GP with - CPUs online at the last GP start
> +	 * @qsmaskinitnext - CPUs we'll start the next GP with

Only if that GP were to start immediately, of course.  Note that if CPU 0
and CPU 88 come online in that order, it is quite possible that there
will be a grace period that waits on CPU 88 but not CPU 0.  This would
happen if CPU 0's rcu_node structure checked ->qsmaskinitnext, CPU 0 came
online, CPU 88 came online, and then CPU 88's rcu_node structure checked
->qsmaskinitnext.

So the order in which CPUs come online and go offline is not necessarily
the order in which successive grace periods start/stop paying attention
to them.

							Thanx, Paul

> +	 *
> +	 * online: we add the incoming CPU to @qsmaskinitnext which will then
> +	 * be propagated into @qsmaskinit and @qsmask by starting/joining a GP.
> +	 *
> +	 * offline: we remove the CPU from @qsmaskinitnext such that the
> +	 * outgoing CPU will not be part of a next GP, which will then be
> +	 * propagated into @qsmaskinit and @qsmask by completing/leaving a GP.
> +	 */
> +	unsigned long qsmask;
>  	unsigned long qsmaskinit;
> -					/* Per-GP initial value for qsmask. */
> -					/* Initialized from ->qsmaskinitnext at the */
> -					/* beginning of each grace period. */
>  	unsigned long qsmaskinitnext;
> -					/* Online CPUs for next grace period. */
> +
> +	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
>  	unsigned long expmask;	/* CPUs or groups that need to check in */
>  				/* to allow the current expedited GP */
>  				/* to complete. */
> @@ -367,10 +372,6 @@ struct rcu_state {
>  	const char *name;			/* Name of structure. */
>  	char abbr;				/* Abbreviated name. */
>  	struct list_head flavors;		/* List of RCU flavors. */
> -
> -	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
> -						/* Synchronize offline with */
> -						/* GP pre-initialization. */
>  };
>
>  /* Values for rcu_state structure's gp_flags field. */
>
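As an aside on the "propagate that change up the tree" point above: in
an rcu_node-style hierarchy, clearing bits in a leaf's mask only matters
to the parent once that leaf's mask drops to zero, at which point the
leaf's own bit must be cleared in the parent, and so on toward the root.
A toy model of that walk (not the kernel's rcu_cleanup_dead_rnp(), just
the shape of it; struct toy_node and toy_clear_and_propagate() are
made-up names):

struct toy_node {
        struct toy_node *parent;        /* NULL at the root. */
        unsigned long grpmask;          /* This node's bit in parent->mask. */
        unsigned long mask;             /* Bits for children (CPUs at a leaf). */
};

/*
 * Clear @mask in @leaf and, whenever a node's mask drops to zero, clear
 * that node's own bit in its parent, continuing toward the root.  In the
 * real code this walk has to be coordinated with grace-period
 * initialization, which is where the locking discussion above comes in.
 */
static void toy_clear_and_propagate(struct toy_node *leaf, unsigned long mask)
{
        struct toy_node *np = leaf;

        for (;;) {
                np->mask &= ~mask;
                if (np->mask || !np->parent)
                        break;          /* Still nonzero, or hit the root. */
                mask = np->grpmask;     /* Propagate this node's bit upward. */
                np = np->parent;
        }
}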