From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751526AbdJEQTR (ORCPT ); Thu, 5 Oct 2017 12:19:17 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:36700 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751216AbdJEQTP (ORCPT ); Thu, 5 Oct 2017 12:19:15 -0400 Date: Thu, 5 Oct 2017 09:19:09 -0700 From: "Paul E. McKenney" To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com Subject: Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays Reply-To: paulmck@linux.vnet.ibm.com References: <20171004212915.GA10089@linux.vnet.ibm.com> <1507152575-11055-1-git-send-email-paulmck@linux.vnet.ibm.com> <20171005094114.yoqkhndfwidczfqj@hirez.programming.kicks-ass.net> <20171005145513.GO3521@linux.vnet.ibm.com> <20171005153913.5wanmnfmi6i3byv7@hirez.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171005153913.5wanmnfmi6i3byv7@hirez.programming.kicks-ass.net> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 17100516-0044-0000-0000-0000039AF34D X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00007847; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000235; SDB=6.00926874; UDB=6.00466327; IPR=6.00707118; BA=6.00005623; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00017408; XFM=3.00000015; UTC=2017-10-05 16:19:13 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 17100516-0045-0000-0000-000007CA0302 Message-Id: <20171005161909.GS3521@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-10-05_07:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1707230000 definitions=main-1710050224 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 05, 2017 at 05:39:13PM +0200, Peter Zijlstra wrote: > On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote: > > On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote: > > > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote: > > > > Consider the following admittedly improbable sequence of events: > > > > > > > > o RCU is initially idle. > > > > > > > > o Task A on CPU 0 executes rcu_read_lock(). > > > > > > > > o Task B on CPU 1 executes synchronize_rcu(), which must > > > > wait on Task A: > > > > > > > > o Task B registers the callback, which starts a new > > > > grace period, awakening the grace-period kthread > > > > on CPU 3, which immediately starts a new grace period. > > > > > > > > o Task B migrates to CPU 2, which provides a quiescent > > > > state for both CPUs 1 and 2. > > > > > > > > o Both CPUs 1 and 2 take scheduling-clock interrupts, > > > > and both invoke RCU_SOFTIRQ, both thus learning of the > > > > new grace period. > > > > > > > > o Task B is delayed, perhaps by vCPU preemption on CPU 2. > > > > > > > > o CPUs 2 and 3 pass through quiescent states, which are reported > > > > to core RCU. > > > > > > > > o Task B is resumed just long enough to be migrated to CPU 3, > > > > and then is once again delayed. > > > > > > > > o Task A executes rcu_read_unlock(), exiting its RCU read-side > > > > critical section. > > > > > > > > o CPU 0 passes through a quiescent sate, which is reported to > > > > core RCU. Only CPU 1 continues to block the grace period. > > > > > > > > o CPU 1 passes through a quiescent state, which is reported to > > > > core RCU. This ends the grace period, and CPU 1 therefore > > > > invokes its callbacks, one of which awakens Task B via > > > > complete(). > > > > > > > > o Task B resumes (still on CPU 3) and starts executing > > > > wait_for_completion(), which sees that the completion has > > > > already completed, and thus does not block. It returns from > > > > the synchronize_rcu() without any ordering against the > > > > end of Task A's RCU read-side critical section. > > > > > > > > It can therefore mess up Task A's RCU read-side critical section, > > > > in theory, anyway. > > > > > > I'm not sure I follow, at the very least the wait_for_completion() does > > > an ACQUIRE such that it observes the state prior to the RELEASE as done > > > by complete(), no? > > > > Your point being that both wait_for_completion() and complete() acquire > > and release the same lock? (Yes, I suspect that I was confusing this > > with wait_event() and wake_up(), just so you know.) > > Well, fundamentally complete()/wait_for_completion() is a message-pass > and they include a RELEASE/ACQUIRE pair for causal reasons. > > Per the implementation they use a spinlock, but any implementation needs > to provide at least that RELEASE/ACQUIRE pair. > > > > And is not CPU0's QS reporting ordered against that complete()? > > > > Mumble mumble mumble powerpc mumble mumble mumble... > > > > OK, I will make this new memory barrier only execute for powerpc. > > > > Or am I missing something else here? > > So I'm not entirely clear on the required semantics here; why do we need > a full mb? I'm thinking CPU0's QS propagating through the tree and > arriving at the root node is a multi-copy-atomic / transitive thing and > all CPUs will agree the system QS has ended, right? > > Whichever CPU establishes the system QS does complete() and the > wait_for_completion() then has the weak-transitive causal relation to > that, ensuring that -- in the above example -- CPU3 must be _after_ > CPU0's rcu_read_unlock(). Yes, the ordering does need to be visible to uninvolved CPUs, so release-acquire is not necessarily strong enough. My current thought is like this: if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE)) smp_mb(); Thoughts? Thanx, Paul