From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751526AbdJEQTR (ORCPT <rfc822;w@1wt.eu>);
        Thu, 5 Oct 2017 12:19:17 -0400
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:36700 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
        by vger.kernel.org with ESMTP id S1751216AbdJEQTP (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 5 Oct 2017 12:19:15 -0400
Date: Thu, 5 Oct 2017 09:19:09 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
        dipankar@in.ibm.com, akpm@linux-foundation.org,
        mathieu.desnoyers@efficios.com, josh@joshtriplett.org,
        tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com,
        edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com
Subject: Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of
 migrations and delays
Reply-To: paulmck@linux.vnet.ibm.com
References: <20171004212915.GA10089@linux.vnet.ibm.com>
 <1507152575-11055-1-git-send-email-paulmck@linux.vnet.ibm.com>
 <20171005094114.yoqkhndfwidczfqj@hirez.programming.kicks-ass.net>
 <20171005145513.GO3521@linux.vnet.ibm.com>
 <20171005153913.5wanmnfmi6i3byv7@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171005153913.5wanmnfmi6i3byv7@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
x-cbid: 17100516-0044-0000-0000-0000039AF34D
X-IBM-SpamModules-Scores: 
X-IBM-SpamModules-Versions: BY=3.00007847; HX=3.00000241; KW=3.00000007;
 PH=3.00000004; SC=3.00000235; SDB=6.00926874; UDB=6.00466327; IPR=6.00707118;
 BA=6.00005623; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000;
 ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00017408; XFM=3.00000015;
 UTC=2017-10-05 16:19:13
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17100516-0045-0000-0000-000007CA0302
Message-Id: <20171005161909.GS3521@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-10-05_07:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0
 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
 adjust=0 reason=mlx scancount=1 engine=8.0.1-1707230000
 definitions=main-1710050224
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 05, 2017 at 05:39:13PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > > Consider the following admittedly improbable sequence of events:
> > > > 
> > > > o	RCU is initially idle.
> > > > 
> > > > o	Task A on CPU 0 executes rcu_read_lock().
> > > > 
> > > > o	Task B on CPU 1 executes synchronize_rcu(), which must
> > > > 	wait on Task A:
> > > > 
> > > > 	o	Task B registers the callback, which starts a new
> > > > 		grace period, awakening the grace-period kthread
> > > > 		on CPU 3, which immediately starts a new grace period.
> > > > 
> > > > 	o	Task B migrates to CPU 2, which provides a quiescent
> > > > 		state for both CPUs 1 and 2.
> > > > 
> > > > 	o	Both CPUs 1 and 2 take scheduling-clock interrupts,
> > > > 		and both invoke RCU_SOFTIRQ, both thus learning of the
> > > > 		new grace period.
> > > > 
> > > > 	o	Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > > 
> > > > o	CPUs 2 and 3 pass through quiescent states, which are reported
> > > > 	to core RCU.
> > > > 
> > > > o	Task B is resumed just long enough to be migrated to CPU 3,
> > > > 	and then is once again delayed.
> > > > 
> > > > o	Task A executes rcu_read_unlock(), exiting its RCU read-side
> > > > 	critical section.
> > > > 
> > > > o	CPU 0 passes through a quiescent sate, which is reported to
> > > > 	core RCU.  Only CPU 1 continues to block the grace period.
> > > > 
> > > > o	CPU 1 passes through a quiescent state, which is reported to
> > > > 	core RCU.  This ends the grace period, and CPU 1 therefore
> > > > 	invokes its callbacks, one of which awakens Task B via
> > > > 	complete().
> > > > 
> > > > o	Task B resumes (still on CPU 3) and starts executing
> > > > 	wait_for_completion(), which sees that the completion has
> > > > 	already completed, and thus does not block.  It returns from
> > > > 	the synchronize_rcu() without any ordering against the
> > > > 	end of Task A's RCU read-side critical section.
> > > > 
> > > > 	It can therefore mess up Task A's RCU read-side critical section,
> > > > 	in theory, anyway.
> > > 
> > > I'm not sure I follow, at the very least the wait_for_completion() does
> > > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > > by complete(), no?
> > 
> > Your point being that both wait_for_completion() and complete() acquire
> > and release the same lock?  (Yes, I suspect that I was confusing this
> > with wait_event() and wake_up(), just so you know.)
> 
> Well, fundamentally complete()/wait_for_completion() is a message-pass
> and they include a RELEASE/ACQUIRE pair for causal reasons.
> 
> Per the implementation they use a spinlock, but any implementation needs
> to provide at least that RELEASE/ACQUIRE pair.
> 
> > > And is not CPU0's QS reporting ordered against that complete()?
> > 
> > Mumble mumble mumble powerpc mumble mumble mumble...
> > 
> > OK, I will make this new memory barrier only execute for powerpc.
> > 
> > Or am I missing something else here?
> 
> So I'm not entirely clear on the required semantics here; why do we need
> a full mb? I'm thinking CPU0's QS propagating through the tree and
> arriving at the root node is a multi-copy-atomic / transitive thing and
> all CPUs will agree the system QS has ended, right?
> 
> Whichever CPU establishes the system QS does complete() and the
> wait_for_completion() then has the weak-transitive causal relation to
> that, ensuring that -- in the above example -- CPU3 must be _after_
> CPU0's rcu_read_unlock().

Yes, the ordering does need to be visible to uninvolved CPUs, so
release-acquire is not necessarily strong enough.

My current thought is like this:

	if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
		smp_mb();

Thoughts?

							Thanx, Paul