From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 14 Mar 2017 09:21:26 -0700
From: "Paul E. McKenney"
To: Lance Roy
Cc: Andrey Konovalov, Lai Jiangshan, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, LKML, syzkaller, Dmitry Vyukov, Kostya Serebryany
Subject: Re: srcu: BUG in __synchronize_srcu
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170310222609.GN30506@linux.vnet.ibm.com> <20170314004702.6650857e@gmail.com>
In-Reply-To: <20170314004702.6650857e@gmail.com>
Message-Id: <20170314162126.GC3637@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org
On Tue, Mar 14, 2017 at 12:47:02AM -0700, Lance Roy wrote:
> I am not sure how the rcu_scheduler_active changes in __synchronize_srcu
> work, but there seem to be a few problems in them.  First,
> "if (done && likely(!driving))" on line 453 doesn't appear to ever happen,
> as driving doesn't get set to false when srcu_reschedule is called.  This
> seems like it could cause a race condition if another thread notices that
> ->running is false, adds itself to the queue, sets ->running to true, and
> starts its own grace period before the first thread acquires the lock
> again on line 469.  The first thread will then acquire the lock, set
> ->running to false, and release the lock, resulting in an invalid state
> where ->running is false but the second thread is still trying to finish
> its grace period.
>
> Second, the while loop on line 455 seems to violate the rule that
> ->running shouldn't be false when there are entries in the queue.  If a
> second thread adds itself to the queue while the first thread is driving
> SRCU inside that loop, and the first thread then finishes its own grace
> period and exits the loop, it will set ->running to false even though
> there is still a thread on the queue.
>
> The second issue requires rcu_scheduler_active to be RCU_SCHEDULER_INIT
> to occur, and as I don't know what the assumptions during
> RCU_SCHEDULER_INIT are I don't know if it is actually a problem, but the
> first issue looks like it could occur at any time.

Thank you for looking into this!

I determined that my patch-order strategy was flawed, as it required me
to rewrite the mid-boot functionality several times.  I therefore removed
the mid-boot commits.  I will add them in later, but they will use a
rather different approach based on a grace-period sequence number similar
to that used by the expedited grace periods.

Which should also teach me to be less aggressive about pushing new code
to -next.  For a few weeks, anyway.
;-)

							Thanx, Paul

> Thanks,
> Lance
>
> On Fri, 10 Mar 2017 14:26:09 -0800
> "Paul E. McKenney" wrote:
> > On Fri, Mar 10, 2017 at 08:29:55PM +0100, Andrey Konovalov wrote:
> > > On Fri, Mar 10, 2017 at 8:28 PM, Andrey Konovalov wrote:
> > > > Kernel panic - not syncing: Fatal exception
> >
> > So the theory is that if !sp->running, all of SRCU's queues must be
> > empty.  So if you are holding ->queue_lock (with irqs disabled) and
> > you see !sp->running, and then you enqueue a callback on
> > ->batch_check0, then that callback must be the first in the list.
> > And the code preceding the WARN_ON() you triggered does in fact check
> > and enqueue while holding ->queue_lock with irqs disabled.
> >
> > And rcu_batch_queue() does operate FIFO as required.  (Otherwise,
> > srcu_barrier() would not work.)
> >
> > There are only three calls to rcu_batch_queue(), and the one involved
> > with the WARN_ON() enqueues to ->batch_check0.  The other two enqueue
> > to ->batch_queue.  Callbacks move from ->batch_queue to ->batch_check0
> > to ->batch_check1 to ->batch_done, so nothing should slip in front.
> >
> > Of course, if ->running were ever set to false with any of
> > ->batch_check0, ->batch_check1, or ->batch_done non-empty, this
> > WARN_ON() would trigger.  But srcu_reschedule() sets it to false only
> > if all four batches are empty (and checks and sets under
> > ->queue_lock), and all other cases where it is set to false happen at
> > initialization time, and also clear out the queues.  Of course, if
> > someone raced an init_srcu_struct() with either a call_srcu() or
> > synchronize_srcu(), all bets are off.  Now, mmu_notifier.c does
> > invoke init_srcu_struct() manually, but it does so at
> > subsys_initcall() time.
> > Which -might- be after other things are happening, so one "hail Mary"
> > attempted fix is to remove mmu_notifier_init() and replace the
> > "static struct srcu_struct srcu" with:
> >
> > 	DEFINE_STATIC_SRCU(srcu);
> >
> > But this might require changing the name -- I vaguely recall some
> > strangeness where the names of statically defined per-CPU variables
> > need to be globally unique even when static.  Easy enough to do,
> > though.  Might need a similar change to the "srcu" instances defined
> > in vmd.c and kvm_host.h -- assuming that this change helps.
> >
> > Another possibility is that something in SRCU is messing with either
> > the queues or the ->running field without holding ->queue_lock.  And
> > that does seem to be happening -- srcu_advance_batches() invokes
> > rcu_batch_move() without holding anything.  Which seems like it could
> > cause trouble if someone else was invoking synchronize_srcu()
> > concurrently.  Those particular invocations might be safe due to
> > access only by a single kthread/workqueue, given that all updates to
> > ->batch_queue are protected by ->queue_lock (aside from
> > initialization).
> >
> > But ->batch_check0 is updated by __synchronize_srcu(), though
> > protected by ->queue_lock, and only if ->running is false, and with
> > both the check and the update protected by the same ->queue_lock
> > critical section.  If ->running is false, then the workqueue cannot
> > be running, so it remains to see if all other updates to
> > ->batch_check0 are either with ->queue_lock held and ->running false
> > on the one hand or from the workqueue handler on the other:
> >
> > srcu_collect_new() updates with ->queue_lock held, but does not check
> > 	->running.  It is invoked only from process_srcu(), which in
> > 	turn is invoked only as a workqueue handler.  The work is
> > 	queued from:
> >
> > 	call_srcu(), which does so with ->queue_lock held having just
> > 		set ->running to true.
> > 	srcu_reschedule(), which invokes it if there are non-empty
> > 		queues.  This is invoked from __synchronize_srcu()
> > 		in the case where it has set ->running to true
> > 		after finding the queues empty, which should imply
> > 		no other instances.
> >
> > 		It is also invoked from process_srcu(), which is
> > 		invoked only as a workqueue handler.  (Yay
> > 		recursive inquiry!)
> >
> > srcu_advance_batches() updates without locks held.  It is invoked
> > 	as follows:
> >
> > 	__synchronize_srcu() in the case where ->running was set,
> > 		which as noted before excludes workqueue handlers.
> >
> > 	process_srcu(), which as noted before is only invoked from
> > 		a workqueue handler.
> >
> > So an SRCU workqueue is invoked only from a workqueue handler, or
> > from some other task that transitioned ->running from false to true
> > while holding ->queue_lock.  There should therefore only be one SRCU
> > workqueue per srcu_struct, so this should be safe.  Though I hope
> > that it can be simplified a bit.  :-/
> >
> > So the only suggestion I have at the moment is static definition of
> > the "srcu" variable.  Lai, Josh, Steve, Mathieu, anything I missed?
> >
> > 							Thanx, Paul
> >