From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752724Ab1GNE3f (ORCPT ); Thu, 14 Jul 2011 00:29:35 -0400 Received: from e7.ny.us.ibm.com ([32.97.182.137]:44680 "EHLO e7.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750807Ab1GNE3e (ORCPT ); Thu, 14 Jul 2011 00:29:34 -0400 Date: Wed, 13 Jul 2011 21:29:32 -0700 From: "Paul E. McKenney" To: Peter Zijlstra Cc: Dave Jones , Linux Kernel , Ingo Molnar Subject: Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock) Message-ID: <20110714042932.GA2932@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20110711231932.GB20367@redhat.com> <1310502014.2309.7.camel@laptop> <20110712205126.GM2326@linux.vnet.ibm.com> <1310507047.2309.10.camel@laptop> <20110712225443.GS2326@linux.vnet.ibm.com> <1310545448.14978.40.camel@twins> <20110713163309.GC2355@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20110713163309.GC2355@linux.vnet.ibm.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 13, 2011 at 09:33:09AM -0700, Paul E. McKenney wrote: > On Wed, Jul 13, 2011 at 10:24:08AM +0200, Peter Zijlstra wrote: > > On Tue, 2011-07-12 at 15:54 -0700, Paul E. McKenney wrote: > > > > @@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) > > > > * prev into current: > > > > */ > > > > spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_); > > > > + rcu_read_acquire(); > > > > > > Oooh... This is a tricky one. Hmmm... > > > > > > > > > Does any of this make sense? > > > > No? > > > > > > @@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev, > > > > */ > > > > #ifndef __ARCH_WANT_UNLOCKED_CTXSW > > > > spin_release(&rq->lock.dep_map, 1, _THIS_IP_); > > > > + rcu_read_release(); > > > > > > My guess is that we don't need the rcu_read_release() -- the arch shouldn't > > > care that we have a non-atomic field in task_struct incremented, right? > > > > > > Or am I confused about what this is doing? > > > > Either that or I used the wrong primitives for what I was meaning to do. > > > > So from looking at rcupdate.h rcu_read_{acquire,release} are the lockdep > > annotations of the rcu_read_lock. The thing we're doing above is context > > switching and not making lockdep upset. > > > > The problem is that held lock state is tracked per task, and we're well > > switching tasks, so we need to transfer the held lock state from the old > > to the new task, since the new task will be unlocking the thing, if at > > that time lockdep finds we're not actually holding it it screams bloody > > murder. > > > > So what we do is we release the lock (annotation only, the actual lock > > says locked) on prev before switching and acquire the lock on next right > > after switching. > > OK, got it. I think, anyway. > > So preemptible RCU and context switches are handled as follows by > your patch, correct? > > 1. The outgoing task is in an RCU read-side critical section > due to the rcu_read_lock() that you added at the beginning > of schedule(). > > 2. In context_switch(), the lockdep state is set so that lockdep > believes that the task has exited its RCU read-side critical > section (thus suppressing the lockdep splat that would otherwise > appear). > > 3. Once the context switch has happened, RCU no longer pays > attention to the outgoing task. So the outgoing task's > current->rcu_read_lock_nesting still indicates that the task is in > an RCU read-side critical section, but RCU does not know or care. > > 4. If the incoming task has done at least one prior context switch, > its current->rcu_read_lock_nesting will indicate that it is > already in an RCU read-side critical section. Since there is > no opportunity for RCU to report a quiescent state during the > context switch itself (because holding the runqueue lock keeps > interrupts disabled), RCU sees the whole context switch event > as being one big RCU read-side critical section. > > If interrupts were enabled at any point, RCU might see this as > two back-to-back read-side critical sections. This might be > OK for all I know, but it would open the door more widely to > heavyweight RCU processing at rcu_read_unlock() time. > > 5. Yes, context switch is a quiescent state, but that quiescent > state happens when rcu_note_context_switch() is called from > schedule, which is before the runqueue lock is acquired, and > thus before the beginning of the RCU read-side critical section > that covers the context switch. > > 6. In finish_lock_switch(), the rcu_read_acquire() restores > lockdep state to match that of the incoming task. > > But if the task is newly created, then its value of > current->rcu_read_lock_nesting will be zero, which will leave > the incoming task unprotected by RCU. I believe that you > need the (untested) patch below to make this work properly. > But this patch needs to be included in your set, as I don't > immediately see a reasonable way to make it work independently. > (Yes, we could have a special CONFIG var for the transition, > but it sounds a lot easier to just have it be part of your patch.) > > 7. Exit processing is done by the task itself before a last > call to schedule(). From what I can see, it should not matter > that the value of current->rcu_read_lock_nesting gets incremented > on this last TASK_DEAD call to schedule(). > > Or am I still confused? Hmmm... Something is confused. I get boot time hangs with the occasional stack overflow, whether or not I add my patch on top. Left to myself, I would try applying your patch incrementally, and also disabling irqs before rcu_read_lock() and enabling them after rcu_read_unlock(), as this prevents rcu_read_unlock() from ever getting to the rcu_read_unlock_special() slowpath. Thanx, Paul