From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754941AbbCFSDl (ORCPT ); Fri, 6 Mar 2015 13:03:41 -0500
Received: from userp1040.oracle.com ([156.151.31.81]:49886 "EHLO
	userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753675AbbCFSDj (ORCPT );
	Fri, 6 Mar 2015 13:03:39 -0500
Message-ID: <54F9EBCA.1060300@oracle.com>
Date: Fri, 06 Mar 2015 13:02:50 -0500
From: Sasha Levin
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Davidlohr Bueso, Ingo Molnar
CC: Peter Zijlstra, LKML, Dave Jones, jason.low2@hp.com, Linus Torvalds
Subject: Re: sched: softlockups in multi_cpu_stop
References: <54F41516.6060608@oracle.com> <54F98F1F.3080107@oracle.com>
	<20150306123233.GA9972@gmail.com> <1425662342.19505.41.camel@stgolabs.net>
In-Reply-To: <1425662342.19505.41.camel@stgolabs.net>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Source-IP: acsinet22.oracle.com [141.146.126.238]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/06/2015 12:19 PM, Davidlohr Bueso wrote:
>> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>> > index 1c0d11e8ce34..e4ad019e23f5 100644
>> > --- a/kernel/locking/rwsem-xadd.c
>> > +++ b/kernel/locking/rwsem-xadd.c
>> > @@ -298,23 +298,30 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
>> >  static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>> >  {
>> >  	struct task_struct *owner;
>> > -	bool on_cpu = false;
>> > +	bool ret = true;
>> >
>> >  	if (need_resched())
>> >  		return false;
>> >
>> >  	rcu_read_lock();
>> >  	owner = ACCESS_ONCE(sem->owner);
>> > -	if (owner)
>> > -		on_cpu = owner->on_cpu;
>> > -	rcu_read_unlock();
>> > +	if (!owner) {
>> > +		long count = ACCESS_ONCE(sem->count);
>> > +		/*
>> > +		 * If sem->owner is not set, yet we have just recently entered the
>> > +		 * slowpath with the lock being active, then there is a possibility
>> > +		 * reader(s) may have the lock. To be safe, bail spinning in these
>> > +		 * situations.
>> > +		 */
>> > +		if (count & RWSEM_ACTIVE_MASK)
>> > +			ret = false;
>> > +		goto done;
> Hmmm so the lockup would be due to this (when owner is non-nil the patch
> has no effect), telling users to spin instead of sleep -- _except_ for
> this condition. And when spinning we're always checking for need_resched
> to be safe. So even if this function was completely bogus, we'd end up
> needlessly spinning but I'm surprised about the lockup. Maybe coffee
> will make things clearer.

There's always the possibility that bisect went wrong. I did it twice, but
since I don't have a sure way of reproducing it I was basing my good/bad
decisions on whether I saw it within a reasonable amount of time.

I can go redo that again if you suspect that that commit is not the cause.


Thanks,
Sasha
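
For reference, the decision discussed above can be modelled in plain userspace C. This is only a minimal sketch under stated assumptions, not the kernel code: `struct rwsem_model`, `struct task_model`, `can_spin_on_owner()` and the `ACTIVE_MASK` value are hypothetical stand-ins, the RCU and `ACCESS_ONCE()` handling is left out, and the non-NULL owner branch simply follows the removed `owner->on_cpu` lines together with the remark that the patch has no effect when an owner is set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mask value for illustration; the kernel's is arch-dependent. */
#define ACTIVE_MASK 0xffffL

/* Stand-in for the one task_struct field the quoted hunk reads. */
struct task_model {
	bool on_cpu;
};

/* Stand-in for the two rw_semaphore fields the quoted hunk reads. */
struct rwsem_model {
	long count;
	struct task_model *owner;
};

/*
 * Model of the patched decision: return true if the caller should spin.
 * need_resched is passed in because there is no scheduler to ask here.
 */
static bool can_spin_on_owner(const struct rwsem_model *sem, bool need_resched)
{
	if (need_resched)
		return false;

	if (!sem->owner) {
		/* No owner recorded but the lock is active: possibly readers, so do not spin. */
		return !(sem->count & ACTIVE_MASK);
	}

	/* Owner recorded: spin only while it is running on a CPU (pre-patch behaviour). */
	return sem->owner->on_cpu;
}
```

With no owner and a zero active count the model says spin; with no owner and any active bit set it says sleep, which is the one case where the patched function tells the caller not to spin.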