Date: Thu, 21 Jul 2016 14:40:22 +0800
From: xinhui <xinhui.pan@linux.vnet.ibm.com>
To: Peter Zijlstra, Pan Xinhui
CC: Waiman Long, Ingo Molnar, linux-kernel@vger.kernel.org, Boqun Feng,
 Scott J Norton, Douglas Hatch
Subject: Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem
References: <1464713631-1066-1-git-send-email-Waiman.Long@hpe.com>
 <1464713631-1066-3-git-send-email-Waiman.Long@hpe.com>
 <20160715084732.GF30921@twins.programming.kicks-ass.net>
 <3c5d5c29-7956-572f-2638-b85299c72432@linux.vnet.ibm.com>
 <20160715100703.GQ30154@twins.programming.kicks-ass.net>
 <20160715163556.GA7141@twins.programming.kicks-ass.net>
In-Reply-To: <20160715163556.GA7141@twins.programming.kicks-ass.net>
Message-Id: <57906E56.5010906@linux.vnet.ibm.com>

On 2016/07/16 00:35, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote:
>>> So if we are kicked by the unlock_slowpath, and the lock is stolen by
>>> someone else, we need to hash its node again and set l->locked to
>>> _Q_SLOW_VAL, then enter pv_wait.
>>
>> Right, let me go think about this a bit.
>
> Urgh, brain hurt.
>
> So I _think_ the below does for it but I could easily have missed yet
> another case.
>
> Waiman's patch has the problem that it can have two pv_hash() calls for
> the same lock in progress and I'm thinking that means we can hit the
> BUG() in pv_hash() due to that.
>
> If we can't, it still has a problem because it's not telling us either.
>
>
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -20,7 +20,8 @@
>   * native_queued_spin_unlock().
>   */
>
> -#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
> +#define _Q_HASH_VAL	(3U << _Q_LOCKED_OFFSET)
> +#define _Q_SLOW_VAL	(7U << _Q_LOCKED_OFFSET)
>
>  /*
>   * Queue Node Adaptive Spinning
> @@ -36,14 +37,11 @@
>   */
>  #define PV_PREV_CHECK_MASK	0xff
>
> -/*
> - * Queue node uses: vcpu_running & vcpu_halted.
> - * Queue head uses: vcpu_running & vcpu_hashed.
> - */
>  enum vcpu_state {
> -	vcpu_running = 0,
> -	vcpu_halted,		/* Used only in pv_wait_node */
> -	vcpu_hashed,		/* = pv_hash'ed + vcpu_halted */
> +	vcpu_node_running = 0,
> +	vcpu_node_halted,
> +	vcpu_head_running,
> +	vcpu_head_halted,
>  };
>
>  struct pv_node {
> @@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int
>  	if ((loop & PV_PREV_CHECK_MASK) != 0)
>  		return false;
>
> -	return READ_ONCE(prev->state) != vcpu_running;
> +	return READ_ONCE(prev->state) & 1;
>  }
>
>  /*
> @@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin
>  	 *
>  	 * Matches the cmpxchg() from pv_kick_node().
>  	 */
> -	smp_store_mb(pn->state, vcpu_halted);
> +	smp_store_mb(pn->state, vcpu_node_halted);
>
> -	if (!READ_ONCE(node->locked)) {
> -		qstat_inc(qstat_pv_wait_node, true);
> -		qstat_inc(qstat_pv_wait_early, wait_early);
> -		pv_wait(&pn->state, vcpu_halted);
> -	}
> +	if (READ_ONCE(node->locked))
> +		return;
> +
> +	qstat_inc(qstat_pv_wait_node, true);
> +	qstat_inc(qstat_pv_wait_early, wait_early);
> +	pv_wait(&pn->state, vcpu_node_halted);
>
>  	/*
> -	 * If pv_kick_node() changed us to vcpu_hashed, retain that
> -	 * value so that pv_wait_head_or_lock() knows to not also try
> -	 * to hash this lock.
> +	 * If pv_kick_node() advanced us, retain that state.
>  	 */
> -	cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> +	cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running);
>
>  	/*
>  	 * If the locked flag is still not set after wakeup, it is a
> @@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc
>  	 *
>  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>  	 */
> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> +	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted)

Well, I think it should be

	if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_halted) != vcpu_node_halted)

Yes, the node has become the queue *head*, but its running state does not
change. The others look OK to me.

>  		return;
>
>  	/*
> -	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
> -	 *
> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
> -	 * the hash table later on at unlock time, no atomic instruction is
> -	 * needed.
> +	 * See pv_wait_head_or_lock(). We have to hash and force the unlock
> +	 * into the slow path to deliver the actual kick for waking.
>  	 */
> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> -	(void)pv_hash(lock, pn);
> +	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) {
> +		(void)pv_hash(lock, pn);
> +		smp_store_release(&l->locked, _Q_SLOW_VAL);
> +	}
>  }
>
>  /*
> @@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
>  	struct __qspinlock *l = (void *)lock;
> -	struct qspinlock **lp = NULL;
>  	int waitcnt = 0;
>  	int loop;
>
>  	/*
> -	 * If pv_kick_node() already advanced our state, we don't need to
> -	 * insert ourselves into the hash table anymore.
> -	 */
> -	if (READ_ONCE(pn->state) == vcpu_hashed)
> -		lp = (struct qspinlock **)1;
> -
> -	/*
>  	 * Tracking # of slowpath locking operations
>  	 */
>  	qstat_inc(qstat_pv_lock_slowpath, true);
>
>  	for (;; waitcnt++) {
> +		u8 locked;
> +
>  		/*
>  		 * Set correct vCPU state to be used by queue node wait-early
>  		 * mechanism.
>  		 */
> -		WRITE_ONCE(pn->state, vcpu_running);
> +		WRITE_ONCE(pn->state, vcpu_head_running);
>
>  		/*
>  		 * Set the pending bit in the active lock spinning loop to
> @@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l
>  		}
>  		clear_pending(lock);
>
> +		/*
> +		 * We want to go sleep; ensure we're hashed so that
> +		 * __pv_queued_spin_unlock_slow() can find us for a wakeup.
> +		 */
> +		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL);
> +		switch (locked) {
> +		/*
> +		 * We're not hashed yet, either we're fresh from pv_wait_node()
> +		 * or __pv_queued_spin_unlock_slow() unhashed us but we lost
> +		 * the trylock to a steal and have to re-hash.
> +		 */
> +		case _Q_LOCKED_VAL:
> +			(void)pv_hash(lock, pn);
> +			smp_store_release(&l->locked, _Q_SLOW_VAL);
> +			break;
>
> -		if (!lp) { /* ONCE */
> -			lp = pv_hash(lock, pn);
> +		/*
> +		 * pv_kick_node() is hashing us, wait for it.
> +		 */
> +		case _Q_HASH_VAL:
> +			while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +				cpu_relax();
> +			break;
>
> -			/*
> -			 * We must hash before setting _Q_SLOW_VAL, such that
> -			 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
> -			 * we'll be sure to be able to observe our hash entry.
> -			 *
> -			 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
> -			 *       MB                           RMB
> -			 *   [RmW] l->locked = _Q_SLOW_VAL    [L] <unhash>
> -			 *
> -			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
> -			 */
> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
> -				/*
> -				 * The lock was free and now we own the lock.
> -				 * Change the lock value back to _Q_LOCKED_VAL
> -				 * and unhash the table.
> -				 */
> -				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
> -				WRITE_ONCE(*lp, NULL);
> -				goto gotlock;
> -			}
> +		/*
> +		 * Ooh, unlocked, try and grab it.
> +		 */
> +		case 0:
> +			continue;
>  		}
> -		WRITE_ONCE(pn->state, vcpu_hashed);
> +
> +		WRITE_ONCE(pn->state, vcpu_head_halted);
>  		qstat_inc(qstat_pv_wait_head, true);
>  		qstat_inc(qstat_pv_wait_again, waitcnt);
>  		pv_wait(&l->locked, _Q_SLOW_VAL);
> @@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct
>  	struct __qspinlock *l = (void *)lock;
>  	struct pv_node *node;
>
> -	if (unlikely(locked != _Q_SLOW_VAL)) {
> +	if (unlikely(locked != _Q_SLOW_VAL && locked != _Q_HASH_VAL)) {
>  		WARN(!debug_locks_silent,
>  		     "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
>  		     (unsigned long)lock, atomic_read(&lock->val));
> @@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct
>  	}
>
>  	/*
> -	 * A failed cmpxchg doesn't provide any memory-ordering guarantees,
> -	 * so we need a barrier to order the read of the node data in
> -	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
> -	 *
> -	 * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
> +	 * Wait until the hash-bucket is complete.
>  	 */
> -	smp_rmb();
> +	while (READ_ONCE(l->locked) == _Q_HASH_VAL)
> +		cpu_relax();
>
>  	/*
> -	 * Since the above failed to release, this must be the SLOW path.
> -	 * Therefore start by looking up the blocked node and unhashing it.
> +	 * Must first observe _Q_SLOW_VAL in order to observe
> +	 * consistent hash bucket.
>  	 */
> +	smp_rmb();
> +
>  	node = pv_unhash(lock);
>
>  	/*
>
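To make the new l->locked handshake easier to follow, here is a minimal
user-space sketch of it. This is only an illustration under my own
assumptions, not the kernel code: it uses C11 <stdatomic.h> plus pthreads,
renames the _Q_*_VAL constants, and fake_hash()/fake_unhash() are made-up
stand-ins for pv_hash()/pv_unhash(). It only models the lock word going
_Q_LOCKED_VAL -> _Q_HASH_VAL -> _Q_SLOW_VAL so that the unlock slow path
never unhashes a half-written bucket:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

/* User-space stand-ins for the kernel's _Q_*_VAL values. */
#define Q_UNLOCKED	0
#define Q_LOCKED_VAL	1	/* lock held				*/
#define Q_HASH_VAL	3	/* hash entry being written		*/
#define Q_SLOW_VAL	7	/* hash entry visible, kick needed	*/

static _Atomic unsigned char locked = Q_LOCKED_VAL;	/* the lock word, currently held */
static _Atomic int hash_bucket;				/* toy stand-in for the PV hash table */

static void fake_hash(void)   { atomic_store(&hash_bucket, 1); }
static int  fake_unhash(void) { return atomic_exchange(&hash_bucket, 0); }

/* Queue head going to sleep: models the cmpxchg() switch in pv_wait_head_or_lock(). */
static void *waiter(void *arg)
{
	unsigned char old = Q_LOCKED_VAL;

	if (atomic_compare_exchange_strong(&locked, &old, Q_HASH_VAL)) {
		/* LOCKED -> HASH won us the right to publish the hash entry ... */
		fake_hash();
		/* ... and HASH -> SLOW publishes it (the smp_store_release() in the patch). */
		atomic_store_explicit(&locked, Q_SLOW_VAL, memory_order_release);
	} else if (old == Q_HASH_VAL) {
		/* someone else (pv_kick_node()) is hashing us; wait for it to finish */
		while (atomic_load(&locked) == Q_HASH_VAL)
			sched_yield();
	}
	/* at this point the kernel code would pv_wait() on Q_SLOW_VAL */
	return NULL;
}

/* Unlock slow path: never unhash while the bucket is still being written. */
static void *unlocker(void *arg)
{
	/* wait until the hash entry is complete; acquire pairs with the release in waiter() */
	while (atomic_load_explicit(&locked, memory_order_acquire) != Q_SLOW_VAL)
		sched_yield();

	printf("bucket found: %d, unhash and kick the waiter\n", fake_unhash());
	atomic_store(&locked, Q_UNLOCKED);
	return NULL;
}

int main(void)
{
	pthread_t w, u;

	pthread_create(&w, NULL, waiter, NULL);
	pthread_create(&u, NULL, unlocker, NULL);
	pthread_join(w, NULL);
	pthread_join(u, NULL);
	return 0;
}

The point is the middle state: as long as the unlock slow path spins while
it still sees _Q_HASH_VAL, pv_unhash() can never race with a pv_hash() that
is still in flight, which (as I read it) is the window the patch above is
closing.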