All of lore.kernel.org
 help / color / mirror / Atom feed
From: Waiman Long <waiman.long@hpe.com>
To: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, <linux-kernel@vger.kernel.org>,
	Pan Xinhui <xinhui@linux.vnet.ibm.com>,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>
Subject: Re: [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem
Date: Fri, 27 May 2016 15:28:39 -0400	[thread overview]
Message-ID: <57489FE7.8080402@hpe.com> (raw)
In-Reply-To: <20160527074331.GB8096@insomnia>

On 05/27/2016 03:43 AM, Boqun Feng wrote:
> Hi Waiman,
>
> On Thu, May 26, 2016 at 02:21:57PM -0400, Waiman Long wrote:
>> Currently, calling pv_hash() and setting _Q_SLOW_VAL is only
>> done once for any pv_node. It is either in pv_kick_node() or in
>> pv_wait_head_or_lock(). Because of lock stealing, a pv_kick'ed node is
>> not guaranteed to get the lock before the spinning threshold expires
>> and has to call pv_wait() again. As a result, the new lock holder
>> won't see _Q_SLOW_VAL and so won't wake up the sleeping vCPU.
>>
>> This patch fixes this missed PV wakeup problem by allowing multiple
>> _Q_SLOW_VAL settings within pv_wait_head_or_lock() and matching each
>> successful setting of _Q_SLOW_VAL to a pv_hash() call.
>>
>> Reported-by: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/locking/qspinlock_paravirt.h |   48 ++++++++++++++++++++++------------
>>   1 files changed, 31 insertions(+), 17 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>> index 21ede57..452d06d 100644
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -369,12 +369,16 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>   	/*
>>   	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
>>   	 *
>> -	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
>> -	 * the hash table later on at unlock time, no atomic instruction is
>> -	 * needed.
>> +	 * It is very unlikely that this will race with the _Q_SLOW_VAL setting
>> +	 * in pv_wait_head_or_lock(). However, we use cmpxchg() here to be
>> +	 * sure that we won't do a double pv_hash().
>> +	 *
>> +	 * As it is the lock holder, it won't race with
>> +	 * __pv_queued_spin_unlock().
>>   	 */
>> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>> -	(void)pv_hash(lock, pn);
>> +	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)
>> +			== _Q_LOCKED_VAL))
>> +		pv_hash(lock, pn);
>>   }
>>
>>   /*
>> @@ -389,18 +393,10 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>   {
>>   	struct pv_node *pn = (struct pv_node *)node;
>>   	struct __qspinlock *l = (void *)lock;
>> -	struct qspinlock **lp = NULL;
>>   	int waitcnt = 0;
>>   	int loop;
>>
>>   	/*
>> -	 * If pv_kick_node() already advanced our state, we don't need to
>> -	 * insert ourselves into the hash table anymore.
>> -	 */
>> -	if (READ_ONCE(pn->state) == vcpu_hashed)
>> -		lp = (struct qspinlock **)1;
>> -
>> -	/*
>>   	 * Tracking # of slowpath locking operations
>>   	 */
>>   	qstat_inc(qstat_pv_lock_slowpath, true);
>> @@ -422,11 +418,19 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>   				goto gotlock;
>>   			cpu_relax();
>>   		}
>> -		clear_pending(lock);
>>
>> +		/*
>> +		 * Make sure the lock value check below is executed after
>> +		 * all the previous loads.
>> +		 */
>> +		smp_rmb();
>>
>> -		if (!lp) { /* ONCE */
>> -			lp = pv_hash(lock, pn);
>> +		/*
>> +		 * Set _Q_SLOW_VAL and hash the PV node, if necessary.
>> +		 */
>> +		if (READ_ONCE(l->locked) != _Q_SLOW_VAL) {
>> +			struct qspinlock **lp = pv_hash(lock, pn);
>> +			u8 locked;
>>
> Just out of curiosity, what if the following sequence happens:
>
> 	CPU 0			CPU 1
> 	=================	====================
> 	spin_lock():		spin_lock():
> 	  pv_kick_node(): 	  pv_wait_head_or_lock():
> 	  			  if (READ_ONCE(l->locked) != _Q_SLOW_VAL) { // True
> 				    pv_hash();
>
> 	    cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
> 	    pv_hash();
> 				    locked = xchg(&l->locked, _Q_SLOW_VAL);
> 	do_something();		    if(...) {
> 				    }
> 	spin_unlock():
> 	  pv_unhash();
> 				    else if (unlikely(locked == _Q_SLOW_VAL)) {
> 				    	WRITE_ONCE(*lp, NULL);
>
> because pv_hash() on CPU 1 called before the one on CPU 0, therefore
> the hash entry from CPU 1 is preceding the hash entry from CPU 0 in the
> hash table, so that when pv_unhash() called, hash entry from CPU 1 will
> be unhashed, however, the WRITE_ONCE(*lp, NULL) on CPU 1 will also
> unhash the same entry, leaving that hash entry from CPU 0 not unhashed.
>
> This could result in several interesting problems, right? ;-)

This is a very unlikely scenario, but I agree that it can happen. I 
think the only way to close this loophole is to make pv_unhash() atomic. 
By using pv_unhash() instead of WRITE_ONCE(), it should fix this issue. 
I will send out an updated patch to fix that.

Cheers,
Longman

  reply	other threads:[~2016-05-27 19:28 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-26 18:21 [PATCH 0/2] locking/pvqspinlock: Fix missed PV wakeup & support PPC Waiman Long
2016-05-26 18:21 ` [PATCH 1/2] locking/pvqspinlock: Fix missed PV wakeup problem Waiman Long
2016-05-27  7:43   ` Boqun Feng
2016-05-27 19:28     ` Waiman Long [this message]
2016-05-26 18:21 ` [PATCH 2/2] locking/pvqspinlock: Add lock holder CPU argument to pv_wait() Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57489FE7.8080402@hpe.com \
    --to=waiman.long@hpe.com \
    --cc=boqun.feng@gmail.com \
    --cc=doug.hatch@hpe.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=scott.norton@hpe.com \
    --cc=xinhui@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.