public inbox for linux-kernel@vger.kernel.org
From: Boqun Feng <boqun.feng@gmail.com>
To: Andrea Parri <parri.andrea@gmail.com>
Cc: Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Pan Xinhui <xinhui@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
Date: Mon, 20 Feb 2017 12:53:58 +0800	[thread overview]
Message-ID: <20170220045358.GI9178@tardis.cn.ibm.com> (raw)
In-Reply-To: <20170220042052.GA4529@andrea>


On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > All the locking related cmpxchg's in the following functions are
> > replaced with the _acquire variants:
> >  - pv_queued_spin_steal_lock()
> >  - trylock_clear_pending()
> > 
> > This change should help performance on architectures that use LL/SC.
> > 
> > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > enabled, the performance of a locking microbenchmark with and without
> > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > were as follows:
> > 
> >   # of thread     w/o patch    with patch      % Change
> >   -----------     ---------    ----------      --------
> >        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
> >        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
> >       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
> > 
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > ---
> > 
> >  v2->v3:
> >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > 
> >  v1->v2:
> >   - Add comments in changelog and code for the rationale of the change.
> > 
> >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > index e6b2f7a..a59dc05 100644
> > --- a/kernel/locking/qspinlock_paravirt.h
> > +++ b/kernel/locking/qspinlock_paravirt.h
> > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> >  	struct __qspinlock *l = (void *)lock;
> >  
> >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> >  		qstat_inc(qstat_pv_lock_stealing, true);
> >  		return true;
> >  	}
> > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> >  
> >  /*
> >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > - * just to be sure that it will get it.
> > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > + * lock just to be sure that it will get it.
> >   */
> >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> >  {
> >  	struct __qspinlock *l = (void *)lock;
> >  
> >  	return !READ_ONCE(l->locked) &&
> > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > -			== _Q_PENDING_VAL);
> > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> >  }
> >  #else /* _Q_PENDING_BITS == 8 */
> >  static __always_inline void set_pending(struct qspinlock *lock)
> > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> >  		 */
> >  		old = val;
> >  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > -		val = atomic_cmpxchg(&lock->val, old, new);
> > +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
> >  
> >  		if (val == old)
> >  			return 1;
> > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> >  	 * observe its next->locked value and advance itself.
> >  	 *
> >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > +	 *
> > +	 * We can't use a relaxed form of cmpxchg here as the load of
> > +	 * pn->state can happen before setting next->locked on some archs.
> >  	 */
> >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 
> Hi Waiman.
> 
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; cf.,
> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
> to something like:
> 
>     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> 

Yes, sorry for being late on this one.

So Waiman, the fact is that in this case, we want the following code
sequence:

	CPU 0					CPU 1
	=================			====================
	{pn->state = vcpu_running, node->locked = 0}

	smp_store_mb(&pn->state, vcpu_halted):
	  WRITE_ONCE(pn->state, vcpu_halted);
	  smp_mb();
	r1 = READ_ONCE(node->locked);
						arch_mcs_spin_unlock_contended();
						  WRITE_ONCE(node->locked, 1)

						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);

to never end up with:

	r1 == 0 && cmpxchg failing (i.e. the read part of the cmpxchg
	reading the value vcpu_running).

We have such a guarantee if cmpxchg() has an smp_mb() before its load
part, which is true on PPC. But semantically, cmpxchg() doesn't provide
any ordering guarantee if it fails, which is the case on ARM64, IIUC.
(Adding Will in Cc for his insight ;-)).

So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
replace cmpxchg() with smp_mb() + cmpxchg_relaxed().

Regards,
Boqun

>   Andrea
> 
> 
> >  		return;
> > -- 
> > 1.8.3.1
> > 



Thread overview: 7+ messages
2017-02-17 20:43 [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs Waiman Long
2017-02-20  4:20 ` Andrea Parri
2017-02-20  4:53   ` Boqun Feng [this message]
2017-02-20  4:58     ` Boqun Feng
2017-02-21 13:04       ` Will Deacon
2017-02-20 15:58     ` Waiman Long
2017-02-20 11:00   ` Peter Zijlstra
