From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752106AbdBTEVF (ORCPT );
	Sun, 19 Feb 2017 23:21:05 -0500
Received: from mail-wr0-f193.google.com ([209.85.128.193]:34867 "EHLO
	mail-wr0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751918AbdBTEVE (ORCPT );
	Sun, 19 Feb 2017 23:21:04 -0500
Date: Mon, 20 Feb 2017 05:20:52 +0100
From: Andrea Parri
To: Waiman Long
Cc: Peter Zijlstra , Ingo Molnar , Pan Xinhui , Boqun Feng ,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
Message-ID: <20170220042052.GA4529@andrea>
References: <1487364220-11641-1-git-send-email-longman@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1487364220-11641-1-git-send-email-longman@redhat.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> All the locking related cmpxchg's in the following functions are
> replaced with the _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
> 
> This change should help performance on architectures that use LL/SC.
> 
> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> enabled, the performance of a locking microbenchmark with and without
> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> were as follows:
> 
>   # of thread     w/o patch     with patch      % Change
>   -----------     ---------     ----------      --------
>        4        4053.3 Mop/s   4223.7 Mop/s      +4.2%
>        8        3310.4 Mop/s   3406.0 Mop/s      +2.9%
>       12        2576.4 Mop/s   2674.6 Mop/s      +3.8%
> 
> Signed-off-by: Waiman Long
> ---
> 
> v2->v3:
>  - Reduce scope by relaxing cmpxchg's in fast path only.
> 
> v1->v2:
>  - Add comments in changelog and code for the rationale of the change.
> 
>  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e6b2f7a..a59dc05 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>  	struct __qspinlock *l = (void *)lock;
>  
>  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>  		qstat_inc(qstat_pv_lock_stealing, true);
>  		return true;
>  	}
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>  
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock just to be sure that it will get it.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
>  	struct __qspinlock *l = (void *)lock;
>  
>  	return !READ_ONCE(l->locked) &&
> -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> -			== _Q_PENDING_VAL);
> +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +			_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  		 */
>  		old = val;
>  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> -		val = atomic_cmpxchg(&lock->val, old, new);
> +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
>  
>  		if (val == old)
>  			return 1;
> @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	 * observe its next->locked value and advance itself.
>  	 *
>  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +	 *
> +	 * We can't use the relaxed form of cmpxchg here as the loading of
> +	 * pn->state can happen before setting next->locked in some archs.
>  	 */
>  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)

Hi Waiman.

cmpxchg() does not guarantee the (here implied) smp_mb(), in general;
cf., e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets
"compiled" to something like:

  _loop:  ldxr
          eor
          cbnz    _exit
          stlxr
          cbnz    _loop
          dmb     ish
  _exit:

  Andrea

>  		return;
> -- 
> 1.8.3.1
> 