From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 20 Feb 2017 12:58:39 +0800
From: Boqun Feng
To: Andrea Parri
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui,
	linux-kernel@vger.kernel.org, Will Deacon
Subject: Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
Message-ID: <20170220045839.GJ9178@tardis.cn.ibm.com>
References: <1487364220-11641-1-git-send-email-longman@redhat.com>
 <20170220042052.GA4529@andrea>
 <20170220045358.GI9178@tardis.cn.ibm.com>
In-Reply-To: <20170220045358.GI9178@tardis.cn.ibm.com>

(Really add Will this time ...)

On Mon, Feb 20, 2017 at 12:53:58PM +0800, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> > On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > > All the locking related cmpxchg's in the following functions are
> > > replaced with the _acquire variants:
> > >  - pv_queued_spin_steal_lock()
> > >  - trylock_clear_pending()
> > >
> > > This change should help performance on architectures that use LL/SC.
> > >
> > > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > > enabled, the performance of a locking microbenchmark with and without
> > > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > > were as follows:
> > >
> > >   # of thread     w/o patch      with patch      % Change
> > >   -----------     ---------      ----------      --------
> > >        4        4053.3 Mop/s    4223.7 Mop/s      +4.2%
> > >        8        3310.4 Mop/s    3406.0 Mop/s      +2.9%
> > >       12        2576.4 Mop/s    2674.6 Mop/s      +3.8%
> > >
> > > Signed-off-by: Waiman Long
> > > ---
> > >
> > >  v2->v3:
> > >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > >
> > >  v1->v2:
> > >   - Add comments in changelog and code for the rationale of the change.
> > >
> > >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > > index e6b2f7a..a59dc05 100644
> > > --- a/kernel/locking/qspinlock_paravirt.h
> > > +++ b/kernel/locking/qspinlock_paravirt.h
> > > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> > >  	struct __qspinlock *l = (void *)lock;
> > >
> > >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > >  		qstat_inc(qstat_pv_lock_stealing, true);
> > >  		return true;
> > >  	}
> > > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> > >
> > >  /*
> > >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > > - * just to be sure that it will get it.
> > > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > > + * lock just to be sure that it will get it.
> > >   */
> > >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  {
> > >  	struct __qspinlock *l = (void *)lock;
> > >
> > >  	return !READ_ONCE(l->locked) &&
> > > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > > -			== _Q_PENDING_VAL);
> > > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > > +			_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> > >  }
> > >  #else /* _Q_PENDING_BITS == 8 */
> > >  static __always_inline void set_pending(struct qspinlock *lock)
> > > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  	 */
> > >  	old = val;
> > >  	new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > > -	val = atomic_cmpxchg(&lock->val, old, new);
> > > +	val = atomic_cmpxchg_acquire(&lock->val, old, new);
> > >
> > >  	if (val == old)
> > >  		return 1;
> > > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > >  	 * observe its next->locked value and advance itself.
> > >  	 *
> > >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > +	 *
> > > +	 * We can't use the relaxed form of cmpxchg here, as the load of
> > > +	 * pn->state can happen before the setting of next->locked on some archs.
> > >  	 */
> > >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> >
> > Hi Waiman.
> >
> > cmpxchg() does not guarantee the (here implied) smp_mb(), in general; cf.,
> > e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
> > to something like:
> >
> >   _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> >
>
> Yes, sorry for being late on this one.
>
> So Waiman, the fact is that in this case, we want the following code
> sequence:
>
> 	CPU 0				CPU 1
> 	=================		====================
> 	{pn->state = vcpu_running, node->locked = 0}
>
> 	smp_store_mb(&pn->state, vcpu_halted):
> 	  WRITE_ONCE(pn->state, vcpu_halted);
> 	  smp_mb();
> 	r1 = READ_ONCE(node->locked);
> 					arch_mcs_spin_unlock_contended();
> 					  WRITE_ONCE(node->locked, 1)
>
> 					cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
>
> to never end up with:
>
> 	r1 == 0 && cmpxchg fails (i.e. the read part of cmpxchg reads the
> 	value vcpu_running).
>
> We can have such a guarantee if cmpxchg has an smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any ordering guarantee if it fails, which is the case on ARM64, IIUC. (Adding
> Will to Cc for his insight ;-)).
>
> So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
>
> Regards,
> Boqun
>
> > Andrea
> >
> >
> > >  		return;
> > > --
> > > 1.8.3.1
> > >