From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 20 Feb 2017 12:58:39 +0800
From: Boqun Feng
To: Andrea Parri
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui,
	linux-kernel@vger.kernel.org, Will Deacon
Subject: Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
Message-ID: <20170220045839.GJ9178@tardis.cn.ibm.com>
References: <1487364220-11641-1-git-send-email-longman@redhat.com>
 <20170220042052.GA4529@andrea>
 <20170220045358.GI9178@tardis.cn.ibm.com>
In-Reply-To: <20170220045358.GI9178@tardis.cn.ibm.com>

(Really add Will this time ...)

On Mon, Feb 20, 2017 at 12:53:58PM +0800, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> > On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > > All the locking related cmpxchg's in the following functions are
> > > replaced with the _acquire variants:
> > >  - pv_queued_spin_steal_lock()
> > >  - trylock_clear_pending()
> > >
> > > This change should help performance on architectures that use LL/SC.
> > >
> > > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > > enabled, the performance of a locking microbenchmark with and without
> > > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > > were as follows:
> > >
> > >   # of thread     w/o patch      with patch      % Change
> > >   -----------     ---------      ----------      --------
> > >        4        4053.3 Mop/s    4223.7 Mop/s      +4.2%
> > >        8        3310.4 Mop/s    3406.0 Mop/s      +2.9%
> > >       12        2576.4 Mop/s    2674.6 Mop/s      +3.8%
> > >
> > > Signed-off-by: Waiman Long
> > > ---
> > >
> > >  v2->v3:
> > >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > >
> > >  v1->v2:
> > >   - Add comments in changelog and code for the rationale of the change.
> > >
> > >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > > index e6b2f7a..a59dc05 100644
> > > --- a/kernel/locking/qspinlock_paravirt.h
> > > +++ b/kernel/locking/qspinlock_paravirt.h
> > > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> > >  	struct __qspinlock *l = (void *)lock;
> > >
> > >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > >  		qstat_inc(qstat_pv_lock_stealing, true);
> > >  		return true;
> > >  	}
> > > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> > >
> > >  /*
> > >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > > - * just to be sure that it will get it.
> > > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > > + * lock just to be sure that it will get it.
> > >   */
> > >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  {
> > >  	struct __qspinlock *l = (void *)lock;
> > >
> > >  	return !READ_ONCE(l->locked) &&
> > > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > > -			== _Q_PENDING_VAL);
> > > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > > +			_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> > >  }
> > >  #else /* _Q_PENDING_BITS == 8 */
> > >  static __always_inline void set_pending(struct qspinlock *lock)
> > > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  	 */
> > >  	old = val;
> > >  	new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > > -	val = atomic_cmpxchg(&lock->val, old, new);
> > > +	val = atomic_cmpxchg_acquire(&lock->val, old, new);
> > >
> > >  	if (val == old)
> > >  		return 1;
> > > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > >  	 * observe its next->locked value and advance itself.
> > >  	 *
> > >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > +	 *
> > > +	 * We can't use the relaxed form of cmpxchg here, as the load of
> > > +	 * pn->state can happen before the setting of next->locked on some archs.
> > >  	 */
> > >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> >
> > Hi Waiman.
> >
> > cmpxchg() does not guarantee the (here implied) smp_mb(), in general; cf.,
> > e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
> > to something like:
> >
> >   _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> >
>
> Yes, sorry for being late on this one.
>
> So Waiman, the fact is that in this case, we want the following code
> sequence:
>
> 	CPU 0				CPU 1
> 	=================		====================
> 	{pn->state = vcpu_running, node->locked = 0}
>
> 	smp_store_mb(&pn->state, vcpu_halted):
> 	  WRITE_ONCE(pn->state, vcpu_halted);
> 	  smp_mb();
> 	r1 = READ_ONCE(node->locked);
> 					arch_mcs_spin_unlock_contended();
> 					  WRITE_ONCE(node->locked, 1)
>
> 					cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
>
> to never end up with:
>
> 	r1 == 0 && cmpxchg fails (i.e. the read part of cmpxchg reads the
> 	value vcpu_running).
>
> We can have such a guarantee if cmpxchg has an smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any ordering guarantee if it fails, which is the case on ARM64, IIUC. (Adding
> Will to Cc for his insight ;-)).
>
> So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
>
> Regards,
> Boqun
>
> > Andrea
> >
> >
> > >  		return;
> > > --
> > > 1.8.3.1
> > >