* [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
@ 2017-02-17 20:43 Waiman Long
2017-02-20 4:20 ` Andrea Parri
0 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2017-02-17 20:43 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, Pan Xinhui, Boqun Feng, Waiman Long
All the locking-related cmpxchg's in the following functions are
replaced with their _acquire variants:
- pv_queued_spin_steal_lock()
- trylock_clear_pending()
This change should help performance on architectures that use LL/SC.
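
As an illustration of what is and is not relaxed here, consider the
stealing fast path (a sketch only; critical_section() is a placeholder,
not code from this patch):

        if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
            (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
                /*
                 * ACQUIRE: accesses inside the critical section cannot
                 * be reordered before the successful cmpxchg_acquire().
                 * What is dropped, relative to a full cmpxchg(), is the
                 * ordering of *prior* accesses against the cmpxchg,
                 * which the lock fast path does not rely on but which
                 * LL/SC architectures pay for with an extra heavyweight
                 * fence.
                 */
                critical_section();     /* placeholder */
        }
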
On a 2-core 16-thread Power8 system with pvqspinlock explicitly
enabled, the performance of a locking microbenchmark with and without
this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
was as follows:
  # of thread     w/o patch       with patch      % Change
  -----------     ---------       ----------      --------
       4         4053.3 Mop/s    4223.7 Mop/s      +4.2%
       8         3310.4 Mop/s    3406.0 Mop/s      +2.9%
      12         2576.4 Mop/s    2674.6 Mop/s      +3.8%
Signed-off-by: Waiman Long <longman@redhat.com>
---
v2->v3:
- Reduce scope by relaxing cmpxchg's in fast path only.
v1->v2:
- Add comments in changelog and code for the rationale of the change.
kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e6b2f7a..a59dc05 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
         struct __qspinlock *l = (void *)lock;
 
         if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
-            (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+            (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
                 qstat_inc(qstat_pv_lock_stealing, true);
                 return true;
         }
@@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
 
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
- * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
- * just to be sure that it will get it.
+ * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
+ * lock just to be sure that it will get it.
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
         struct __qspinlock *l = (void *)lock;
 
         return !READ_ONCE(l->locked) &&
-                (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
-                 == _Q_PENDING_VAL);
+                (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+                                 _Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
 static __always_inline void set_pending(struct qspinlock *lock)
@@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
                  */
                 old = val;
                 new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-                val = atomic_cmpxchg(&lock->val, old, new);
+                val = atomic_cmpxchg_acquire(&lock->val, old, new);
 
                 if (val == old)
                         return 1;
@@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
          * observe its next->locked value and advance itself.
          *
          * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+         *
+         * We can't use a relaxed form of cmpxchg here as the load of
+         * pn->state can happen before setting next->locked on some archs.
          */
         if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                 return;
--
1.8.3.1

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-17 20:43 [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs Waiman Long
@ 2017-02-20 4:20 ` Andrea Parri
  2017-02-20 4:53   ` Boqun Feng
  2017-02-20 11:00   ` Peter Zijlstra
  0 siblings, 2 replies; 7+ messages in thread
From: Andrea Parri @ 2017-02-20 4:20 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, Pan Xinhui, Boqun Feng, linux-kernel

On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> All the locking-related cmpxchg's in the following functions are
> replaced with their _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
[...]
> @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>           * observe its next->locked value and advance itself.
>           *
>           * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +         *
> +         * We can't use a relaxed form of cmpxchg here as the load of
> +         * pn->state can happen before setting next->locked on some archs.
>           */
>          if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)

Hi Waiman.

cmpxchg() does not guarantee the (here implied) smp_mb(), in general;
c.f., e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets
"compiled" to something like:

  _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:

  Andrea

>                  return;
> --
> 1.8.3.1

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20 4:20 ` Andrea Parri
@ 2017-02-20 4:53   ` Boqun Feng
  2017-02-20 4:58     ` Boqun Feng
  2017-02-20 15:58     ` Waiman Long
  1 sibling, 2 replies; 7+ messages in thread
From: Boqun Feng @ 2017-02-20 4:53 UTC (permalink / raw)
To: Andrea Parri
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui, linux-kernel

On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > All the locking-related cmpxchg's in the following functions are
> > replaced with their _acquire variants:
> >  - pv_queued_spin_steal_lock()
> >  - trylock_clear_pending()
[...]
> >          if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>
> Hi Waiman.
>
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general;
> c.f., e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets
> "compiled" to something like:
>
>   _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
>

Yes, sorry for being late on this one.

So Waiman, the fact is that in this case we want the following code
sequence:

        CPU 0                                   CPU 1
        =================                       ====================
        {pn->state = vcpu_running, node->locked = 0}

        smp_store_mb(&pn->state, vcpu_halted):
          WRITE_ONCE(pn->state, vcpu_halted);
          smp_mb();
        r1 = READ_ONCE(node->locked);
                                                arch_mcs_spin_unlock_contended();
                                                  WRITE_ONCE(node->locked, 1)

                                                cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);

to never end up in:

        r1 == 0 && cmpxchg fails (i.e. the read part of cmpxchg reads the
        value vcpu_running).

We can have such a guarantee if cmpxchg has an smp_mb() before its load
part, which is true for PPC. But semantically, cmpxchg() doesn't provide
any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
in Cc for his insight ;-)).

So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
replace cmpxchg() with smp_mb() + cmpxchg_relaxed().

Regards,
Boqun
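
A minimal sketch of that suggested alternative, as it could look in
pv_kick_node() (illustrative only; this is not code posted in the thread,
and the comment wording is ours):

        /*
         * Hypothetical variant: an explicit full barrier plus a relaxed
         * cmpxchg, so that the earlier WRITE_ONCE(next->locked, 1) is
         * ordered before the load part of the cmpxchg even when the
         * cmpxchg fails.
         */
        smp_mb();
        if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                return;
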

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20 4:53 ` Boqun Feng
@ 2017-02-20 4:58   ` Boqun Feng
  2017-02-21 13:04     ` Will Deacon
  0 siblings, 1 reply; 7+ messages in thread
From: Boqun Feng @ 2017-02-20 4:58 UTC (permalink / raw)
To: Andrea Parri
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui, linux-kernel, Will Deacon

(Really add Will this time ...)

On Mon, Feb 20, 2017 at 12:53:58PM +0800, Boqun Feng wrote:
[...]
> We can have such a guarantee if cmpxchg has an smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> in Cc for his insight ;-)).
>
> So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
>
> Regards,
> Boqun

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20 4:58 ` Boqun Feng
@ 2017-02-21 13:04   ` Will Deacon
  0 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2017-02-21 13:04 UTC (permalink / raw)
To: Boqun Feng
Cc: Andrea Parri, Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui, linux-kernel

On Mon, Feb 20, 2017 at 12:58:39PM +0800, Boqun Feng wrote:
> > So Waiman, the fact is that in this case we want the following code
> > sequence:
[...]
> > We can have such a guarantee if cmpxchg has an smp_mb() before its load
> > part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> > any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> > in Cc for his insight ;-)).

I think you're right. The write to node->locked on CPU1 is not required
to be ordered before the load part of the failing cmpxchg.

> > So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
> > replace cmpxchg() with smp_mb() + cmpxchg_relaxed().

Perversely, we could actually get away with cmpxchg_acquire on arm64
because arch_mcs_spin_unlock_contended is smp_store_release and we order
release -> acquire in the architecture. But that just brings up the age
old unlock/lock discussion again...

Will
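
Sketched from the kicker's side, the pairing Will describes looks roughly
like this (arch_mcs_spin_unlock_contended() is an smp_store_release(); the
_acquire form below is the hypothetical arm64-only relaxation under
discussion, not current code):

        /* queue head, after taking the lock: */
        arch_mcs_spin_unlock_contended(&next->locked);  /* store-release */

        /* later, in pv_kick_node(): on arm64 a load-acquire is not
         * reordered before an earlier store-release, so _acquire would
         * suffice there; relying on that is the unlock/lock ordering
         * question Will refers to. */
        if (cmpxchg_acquire(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                return;
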

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20 4:53 ` Boqun Feng
  2017-02-20 4:58   ` Boqun Feng
@ 2017-02-20 15:58   ` Waiman Long
  1 sibling, 0 replies; 7+ messages in thread
From: Waiman Long @ 2017-02-20 15:58 UTC (permalink / raw)
To: Boqun Feng, Andrea Parri
Cc: Peter Zijlstra, Ingo Molnar, Pan Xinhui, linux-kernel

On 02/19/2017 11:53 PM, Boqun Feng wrote:
[...]
> We can have such a guarantee if cmpxchg has an smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> in Cc for his insight ;-)).
>
> So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().

This is in the pvqspinlock code; the native qspinlock will not have this
problem. So we will need to make sure that the write to node->locked
always precedes the read of pn->state whether the cmpxchg is successful
or not.

We are not going to replace cmpxchg() with smp_mb() + cmpxchg_relaxed()
as that would impact x86 performance. Perhaps we could provide another
cmpxchg variant that provides that memory barrier guarantee for all
archs. In the meantime, I am going to update the patch to document this
limitation so that we can do something about it when archs like ARM64
want pvqspinlock support.

Cheers,
Longman
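
One way such a variant could be spelled, purely as a sketch (this helper
does not exist in the kernel; the name and the generic fallback below are
hypothetical):

        /*
         * Hypothetical: a cmpxchg whose load part is ordered after all
         * prior accesses even when the exchange fails.  An architecture
         * whose plain cmpxchg() already guarantees that (e.g. x86) could
         * define it as plain cmpxchg() and avoid the extra barrier.
         */
        #ifndef cmpxchg_order_prior
        #define cmpxchg_order_prior(ptr, o, n)          \
        ({                                              \
                smp_mb();                               \
                cmpxchg_relaxed((ptr), (o), (n));       \
        })
        #endif
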

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20 4:20 ` Andrea Parri
  2017-02-20 4:53   ` Boqun Feng
@ 2017-02-20 11:00   ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2017-02-20 11:00 UTC (permalink / raw)
To: Andrea Parri
Cc: Waiman Long, Ingo Molnar, Pan Xinhui, Boqun Feng, linux-kernel, Will Deacon

On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> >           * observe its next->locked value and advance itself.
> >           *
> >           * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > +         *
> > +         * We can't use a relaxed form of cmpxchg here as the load of
> > +         * pn->state can happen before setting next->locked on some archs.
> >           */
> >          if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>
> Hi Waiman.
>
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general;
> c.f., e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets
> "compiled" to something like:
>
>   _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:

So cmpxchg() should have an effective smp_mb() before and after the
operation, _except_ in the case of failure, in which case it need not
imply any barrier.

And since a successful cmpxchg does the store, which is a store-release
in that arm64 sequence, that provides the ordering against all prior
state. The dmb ish provides the required barrier after the operation.