[PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance
@ 2015-10-15 22:51 Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Waiman Long @ 2015-10-15 22:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

v7->v8:
 - Annotated the use of each _acquire/_release variants in qspinlock.c.
 - Used the available pending bit in the lock stealing patch to disable
   lock stealing when the queue head vCPU is actively spinning on the
   lock to avoid lock starvation.
 - Restructured the lock stealing patch to reduce code duplication.
 - Verified that the waitcnt processing will be compiled away if
   QUEUED_LOCK_STAT isn't enabled.

v6->v7:
 - Removed arch/x86/include/asm/qspinlock.h from patch 1.
 - Removed the unconditional PV kick patch as it has been merged
   into tip.
 - Changed the pvstat_inc() API to add a new condition parameter.
 - Added comments and rearrange code in patch 4 to clarify where
   lock stealing happened.
 - In patch 5, removed the check for pv_wait count when deciding when
   to wait early.
 - Updated copyrights and email address.

v5->v6:
 - Added a new patch 1 to relax the cmpxchg and xchg operations in
   the native code path to reduce performance overhead on non-x86
   architectures.
 - Updated the unconditional PV kick patch as suggested by PeterZ.
 - Added a new patch to allow one lock stealing attempt at slowpath
   entry point to reduce performance penalty due to lock waiter
   preemption.
 - Removed the pending bit and kick-ahead patches as they didn't show
   any noticeable performance improvement on top of the lock stealing
   patch.
 - Simplified the adaptive spinning patch as the lock stealing patch
   allows more aggressive pv_wait() without much performance penalty
   in non-overcommitted VMs.

v4->v5:
 - Rebased the patch to the latest tip tree.
 - Corrected the comments and commit log for patch 1.
 - Removed the v4 patch 5 as PV kick deferment is no longer needed with
   the new tip tree.
 - Simplified the adaptive spinning patch (patch 6) & improve its
   performance a bit further.
 - Re-ran the benchmark test with the new patch.

v3->v4:
 - Patch 1: add comment about possible racing condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Move PV unlock optimization patch forward to patch 4 & rerun
   performance test.

v2->v3:
 - Moved deferred kicking enablement patch forward & move back
   the kick-ahead patch to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted back to use state as a tri-state variable instead of
   adding an additional bistate variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Add a new patch to optimize PV unlock code path performance.

v1->v2:
 - Take out the queued unfair lock patches
 - Add a patch to simplify the PV unlock code
 - Move pending bit and statistics collection patches to the front
 - Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
   when appropriate.
 - Change the wait-early patch to use adaptive spinning to better
   balance the difference effect on normal and over-committed guests.
 - Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both regular and
over-commmitted VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.

Patch 1 relaxes the memory order restriction of atomic operations by
using less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This will reduce performance overhead when ported to other
non-x86 architectures.

Patch 2 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 3 allows the collection of various count data that are useful
to see what is happening in the system. They do add a bit of overhead
when enabled slowing performance a tiny bit.

Patch 4 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.

Patch 5 enables adaptive spinning in the queue nodes. This patch
leads to further performance improvement in over-committed guest,
though it is not as big as the previous patch.

Waiman Long (5):
  locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock: Allow 1 lock stealing attempt
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    9 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++++
 include/asm-generic/qspinlock.h           |    9 +-
 kernel/locking/qspinlock.c                |   63 ++++--
 kernel/locking/qspinlock_paravirt.h       |  375 ++++++++++++++++++++++++++---
 5 files changed, 456 insertions(+), 59 deletions(-)


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  2015-10-15 22:51 [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
@ 2015-10-15 22:51 ` Waiman Long
  2015-10-20  3:17   ` Boqun Feng
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2015-10-15 22:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/asm-generic/qspinlock.h |    9 +++++----
 kernel/locking/qspinlock.c      |   29 ++++++++++++++++++++++++-----
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e2aadbc..39e1cb2 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,8 +12,9 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
  */
 #ifndef __ASM_GENERIC_QSPINLOCK_H
 #define __ASM_GENERIC_QSPINLOCK_H
@@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 static __always_inline int queued_spin_trylock(struct qspinlock *lock)
 {
 	if (!atomic_read(&lock->val) &&
-	   (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
 		return 1;
 	return 0;
 }
@@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
 	u32 val;
 
-	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
 	if (likely(val == 0))
 		return;
 	queued_spin_lock_slowpath(lock, val);
@@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * smp_mb__before_atomic() in order to guarantee release semantics
 	 */
-	smp_mb__before_atomic_dec();
+	smp_mb__before_atomic();
 	atomic_sub(_Q_LOCKED_VAL, &lock->val);
 }
 #endif
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 87e9ce6..7868418 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -14,8 +14,9 @@
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
  * (C) Copyright 2013-2014 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -176,7 +177,12 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	struct __qspinlock *l = (void *)lock;
 
-	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+	/*
+	 * Use release semantics to make sure that the MCS node is properly
+	 * initialized before changing the tail code.
+	 */
+	return (u32)xchg_release(&l->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
 #else /* _Q_PENDING_BITS == 8 */
@@ -208,7 +214,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Use release semantics to make sure that the MCS node is
+		 * properly initialized before changing the tail code.
+		 */
+		old = atomic_cmpxchg_release(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -319,7 +329,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		if (val == new)
 			new |= _Q_PENDING_VAL;
 
-		old = atomic_cmpxchg(&lock->val, val, new);
+		/*
+		 * Acquire semantic is required here as the function may
+		 * return immediately if the lock was free.
+		 */
+		old = atomic_cmpxchg_acquire(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -426,7 +440,12 @@ queue:
 			set_locked(lock);
 			break;
 		}
-		old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+		/*
+		 * The smp_load_acquire() call above has provided the necessary
+		 * acquire semantics required for locking. At most two
+		 * iterations of this loop may be ran.
+		 */
+		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
 			goto release;	/* No contention */
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
@ 2015-10-20  3:17   ` Boqun Feng
  2015-10-20  3:46     ` Davidlohr Bueso
  0 siblings, 1 reply; 9+ messages in thread
From: Boqun Feng @ 2015-10-20  3:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
	linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 2295 bytes --]

Hi Waiman,

On Thu, Oct 15, 2015 at 06:51:03PM -0400, Waiman Long wrote:
> This patch replaces the cmpxchg() and xchg() calls in the native
> qspinlock code with the more relaxed _acquire or _release versions of
> those calls to enable other architectures to adopt queued spinlocks
> with less memory barrier performance overhead.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  include/asm-generic/qspinlock.h |    9 +++++----
>  kernel/locking/qspinlock.c      |   29 ++++++++++++++++++++++++-----
>  2 files changed, 29 insertions(+), 9 deletions(-)
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index e2aadbc..39e1cb2 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -12,8 +12,9 @@
>   * GNU General Public License for more details.
>   *
>   * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
> + * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
>   *
> - * Authors: Waiman Long <waiman.long@hp.com>
> + * Authors: Waiman Long <waiman.long@hpe.com>
>   */
>  #ifndef __ASM_GENERIC_QSPINLOCK_H
>  #define __ASM_GENERIC_QSPINLOCK_H
> @@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
>  	if (!atomic_read(&lock->val) &&
> -	   (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> +	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
>  		return 1;
>  	return 0;
>  }
> @@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
>  {
>  	u32 val;
>  
> -	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
> +	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
>  	if (likely(val == 0))
>  		return;
>  	queued_spin_lock_slowpath(lock, val);
> @@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
>  	/*
>  	 * smp_mb__before_atomic() in order to guarantee release semantics
>  	 */
> -	smp_mb__before_atomic_dec();
> +	smp_mb__before_atomic();
>  	atomic_sub(_Q_LOCKED_VAL, &lock->val);

Just be curious, you don't use atomic_sub_release() here on purpose?

Regards,
Boqun

>  }
>  #endif

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  2015-10-20  3:17   ` Boqun Feng
@ 2015-10-20  3:46     ` Davidlohr Bueso
  2015-10-20  3:48       ` Boqun Feng
  0 siblings, 1 reply; 9+ messages in thread
From: Davidlohr Bueso @ 2015-10-20  3:46 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, x86, linux-kernel, Scott J Norton, Douglas Hatch

[-- Attachment #1: Type: text/plain, Size: 542 bytes --]

On Tue, 20 Oct 2015, Boqun Feng wrote:

>> @@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
>>  	/*
>>  	 * smp_mb__before_atomic() in order to guarantee release semantics
>>  	 */
>> -	smp_mb__before_atomic_dec();
>> +	smp_mb__before_atomic();
>>  	atomic_sub(_Q_LOCKED_VAL, &lock->val);
>
>Just be curious, you don't use atomic_sub_release() here on purpose?

atomic_sub() does not imply barriers, so there's no relaxed variants; that's
only for _return() (and such) to the caller.

Thanks,
Davidlohr

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  2015-10-20  3:46     ` Davidlohr Bueso
@ 2015-10-20  3:48       ` Boqun Feng
  0 siblings, 0 replies; 9+ messages in thread
From: Boqun Feng @ 2015-10-20  3:48 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, x86, linux-kernel, Scott J Norton, Douglas Hatch

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On Mon, Oct 19, 2015 at 08:46:02PM -0700, Davidlohr Bueso wrote:
> On Tue, 20 Oct 2015, Boqun Feng wrote:
> 
> >>@@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
> >> 	/*
> >> 	 * smp_mb__before_atomic() in order to guarantee release semantics
> >> 	 */
> >>-	smp_mb__before_atomic_dec();
> >>+	smp_mb__before_atomic();
> >> 	atomic_sub(_Q_LOCKED_VAL, &lock->val);
> >
> >Just be curious, you don't use atomic_sub_release() here on purpose?
> 
> atomic_sub() does not imply barriers, so there's no relaxed variants; that's
> only for _return() (and such) to the caller.
> 

Ah.. my mistake ;-(

Thank you.

Regards,
Boqun

> Thanks,
> Davidlohr



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH tip/locking/core v8 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path
  2015-10-15 22:51 [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
@ 2015-10-15 22:51 ` Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2015-10-15 22:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:
 1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath
    as simple as possible..
 2) For x86-64, hand-coded an assembly function to combine the register
    saving thunk code with the fastpath code. Only registers that
    are used in the fastpath will be saved and restored. If the
    fastpath fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores
    only one 32-bit register.

With a microbenchmark of 5M lock-unlock loop, the table below shows
the execution times before and after the patch with different number
of threads in a VM running on a 32-core Westmere-EX box with x86-64
4.2-rc1 based kernels:

  Threads	Before patch	After patch	% Change
  -------	------------	-----------	--------
     1		   134.1 ms	  119.3 ms	  -11%
     2		   1286  ms	   953  ms	  -26%
     3		   3715  ms	  3480  ms	  -6.3%
     4		   4092  ms	  3764  ms	  -8.0%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++++++++++++++++++++++++++++
 kernel/locking/qspinlock_paravirt.h       |   43 +++++++++++++--------
 2 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..9f92c18 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
 #ifndef __ASM_QSPINLOCK_PARAVIRT_H
 #define __ASM_QSPINLOCK_PARAVIRT_H
 
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock	__pv_queued_spin_unlock
+#define PV_UNLOCK		"__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH	"__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the registers saving trunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ *	struct __qspinlock *l = (void *)lock;
+ *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ *	if (likely(lockval == _Q_LOCKED_VAL))
+ *		return;
+ *	pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ *   rdi = lock              (first argument)
+ *   rsi = lockval           (second argument)
+ *   rdx = internal variable (set to 0)
+ */
+asm    (".pushsection .text;"
+	".globl " PV_UNLOCK ";"
+	".align 4,0x90;"
+	PV_UNLOCK ": "
+	"push  %rdx;"
+	"mov   $0x1,%eax;"
+	"xor   %edx,%edx;"
+	"lock cmpxchg %dl,(%rdi);"
+	"cmp   $0x1,%al;"
+	"jne   .slowpath;"
+	"pop   %rdx;"
+	"ret;"
+	".slowpath: "
+	"push   %rsi;"
+	"movzbl %al,%esi;"
+	"call " PV_UNLOCK_SLOWPATH ";"
+	"pop    %rsi;"
+	"pop    %rdx;"
+	"ret;"
+	".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+	".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
 PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
 
+#endif /* CONFIG_64BIT */
 #endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f0450ff..4bd323d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -308,23 +308,14 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
 	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
-	u8 locked;
-
-	/*
-	 * We must not unlock if SLOW, because in that case we must first
-	 * unhash. Otherwise it would be possible to have multiple @lock
-	 * entries, which would be BAD.
-	 */
-	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
-	if (likely(locked == _Q_LOCKED_VAL))
-		return;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
 		WARN(!debug_locks_silent,
@@ -363,12 +354,32 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 */
 	pv_kick(node->cpu);
 }
+
 /*
  * Include the architecture specific callee-save thunk of the
  * __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
  */
 #include <asm/qspinlock_paravirt.h>
 
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+	u8 locked;
+
+	/*
+	 * We must not unlock if SLOW, because in that case we must first
+	 * unhash. Otherwise it would be possible to have multiple @lock
+	 * entries, which would be BAD.
+	 */
+	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+	if (likely(locked == _Q_LOCKED_VAL))
+		return;
+
+	__pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif /* __pv_queued_spin_unlock */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH tip/locking/core v8 3/5] locking/pvqspinlock: Collect slowpath lock statistics
  2015-10-15 22:51 [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
@ 2015-10-15 22:51 ` Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 5/5] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2015-10-15 22:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data which
enable us to calculate the kicking and wakeup latencies which have
a heavy dependency on the CPUs being used.

Debugfs is used for managing the statistics counts for its simplicity.
It has the limitation that per-cpu statistic counters cannot be
used.  Therefore, atomic instructions will be needed to update the
counters. This introduces a certain amount of performance overhead
when this feature is enabled.

The measured latencies for different CPUs are:

	CPU		Wakeup		Kicking
	---		------		-------
	Haswell-EX	63.6us		 7.4us
	Westmere-EX	67.6us		 9.3us

The measured latencies varied a bit from run-to-run. The wakeup
latency is much higher than the kicking latency.

A sample of statistics counts after system bootup (with vCPU
overcommit) was:

hash_hops_count=9001
kick_latencies=138047878
kick_unlock_count=9001
kick_wait_count=9000
spurious_wakeup=3
wait_again_count=2
wait_head_count=10
wait_node_count=8994
wake_latencies=713195944

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 arch/x86/Kconfig                    |    9 ++
 kernel/locking/qspinlock_paravirt.h |  169 +++++++++++++++++++++++++++++++++-
 2 files changed, 173 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ef36fe1..d920ce1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -721,6 +721,15 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config QUEUED_LOCK_STAT
+	bool "Paravirt queued lock statistics"
+	depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+	---help---
+	  Enable the collection of statistical data on the behavior of
+	  paravirtualized queued spinlocks and report them on debugfs.
+	  There will be a slight performance overhead when this option
+	  is enabled.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4bd323d..730fdac 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,148 @@ struct pv_node {
 };
 
 /*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+	pvstat_wait_head,
+	pvstat_wait_node,
+	pvstat_wait_again,
+	pvstat_kick_wait,
+	pvstat_kick_unlock,
+	pvstat_spurious,
+	pvstat_hops,
+	pvstat_num	/* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statiatics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+	[pvstat_wait_head]   = "wait_head_count",
+	[pvstat_wait_node]   = "wait_node_count",
+	[pvstat_wait_again]  = "wait_again_count",
+	[pvstat_kick_wait]   = "kick_wait_count",
+	[pvstat_kick_unlock] = "kick_unlock_count",
+	[pvstat_spurious]    = "spurious_wakeup",
+	[pvstat_hops]	     = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency   = pv_kick_latencies/kick_unlock_count
+ * Avg wake latency   = pv_wake_latencies/kick_wait_count
+ * Avg # of hops/hash = hash_hops_count/kick_unlock_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+	struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+	int i;
+
+	if (!d_pvqlock)
+		pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+	for (i = 0; i < pvstat_num; i++)
+		debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+				  (u32 *)&pvstats[i]);
+	debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+			   (u64 *)&pv_kick_latencies);
+	debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+			   (u64 *)&pv_wake_latencies);
+	debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+	return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+	int i;
+
+	for (i = 0; i < pvstat_num; i++)
+		atomic_set(&pvstats[i], 0);
+	atomic64_set(&pv_kick_latencies, 0);
+	atomic64_set(&pv_wake_latencies, 0);
+	reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat, bool cond)
+{
+	if (cond)
+		atomic_inc(&pvstats[stat]);
+	if (unlikely(reset_cnts))
+		pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+	atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+	u64 start = sched_clock();
+
+	*per_cpu_ptr(&pv_kick_time, cpu) = start;
+	pv_kick(cpu);
+	atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+	*pkick_time = 0;
+	pv_wait(ptr, val);
+	if (*pkick_time) {
+		atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+		pvstat_inc(pvstat_kick_wait, true);
+	}
+}
+
+#define pv_kick(c)	__pv_kick(c)
+#define pv_wait(p, v)	__pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat, bool cond)	{ }
+static inline void pvstat_hop(int hopcnt)				{ }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
  * Lock and MCS node addresses hash table for fast lookup
  *
  * Hashing is done on a per-cacheline basis to minimize the need to access
@@ -100,10 +242,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
 	unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
 	struct pv_hash_entry *he;
+	int hopcnt = 0;
 
 	for_each_hash_entry(he, offset, hash) {
+		hopcnt++;
 		if (!cmpxchg(&he->lock, NULL, lock)) {
 			WRITE_ONCE(he->node, node);
+			pvstat_hop(hopcnt);
 			return &he->lock;
 		}
 	}
@@ -164,9 +309,11 @@ static void pv_init_node(struct mcs_spinlock *node)
 static void pv_wait_node(struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	int waitcnt = 0;
 	int loop;
 
-	for (;;) {
+	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
@@ -184,12 +331,16 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 */
 		smp_store_mb(pn->state, vcpu_halted);
 
-		if (!READ_ONCE(node->locked))
+		if (!READ_ONCE(node->locked)) {
+			pvstat_inc(pvstat_wait_node, true);
+			pvstat_inc(pvstat_wait_again, waitcnt);
 			pv_wait(&pn->state, vcpu_halted);
+		}
 
 		/*
-		 * If pv_kick_node() changed us to vcpu_hashed, retain that value
-		 * so that pv_wait_head() knows to not also try to hash this lock.
+		 * If pv_kick_node() changed us to vcpu_hashed, retain that
+		 * value so that pv_wait_head() knows to not also try to hash
+		 * this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -200,6 +351,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		 * So it is better to spin for a while in the hope that the
 		 * MCS lock will be released soon.
 		 */
+		pvstat_inc(pvstat_spurious, !READ_ONCE(node->locked));
 	}
 
 	/*
@@ -250,6 +402,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
+	int waitcnt = 0;
 	int loop;
 
 	/*
@@ -259,7 +412,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 	if (READ_ONCE(pn->state) == vcpu_hashed)
 		lp = (struct qspinlock **)1;
 
-	for (;;) {
+	for (;; waitcnt++) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (!READ_ONCE(l->locked))
 				return;
@@ -290,14 +443,19 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 				return;
 			}
 		}
+		pvstat_inc(pvstat_wait_head, true);
+		pvstat_inc(pvstat_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
+		if (!READ_ONCE(l->locked))
+			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
 		 * wakeup and so the vCPU should wait again after spinning for
 		 * a while.
 		 */
+		pvstat_inc(pvstat_spurious, true);
 	}
 
 	/*
@@ -352,6 +510,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * vCPU is harmless other than the additional latency in completing
 	 * the unlock.
 	 */
+	pvstat_inc(pvstat_kick_unlock, true);
 	pv_kick(node->cpu);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH tip/locking/core v8 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
  2015-10-15 22:51 [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
                   ` (2 preceding siblings ...)
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-10-15 22:51 ` Waiman Long
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 5/5] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2015-10-15 22:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt.  This helps to
reduce the performance penalty caused by lock waiter preemption while
not having much of the downsides of a real unfair lock.

The pv_wait_head() function was renamed as pv_wait_head_lock() as it
was modified to acquire the lock before returning. This is necessary
because of possible lock stealing attempts from other tasks.

Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                    Haswell
  Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
  -----         --------    --------    --------    --------
  Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
  After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s

For the overcommited case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |   29 ++++++--
 kernel/locking/qspinlock_paravirt.h |  126 +++++++++++++++++++++++++++++------
 2 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7868418..76f3441 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -251,15 +251,16 @@ static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_head(struct qspinlock *lock,
-					   struct mcs_spinlock *node) { }
+static __always_inline u32  __pv_wait_head_lock(struct qspinlock *lock,
+						struct mcs_spinlock *node)
+						{ return 0; }
 
 #define pv_enabled()		false
 
 #define pv_init_node		__pv_init_node
 #define pv_wait_node		__pv_wait_node
 #define pv_kick_node		__pv_kick_node
-#define pv_wait_head		__pv_wait_head
+#define pv_wait_head_lock	__pv_wait_head_lock
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
@@ -420,10 +421,22 @@ queue:
 	 * sequentiality; this is because the set_locked() function below
 	 * does not imply a full barrier.
 	 *
+	 * The PV pv_wait_head_lock function, if active, will acquire the lock
+	 * and return a non-zero value. So we have to skip the
+	 * smp_load_acquire() call. As the next PV queue head hasn't been
+	 * designated yet, there is no way for the locked value to become
+	 * _Q_SLOW_VAL. So both the redundant set_locked() and the
+	 * atomic_cmpxchg_relaxed() calls will be safe. The cost of the
+	 * redundant set_locked() call below should be negligible, too.
+	 *
+	 * If PV isn't active, 0 will be returned instead.
 	 */
-	pv_wait_head(lock, node);
-	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
-		cpu_relax();
+	val = pv_wait_head_lock(lock, node);
+	if (!val) {
+		while ((val = smp_load_acquire(&lock->val.counter))
+				& _Q_LOCKED_PENDING_MASK)
+			cpu_relax();
+	}
 
 	/*
 	 * claim the lock:
@@ -436,7 +449,7 @@ queue:
 	 * to grab the lock.
 	 */
 	for (;;) {
-		if (val != tail) {
+		if ((val & _Q_TAIL_MASK) != tail) {
 			set_locked(lock);
 			break;
 		}
@@ -481,7 +494,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #undef pv_init_node
 #undef pv_wait_node
 #undef pv_kick_node
-#undef pv_wait_head
+#undef pv_wait_head_lock
 
 #undef  queued_spin_lock_slowpath
 #define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 730fdac..f447ef8 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,56 @@ struct pv_node {
 };
 
 /*
+ * Allow one unfair trylock when entering the PV slowpath when the pending
+ * bit isn't set to reduce the performance impact of lock waiter preemption
+ *
+ * By replacing the regular queued_spin_trylock() with the function below,
+ * it will be called once when a lock waiter enter the slowpath before being
+ * queued.
+ *
+ * A little bit of unfairness here can improve performance without many
+ * of the downsides of a real unfair lock.
+ */
+#define queued_spin_trylock(l)	pv_queued_spin_trylock_unfair(l)
+static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
+		(cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
+}
+
+/*
+ * The pending bit is used by the queue head vCPU to indicate that it
+ * is actively spinning on the lock and no lock stealing is allowed.
+ */
+#if _Q_PENDING_BITS == 8
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->pending, 0);
+}
+
+static __always_inline void set_pending(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->pending, 1);
+}
+#else /* _Q_PENDING_BITS == 8 */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_clear_mask(&lock->val, _Q_PENDING_MASK);
+}
+
+static __always_inline void set_pending(struct qspinlock *lock)
+{
+	atomic_set_mask(&lock->val, _Q_PENDING_MASK);
+}
+#endif /* _Q_PENDING_BITS == 8 */
+
+/*
  * PV qspinlock statistics
  */
 enum pv_qlock_stat {
@@ -49,6 +99,7 @@ enum pv_qlock_stat {
 	pvstat_wait_again,
 	pvstat_kick_wait,
 	pvstat_kick_unlock,
+	pvstat_steal_lock,
 	pvstat_spurious,
 	pvstat_hops,
 	pvstat_num	/* Total number of statistics counts */
@@ -67,6 +118,7 @@ static const char * const stat_fsnames[pvstat_num] = {
 	[pvstat_wait_again]  = "wait_again_count",
 	[pvstat_kick_wait]   = "kick_wait_count",
 	[pvstat_kick_unlock] = "kick_unlock_count",
+	[pvstat_steal_lock]  = "lock_stealing_count",
 	[pvstat_spurious]    = "spurious_wakeup",
 	[pvstat_hops]	     = "hash_hops_count",
 };
@@ -146,6 +198,19 @@ static inline void pvstat_hop(int hopcnt)
 }
 
 /*
+ * PV unfair trylock count tracking function
+ */
+static inline int pvstat_trylock_unfair(struct qspinlock *lock)
+{
+	int ret = pv_queued_spin_trylock_unfair(lock);
+
+	pvstat_inc(pvstat_steal_lock, ret);
+	return ret;
+}
+#undef  queued_spin_trylock
+#define queued_spin_trylock(l)	pvstat_trylock_unfair(l)
+
+/*
  * Replacement function for pv_kick()
  */
 static inline void __pv_kick(int cpu)
@@ -339,8 +404,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
 
 		/*
 		 * If pv_kick_node() changed us to vcpu_hashed, retain that
-		 * value so that pv_wait_head() knows to not also try to hash
-		 * this lock.
+		 * value so that pv_wait_head_lock() knows to not also try
+		 * to hash this lock.
 		 */
 		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
 
@@ -364,8 +429,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
 /*
  * Called after setting next->locked = 1 when we're the lock owner.
  *
- * Instead of waking the waiters stuck in pv_wait_node() advance their state such
- * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_lock(), this avoids a
+ * wake/sleep cycle.
  */
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
@@ -394,10 +460,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * Wait for l->locked to become clear; halt the vcpu after a short spin.
+ * Wait for l->locked to become clear and acquire the lock;
+ * halt the vcpu after a short spin.
  * __pv_queued_spin_unlock() will wake us.
+ *
+ * The current value of the lock will be returned for additional processing.
  */
-static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 	struct __qspinlock *l = (void *)lock;
@@ -413,11 +482,24 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 		lp = (struct qspinlock **)1;
 
 	for (;; waitcnt++) {
+		/*
+		 * Set the pending bit in the active lock spinning loop to
+		 * disable lock stealing. However, the pending bit check in
+		 * pv_queued_spin_trylock_unfair() and the setting/clearing
+		 * of pending bit here aren't memory barriers. So a cmpxchg()
+		 * is used to acquire the lock to be sure.
+		 */
+		set_pending(lock);
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
-			if (!READ_ONCE(l->locked))
-				return;
+			if (!READ_ONCE(l->locked) &&
+			   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+				clear_pending(lock);
+				goto gotlock;
+			}
 			cpu_relax();
 		}
+		clear_pending(lock);
+
 
 		if (!lp) { /* ONCE */
 			lp = pv_hash(lock, pn);
@@ -433,36 +515,36 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
 				/*
-				 * The lock is free and _Q_SLOW_VAL has never
-				 * been set. Therefore we need to unhash before
-				 * getting the lock.
+				 * The lock was free and now we own the lock.
+				 * Change the lock value back to _Q_LOCKED_VAL
+				 * and unhash the table.
 				 */
+				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
-				return;
+				goto gotlock;
 			}
 		}
 		pvstat_inc(pvstat_wait_head, true);
 		pvstat_inc(pvstat_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
 
-		if (!READ_ONCE(l->locked))
-			return;
 		/*
 		 * The unlocker should have freed the lock before kicking the
 		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup and so the vCPU should wait again after spinning for
-		 * a while.
+		 * wakeup or another vCPU has stolen the lock. The current
+		 * vCPU should spin again.
 		 */
-		pvstat_inc(pvstat_spurious, true);
+		pvstat_inc(pvstat_spurious, READ_ONCE(l->locked));
 	}
 
 	/*
-	 * Lock is unlocked now; the caller will acquire it without waiting.
-	 * As with pv_wait_node() we rely on the caller to do a load-acquire
-	 * for us.
+	 * The cmpxchg() or xchg() call before coming here provides the
+	 * acquire semantics for locking.
 	 */
+gotlock:
+	return (u32)atomic_read(&lock->val);
 }
 
 /*
@@ -487,7 +569,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * so we need a barrier to order the read of the node data in
 	 * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
 	 *
-	 * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
+	 * Matches the cmpxchg() in pv_wait_head_lock() setting _Q_SLOW_VAL.
 	 */
 	smp_rmb();
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH tip/locking/core v8 5/5] locking/pvqspinlock: Queue node adaptive spinning
  2015-10-15 22:51 [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
                   ` (3 preceding siblings ...)
  2015-10-15 22:51 ` [PATCH tip/locking/core v8 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-10-15 22:51 ` Waiman Long
  4 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2015-10-15 22:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon as the earlier ones have to be woken up and take
their turn to get the lock.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.

Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

		    Westmere			Haswell
  Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
  -----		--------    --------    --------    --------
  Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
  After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/locking/qspinlock.c          |    5 ++-
 kernel/locking/qspinlock_paravirt.h |   47 +++++++++++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 76f3441..b72c9b0 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,7 +248,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+					   struct mcs_spinlock *prev) { }
 static __always_inline void __pv_kick_node(struct qspinlock *lock,
 					   struct mcs_spinlock *node) { }
 static __always_inline u32  __pv_wait_head_lock(struct qspinlock *lock,
@@ -406,7 +407,7 @@ queue:
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node);
+		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
 	}
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f447ef8..46c6644 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,6 +23,19 @@
 #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
 
 /*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will stop spinning if the vCPU in the previous node is
+ * not running. The one lock stealing attempt allowed at slowpath entry
+ * mitigates the slight slowdown for non-overcommitted guest with this
+ * aggressive wait-early mechanism.
+ *
+ * The status of the previous node will be checked at fixed interval
+ * controlled by PV_PREV_CHECK_MASK.
+ */
+#define PV_PREV_CHECK_MASK	0xff
+
+/*
  * Queue node uses: vcpu_running & vcpu_halted.
  * Queue head uses: vcpu_running & vcpu_hashed.
  */
@@ -97,6 +110,7 @@ enum pv_qlock_stat {
 	pvstat_wait_head,
 	pvstat_wait_node,
 	pvstat_wait_again,
+	pvstat_wait_early,
 	pvstat_kick_wait,
 	pvstat_kick_unlock,
 	pvstat_steal_lock,
@@ -116,6 +130,7 @@ static const char * const stat_fsnames[pvstat_num] = {
 	[pvstat_wait_head]   = "wait_head_count",
 	[pvstat_wait_node]   = "wait_node_count",
 	[pvstat_wait_again]  = "wait_again_count",
+	[pvstat_wait_early]  = "wait_early_count",
 	[pvstat_kick_wait]   = "kick_wait_count",
 	[pvstat_kick_unlock] = "kick_unlock_count",
 	[pvstat_steal_lock]  = "lock_stealing_count",
@@ -354,6 +369,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 }
 
 /*
+ * Return true if when it is time to check the previous node which is not
+ * in a running state.
+ */
+static inline bool
+pv_wait_early(struct pv_node *prev, int loop)
+{
+
+	if ((loop & PV_PREV_CHECK_MASK) != 0)
+		return false;
+
+	return READ_ONCE(prev->state) != vcpu_running;
+}
+
+/*
  * Initialize the PV part of the mcs_spinlock node.
  */
 static void pv_init_node(struct mcs_spinlock *node)
@@ -371,17 +400,23 @@ static void pv_init_node(struct mcs_spinlock *node)
  * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
  * behalf.
  */
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 {
 	struct pv_node *pn = (struct pv_node *)node;
+	struct pv_node *pp = (struct pv_node *)prev;
 	int waitcnt = 0;
 	int loop;
+	bool wait_early;
 
 	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
 	for (;; waitcnt++) {
-		for (loop = SPIN_THRESHOLD; loop; loop--) {
+		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
+			if (pv_wait_early(pp, loop)) {
+				wait_early = true;
+				break;
+			}
 			cpu_relax();
 		}
 
@@ -399,6 +434,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 		if (!READ_ONCE(node->locked)) {
 			pvstat_inc(pvstat_wait_node, true);
 			pvstat_inc(pvstat_wait_again, waitcnt);
+			pvstat_inc(pvstat_wait_early, wait_early);
 			pv_wait(&pn->state, vcpu_halted);
 		}
 
@@ -483,6 +519,12 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 
 	for (;; waitcnt++) {
 		/*
+		 * Set correct vCPU state to be used by queue node wait-early
+		 * mechanism.
+		 */
+		WRITE_ONCE(pn->state, vcpu_running);
+
+		/*
 		 * Set the pending bit in the active lock spinning loop to
 		 * disable lock stealing. However, the pending bit check in
 		 * pv_queued_spin_trylock_unfair() and the setting/clearing
@@ -526,6 +568,7 @@ static u32 pv_wait_head_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				goto gotlock;
 			}
 		}
+		WRITE_ONCE(pn->state, vcpu_halted);
 		pvstat_inc(pvstat_wait_head, true);
 		pvstat_inc(pvstat_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-10-20  3:49 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-15 22:51 [PATCH tip/locking/core v8 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
2015-10-15 22:51 ` [PATCH tip/locking/core v8 1/5] locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg Waiman Long
2015-10-20  3:17   ` Boqun Feng
2015-10-20  3:46     ` Davidlohr Bueso
2015-10-20  3:48       ` Boqun Feng
2015-10-15 22:51 ` [PATCH tip/locking/core v8 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
2015-10-15 22:51 ` [PATCH tip/locking/core v8 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
2015-10-15 22:51 ` [PATCH tip/locking/core v8 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-10-15 22:51 ` [PATCH tip/locking/core v8 5/5] locking/pvqspinlock: Queue node adaptive spinning Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.