All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/7] locking/qspinlock: Enhance pvqspinlock performance
@ 2015-08-01  2:21 Waiman Long
  2015-08-01  2:21 ` [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL Waiman Long
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Waiman Long @ 2015-08-01  2:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
	Waiman Long

v3->v4:
 - Patch 1: add comment about possible racing condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Move PV unlock optimization patch forward to patch 4 & rerun
   performance test.

v2->v3:
 - Moved deferred kicking enablement patch forward & move back
   the kick-ahead patch to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted back to use state as a tri-state variable instead of
   adding an additional bistate variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Add a new patch to optimize PV unlock code path performance.

v1->v2:
 - Take out the queued unfair lock patches
 - Add a patch to simplify the PV unlock code
 - Move pending bit and statistics collection patches to the front
 - Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
   when appropriate.
 - Change the wait-early patch to use adaptive spinning to better
   balance the difference effect on normal and over-committed guests.
 - Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both normal and
over-commmitted VM guests. The kick-ahead and adaptive spinning
patches are inspired by the "Do Virtual Machines Really Scale?" blog
from Sanidhya Kashyap.

Patch 1 simplifies the unlock code by doing unconditional vCPU kick
when _Q_SLOW_VAL is set as the chance of spurious wakeup showing
up in the statistical data that I collected was very low (1 or 2
occasionally).

Patch 2 adds pending bit support to pvqspinlock improving performance
at light load.

Patch 3 allows the collection of various count data that are useful
to see what is happening in the system. They do add a bit of overhead
when enabled slowing performance a tiny bit.

Patch 4 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 5 is an enablement patch for deferring vCPU kickings from the
lock side to the unlock side.

Patch 6 enables multiple vCPU kick-ahead's at unlock time, outside of
the critical section which can improve performance in overcommitted
guests and sometime even in normal guests.

Patch 7 enables adaptive spinning in the queue nodes. This patch can
lead to pretty big performance increase in over-committed guest at
the expense of a slight performance hit in normal guests.

Patches 2 & 4 improves performance of common uncontended and lightly
contended cases. Patches 5-7 are for improving performance in
over-committed VM guests.

Performance measurements were done on a 32-CPU Westmere-EX and
Haswell-EX systems. The Westmere-EX system got the most performance
gain from patch 5, whereas the Haswell-EX system got the most gain
from patch 6 for over-committed guests.

The table below shows the Linux kernel build times for various
values of PV_KICK_AHEAD_MAX on an over-committed 48-vCPU guest on
the Westmere-EX system:

  PV_KICK_AHEAD_MAX	Patches 1-5	Patches 1-6
  -----------------	-----------	-----------
	  1		  9m46.9s	 11m10.1s
	  2		  9m40.2s	 10m08.3s
	  3		  9m36.8s	  9m49.8s
	  4		  9m35.9s	  9m38.7s
	  5		  9m35.1s	  9m33.0s
	  6		  9m35.7s	  9m28.5s

With patches 1-5, the performance wasn't very sensitive to different
PV_KICK_AHEAD_MAX values. Adding patch 6 into the mix, however, changes
the picture quite dramatically. There is a performance regression if
PV_KICK_AHEAD_MAX is too small. Starting with a value of 4, increasing
PV_KICK_AHEAD_MAX only gets us a minor benefit.

Waiman Long (7):
  locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL
  locking/pvqspinlock: Add pending bit support
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call
  locking/pvqspinlock: Allow vCPUs kick-ahead
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    7 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 +++
 kernel/locking/qspinlock.c                |   38 ++-
 kernel/locking/qspinlock_paravirt.h       |  546 +++++++++++++++++++++++++++--
 4 files changed, 612 insertions(+), 38 deletions(-)

 v3-to-v4 diff:

 arch/x86/include/asm/qspinlock_paravirt.h |    5 +--
 kernel/locking/qspinlock_paravirt.h       |   45 +++++++++++++++++-----------
 2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 46f0f82..3001972 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -9,10 +9,10 @@
  */
 #ifdef CONFIG_64BIT
 
-PV_CALLEE_SAVE_REGS_THUNK(pv_queued_spin_unlock_slowpath);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
 #define __pv_queued_spin_unlock	__pv_queued_spin_unlock
 #define PV_UNLOCK		"__raw_callee_save___pv_queued_spin_unlock"
-#define PV_UNLOCK_SLOWPATH	"__raw_callee_save_pv_queued_spin_unlock_slowpath"
+#define PV_UNLOCK_SLOWPATH	"__raw_callee_save___pv_queued_spin_unlock_slowpath"
 
 /*
  * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
@@ -46,7 +46,6 @@ asm(".pushsection .text;"
     "jne   .slowpath;"
     "pop   %rdx;"
     "ret;"
-    "nop;"
     ".slowpath: "
     "push   %rsi;"
     "movzbl %al,%esi;"
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 56c3717..d04911b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -463,16 +463,16 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
 	/*
 	 * wait for in-progress pending->locked hand-overs
 	 */
-	if (val == _Q_PENDING_VAL) {
-		while (((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) &&
-			loop--)
-			cpu_relax();
+	while ((val == _Q_PENDING_VAL) && loop) {
+		cpu_relax();
+		val = atomic_read(&lock->val);
+		loop--;
 	}
 
 	/*
 	 * trylock || pending
 	 */
-	for (;;) {
+	for (;; loop--) {
 		if (val & ~_Q_LOCKED_MASK)
 			goto queue;
 		new = _Q_LOCKED_VAL;
@@ -481,7 +481,7 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
 		old = atomic_cmpxchg(&lock->val, val, new);
 		if (old == val)
 			break;
-		if (loop-- <= 0)
+		if (!loop)
 			goto queue;
 	}
 
@@ -490,16 +490,18 @@ static int pv_pending_lock(struct qspinlock *lock, u32 val)
 	/*
 	 * We are pending, wait for the owner to go away.
 	 */
-	while (((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
-		&& (loop-- > 0))
-		cpu_relax();
-
-	if (!(val & _Q_LOCKED_MASK)) {
-		clear_pending_set_locked(lock);
-		goto gotlock;
+	for (; loop; loop--, cpu_relax()) {
+		val = smp_load_acquire(&lock->val.counter);
+		if (!(val & _Q_LOCKED_MASK)) {
+			clear_pending_set_locked(lock);
+			pvstat_inc(pvstat_pend_lock);
+			goto gotlock;
+		}
 	}
+
 	/*
-	 * Clear the pending bit and fall back to queuing
+	 * Fail to acquire the lock within the spinning period,
+	 * so clear the pending bit and fall back to queuing.
 	 */
 	clear_pending(lock);
 	pvstat_inc(pvstat_pend_fail);
@@ -719,11 +721,11 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 }
 
 /*
- * PV version of the unlock fastpath and slowpath functions to be used
- * in stead of queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
  */
 __visible void
-pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
 {
 	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node, *next;
@@ -771,6 +773,13 @@ pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
 	/*
 	 * At this point the memory pointed at by lock can be freed/reused,
 	 * however we can still use the pv_node to kick the CPU.
+	 *
+	 * As smp_store_release() is not a full barrier, adding a check to
+	 * the node->state doesn't guarantee the checking is really done
+	 * after clearing the lock byte since they are in 2 separate
+	 * cachelines and so hardware can reorder them. So either we insert
+	 * memory barrier here and in the corresponding pv_wait_head()
+	 * function or we do an unconditional kick which is what is done here.
 	 */
 	pvstat_inc(pvstat_unlock_kick);
 	pv_kick(node->cpu);
@@ -803,6 +812,6 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	if (likely(lockval == _Q_LOCKED_VAL))
 		return;
 
-	pv_queued_spin_unlock_slowpath(lock, lockval);
+	__pv_queued_spin_unlock_slowpath(lock, lockval);
 }
 #endif /* __pv_queued_spin_unlock */

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-08-04  3:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-01  2:21 [PATCH v4 0/7] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
2015-08-01  2:21 ` [PATCH v4 1/7] locking/pvqspinlock: Unconditional PV kick with _Q_SLOW_VAL Waiman Long
2015-08-01 18:01   ` Davidlohr Bueso
2015-08-01 20:12     ` Long, Wai Man
2015-08-01 22:29   ` Peter Zijlstra
2015-08-03 18:22     ` Davidlohr Bueso
2015-08-03 18:37       ` Peter Zijlstra
2015-08-03 19:09         ` Davidlohr Bueso
2015-08-03 19:56           ` Peter Zijlstra
2015-08-04  3:26     ` Waiman Long
2015-08-01  2:21 ` [PATCH v4 2/7] locking/pvqspinlock: Add pending bit support Waiman Long
2015-08-03 18:37   ` Davidlohr Bueso
2015-08-03 21:30     ` Waiman Long
2015-08-01  2:22 ` [PATCH v4 3/7] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
2015-08-01  2:22 ` [PATCH v4 4/7] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
2015-08-01  2:22 ` [PATCH v4 5/7] locking/pvqspinlock: Enable deferment of vCPU kicking to unlock call Waiman Long
2015-08-01  2:22 ` [PATCH v4 6/7] locking/pvqspinlock: Allow vCPUs kick-ahead Waiman Long
2015-08-01  2:22 ` [PATCH v4 7/7] locking/pvqspinlock: Queue node adaptive spinning Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.