From: Waiman Long <Waiman.Long@hp.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	xen-devel@lists.xenproject.org, kvm@vger.kernel.org,
	Paolo Bonzini <paolo.bonzini@gmail.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Oleg Nesterov <oleg@redhat.com>, Gleb Natapov <gleb@redhat.com>,
	Aswin Chandramouleeswaran <aswin@hp.com>,
	Scott J Norton <scott.norton@hp.com>,
	Chegu Vinod <chegu_vinod@hp.com>,
	Waiman Long <Waiman.Long@hp.com>
Subject: [PATCH v8 04/10] qspinlock: Optimized code path for 2 contending tasks
Date: Wed,  2 Apr 2014 09:27:33 -0400
Message-ID: <1396445259-27670-5-git-send-email-Waiman.Long@hp.com>
In-Reply-To: <1396445259-27670-1-git-send-email-Waiman.Long@hp.com>

A major problem with the queue spinlock patch is its performance at
low contention level (2-4 contending tasks), where it is slower than
the corresponding ticket spinlock code. The following table shows
the execution time (in ms) of a micro-benchmark where 5M iterations
of the lock/unlock cycle were run on a 10-core Westmere-EX x86-64
CPU with 2 different types of loads - standalone (lock and protected
data in different cachelines) and embedded (lock and protected data
in the same cacheline).
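
For reference, a minimal user-space sketch of such a micro-benchmark is
shown below. It only illustrates the two cacheline layouts and a single
task's lock/unlock loop; the numbers above were taken with the kernel's
ticket and queue spinlock code, and the contended cases run the same
loop concurrently in multiple tasks.

  /*
   * Illustrative user-space benchmark skeleton (not the actual kernel
   * test): time ITERATIONS lock/unlock cycles for two lock/data layouts.
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define ITERATIONS  5000000
  #define CACHELINE   64

  struct embedded {                   /* lock and data in the same cacheline */
      pthread_spinlock_t lock;
      long data;
  } __attribute__((aligned(CACHELINE)));

  struct standalone {                 /* lock and data in different cachelines */
      pthread_spinlock_t lock __attribute__((aligned(CACHELINE)));
      long data               __attribute__((aligned(CACHELINE)));
  };

  static long run(pthread_spinlock_t *lock, long *data)
  {
      struct timespec t1, t2;
      int i;

      clock_gettime(CLOCK_MONOTONIC, &t1);
      for (i = 0; i < ITERATIONS; i++) {
          pthread_spin_lock(lock);
          (*data)++;                  /* the protected operation */
          pthread_spin_unlock(lock);
      }
      clock_gettime(CLOCK_MONOTONIC, &t2);
      return (t2.tv_sec - t1.tv_sec) * 1000 +
             (t2.tv_nsec - t1.tv_nsec) / 1000000;     /* elapsed ms */
  }

  int main(void)
  {
      struct embedded   e = { .data = 0 };
      struct standalone s = { .data = 0 };

      pthread_spin_init(&e.lock, PTHREAD_PROCESS_PRIVATE);
      pthread_spin_init(&s.lock, PTHREAD_PROCESS_PRIVATE);
      printf("embedded  : %ld ms\n", run(&e.lock, &e.data));
      printf("standalone: %ld ms\n", run(&s.lock, &s.data));
      return 0;
  }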

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  135/111	 135/101	  0%/-9%
       2	 1045/950	1884/1853	+80%/+95%
       3	 1827/1783	2256/2264	+23%/+27%
       4	 2689/2725	2880/2884	 +7%/+6%
       5	 3736/3748	3636/3617	 -3%/-3%
       6	 4942/4984	4294/4253	-13%/-15%
       7	 6304/6319	4976/4958	-21%/-22%
       8	 7736/7629	5662/5651	-27%/-26%

It can be seen that the performance degradation is particularly bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention level, a specific optimized code path for
2 contending tasks was added. This special code path can only be
activated with fewer than 16K configured CPUs, because it uses a byte
in the 32-bit lock word to hold a waiting bit for the 2nd contending
task instead of queuing the waiting task in the queue.
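
For illustration, the sketch below restates the queue code encoding
used by this series and the arithmetic behind the 16K limit. The
2-bit index / 14-bit CPU split is my reading of the encoding rather
than something spelled out in this patch.

  /*
   * Restatement of the queue code encoding (a sketch; the 16K limit
   * reasoning is my own reading of the encoding, not part of the patch).
   */
  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define QCODE_BITS  16              /* qcode field is a u16 */
  #define IDX_BITS    2               /* per-cpu queue node index */
  #define MAX_CPUS    ((1u << (QCODE_BITS - IDX_BITS)) - 1)   /* 16383 */

  static uint16_t encode_qcode(unsigned int cpu, unsigned int idx)
  {
      /* same shape as queue_encode_qcode() in the patch */
      return (uint16_t)(((cpu + 1) << IDX_BITS) | idx);
  }

  int main(void)
  {
      /* (cpu + 1) must fit in the 14 bits above the 2 index bits */
      assert(MAX_CPUS == 16383);
      printf("qcode(cpu=5, idx=1) = 0x%04x\n", encode_qcode(5, 1));
      printf("max configured CPUs = %u (i.e. < 16K)\n", MAX_CPUS);
      return 0;
  }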

With the change, the performance data became:

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	 1045/950	 951/955	 -9%/+1%

In a multi-socket server, the optimized code path also seems to
produce a pretty good performance improvement in cross-node contention
traffic at low contention level. The table below shows the performance
with 1 contending task per node:

		[Standalone]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	   135		  135		  0%
       2	  4452		 1024		-77%
       3	 10767		14030		+30%
       4	 20835		10740		-48%

Except for some drop in performance at the 3-node level, the queue
spinlock performs much better than the ticket spinlock with 2 and
4 contending nodes.

With IvyBridge-EX (2.8 GHz), the performance profile of qspinlock vs.
ticket spinlock changes quite a bit because the pause instruction in
IvyBridge-EX is about 3 times as long as that in Westmere-EX (25
cycles vs. 8 cycles according to my measurement). So spinning on the
lock word, as the ticket spinlock does, is less problematic on
IvyBridge-EX.
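
A rough way to approximate such a measurement from user space is
sketched below; it is not the code used for the numbers above.

  /*
   * Rough user-space approximation of the pause latency measurement
   * (an assumption about the method; ignores loop overhead and the
   * TSC vs. core clock ratio, so treat the result as a ballpark).
   */
  #include <stdio.h>
  #include <x86intrin.h>              /* __rdtsc(), _mm_pause() */

  #define LOOPS 10000000ULL

  int main(void)
  {
      unsigned long long start, end, i;

      start = __rdtsc();
      for (i = 0; i < LOOPS; i++)
          _mm_pause();
      end = __rdtsc();

      printf("~%.1f TSC cycles per pause\n", (double)(end - start) / LOOPS);
      return 0;
  }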

The table below shows the results of the same low-level spinlock test
run on one socket of IvyBridge-EX (15 cores, 1M iterations instead
of 5M):

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	   59/48	  56/42		 -5%/-13%
       2	  514/487	 289/345	-44%/-29%
       3	  812/768	1048/1053	+29%/+37%
       4	 1136/1077	1219/1220	 +7%/+13%
       5	 1464/1398	1560/1581	 +7%/+13%
       6	 1806/1787	1952/1959	 +8%/+10%
       7	 2185/2204	2360/2377	 +8%/ +8%
       8	 2582/2656	2759/2747	 +7%/ +3%
       9	 2979/3131	3120/3103	 +5%/ -1%
      10	 3398/3602	3484/3498	 +3%/ -3%
      11	 3848/4110	3807/3829	 -1%/ -7%
      12	 4326/4655	4132/4117	 -4%/-12%

The queue spinlock is still faster than the ticket spinlock with 1 or
2 contending tasks (light spinlock contention). After that, the ticket
spinlock is faster (moderate spinlock contention) until there are
11 or more contending tasks for a standalone spinlock or 9 or more
contending tasks for an embedded spinlock.

The table below shows the performance profile when there is one
contending task from each of the different nodes in an 8-node
prototype IvyBridge-EX machine:

		[Standalone]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	   532		  503		 -5%
       3	  2449		 3812		+56%
       4	  6211		 4571		-26%
       5	  8606		 6104		-29%
       6	  9011		 7641		-15%
       7	 12907		 8373		-35%
       8	 15094		10259		-32%

There is some performance drop at the 3-node level. Other than that,
the queue spinlock is faster than the ticket spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qspinlock.h |    3 +-
 kernel/locking/qspinlock.c       |  212 +++++++++++++++++++++++++++++++++-----
 2 files changed, 187 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index f058b91..265b10b 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -21,9 +21,10 @@ union arch_qspinlock {
 	struct qspinlock slock;
 	struct {
 		u8  lock;	/* Lock bit	*/
-		u8  reserved;
+		u8  wait;	/* Waiting bit	*/
 		u16 qcode;	/* Queue code	*/
 	};
+	u16 lock_wait;		/* Lock and wait bits */
 	u32 qlcode;		/* Complete lock word */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 45c68a4..cf16bba 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -121,6 +121,8 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
  *      o lock      - the lock byte					*
  *      o qcode     - the queue node code				*
  *      o qlcode    - the 32-bit qspinlock word				*
+ *      o wait      - the waiting byte					*
+ *      o lock_wait - the combined lock and waiting bytes		*
  *									*
  ************************************************************************
  */
@@ -131,9 +133,135 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
  * architectures that allows atomic 8/16 bit operations:
  *  1) The 16-bit queue code can be accessed or modified directly as a
  *     16-bit short value without disturbing the first 2 bytes.
+ *  2) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ *     for waiting lock acquirer so that it won't need to go through the
+ *     MCS style locking queuing which has a higher overhead.
  */
+
+/*
+ * Masks for the lock and wait bits
+ */
+#define _QLOCK_WAIT_SHIFT	8	/* Waiting bit position */
+#define _QLOCK_WAITING		(1 << _QLOCK_WAIT_SHIFT)
+#define _QLOCK_LW_MASK		(_QLOCK_WAITING | _QLOCK_LOCKED)
+
 #define queue_encode_qcode(cpu, idx)	(((cpu) + 1) << 2 | (idx))
 
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - quick spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue.
+ *
+ * The lock and wait bits can be in one of following 4 states:
+ *
+ * State lock wait
+ * ----- ---------
+ *  [0]   0    0	Lock is free and no one is waiting
+ *  [1]   1    0	Lock is not available, but no one is waiting
+ *  [2]   0    1	Lock is free, but a waiter is present
+ *  [3]   1    1	Lock is not available and a waiter is present
+ *
+ * A task entering the quick path will set the wait bit and be in either
+ * state 2 or 3. The allowable transitions are:
+ *
+ *   [3] => [2] => [1] => [0]
+ *    ^             |
+ *    +-------------+
+ *
+ * N.B. The wait bit won't be set if the queue code has been set before.
+ *	As a result, the queue head can safely get the lock without using an
+ *	atomic operation as long as it checks that the wait bit hasn't been
+ *	set. The cpu_relax() function is used after an atomic operation whereas
+ *	arch_mutex_cpu_relax() is used after a read.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+	int		      wset  = false;	/* True if wait bit was set */
+
+	/*
+	 * Fall into the quick spinning code path only if no task is waiting
+	 * in the queue.
+	 */
+	for (; likely(!(qsval >> _QCODE_OFFSET));
+			qsval = atomic_read(&lock->qlcode)) {
+
+		if (unlikely((qsval & _QLOCK_LW_MASK) == _QLOCK_LW_MASK)) {
+			/*
+			 * Wait a while to see if either lock or wait bit
+			 * is cleared. Leave if the condition isn't changed.
+			 */
+			cpu_relax();
+			cpu_relax();
+			qsval = atomic_read(&lock->qlcode);
+			if ((qsval >> _QCODE_OFFSET) ||
+			   ((qsval & _QLOCK_LW_MASK) == _QLOCK_LW_MASK))
+				return 0;
+		}
+		if (unlikely(qsval == 0)) {
+			/*
+			 * Attempt to acquire the lock directly here
+			 */
+			qsval = atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED);
+			if (qsval == 0)
+				return 1;	/* Got the lock */
+			cpu_relax();
+			continue;
+		}
+		if (unlikely(qsval & _QLOCK_WAITING)) {
+			arch_mutex_cpu_relax();
+			wset = true;
+			continue;
+		}
+
+		/*
+		 * If the wait bit has just been cleared, the new lock holder
+		 * should be busy in the critical section. It was found that
+		 * waiting a bit longer before the exchange operation helps
+		 * performance.
+		 */
+		if (unlikely(wset))
+			arch_mutex_cpu_relax();
+
+		if (atomic_cmpxchg(&lock->qlcode, qsval, qsval|_QLOCK_WAITING)
+				  != qsval) {
+			/*
+			 * Another task has got the wait bit before us or
+			 * the queue code has been set. Wait a bit and try
+			 * again.
+			 */
+			cpu_relax();
+			wset = false;
+			continue;
+		}
+
+		/*
+		 * Wait a bit here if the lock bit was set.
+		 */
+		if (unlikely(qsval & _QLOCK_LOCKED))
+			arch_mutex_cpu_relax();
+
+		/*
+		 * Now wait until the lock bit is cleared
+		 */
+		while (smp_load_acquire(&qlock->qlcode) & _QLOCK_LOCKED)
+			arch_mutex_cpu_relax();
+
+		/*
+		 * Set the lock bit & clear the waiting bit simultaneously
+		 * No lock stealing is allowed when this quick path is active.
+		 */
+		ACCESS_ONCE(qlock->lock_wait) = _QLOCK_LOCKED;
+		return 1;
+	}
+	return 0;
+}
+
 #define queue_code_xchg queue_code_xchg
 /**
  * queue_code_xchg - exchange a queue code value
@@ -192,7 +320,7 @@ static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
 {
 	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 
-	return cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0;
+	return cmpxchg(&qlock->lock_wait, 0, _QLOCK_LOCKED) == 0;
 }
 #endif
 #endif /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
@@ -203,6 +331,10 @@ static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
  * that may get superseded by a more optimized version.			*
  ************************************************************************
  */
+#ifndef queue_spin_trylock_quick
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{ return 0; }
+#endif
 
 #ifndef __queue_spin_trylock
 /**
@@ -365,37 +497,20 @@ static inline void put_qnode(void)
 }
 
 /**
- * queue_spin_lock_slowpath - acquire the queue spinlock
- * @lock : Pointer to queue spinlock structure
- * @qsval: Current value of the queue spinlock 32-bit word
+ * queue_spin_lock_slowerpath - put lock waiter in queue & wait for its turn
+ * @lock    : Pointer to queue spinlock structure
+ * @node    : Pointer to the queue node
+ * @my_qcode: Queue code for the queue node
  */
-void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+static noinline void queue_spin_lock_slowerpath(struct qspinlock *lock,
+					struct qnode *node, u32 my_qcode)
 {
-	unsigned int cpu_nr;
-	struct qnode *node, *next;
-	u32 prev_qcode, my_qcode;
+	int qsval;
+	struct qnode *next;
+	u32 prev_qcode;
 	enum exitval exitval;
 
 	/*
-	 * Get the queue node
-	 */
-	cpu_nr = smp_processor_id();
-	node   = get_qnode(cpu_nr, &my_qcode);
-
-	/*
-	 * Initialize the queue node
-	 */
-	node->qhead = false;
-	node->next  = NULL;
-
-	/*
-	 * The lock may be available at this point, try again if no task was
-	 * waiting in the queue.
-	 */
-	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock))
-		goto release_node;
-
-	/*
 	 * Exchange current copy of the queue node code
 	 */
 	exitval = queue_code_xchg(lock, &prev_qcode, my_qcode);
@@ -463,4 +578,47 @@ set_qhead:
 release_node:
 	put_qnode();
 }
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ *
+ * The slowpath was broken into a regular slowpath and a slowerpath. The
+ * slowpath contains the quick spinning code and the slower path contains
+ * the regular lock waiter queuing code. This is to prevent the complexity
+ * of the queuing code from slowing down the quick spinning path due to
+ * compiler optimization.
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	struct qnode *node;
+	u32 my_qcode, cpu_nr;
+
+	/*
+	 * Try the quick spinning code path
+	 */
+	if (queue_spin_trylock_quick(lock, qsval))
+		return;
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(cpu_nr, &my_qcode);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->qhead = false;
+	node->next  = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (likely(!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)))
+		put_qnode();	/* Return the queue node */
+	else
+		queue_spin_lock_slowerpath(lock, node, my_qcode);
+}
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1

Thread overview: 73+ messages
2014-04-02 13:27 [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
2014-04-02 13:27 ` [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation Waiman Long
2014-04-04 13:00   ` Peter Zijlstra
2014-04-04 14:59     ` Waiman Long
2014-04-04 17:53       ` Ingo Molnar
2014-04-07 14:16       ` Peter Zijlstra
2014-04-04 16:57     ` Konrad Rzeszutek Wilk
2014-04-04 17:08       ` Waiman Long
2014-04-04 17:54         ` Ingo Molnar
2014-04-07 14:09         ` Peter Zijlstra
2014-04-07 16:59           ` Waiman Long
2014-04-07 14:12       ` Peter Zijlstra
2014-04-07 14:33         ` Konrad Rzeszutek Wilk
2014-04-02 13:27 ` [PATCH v8 02/10] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
2014-04-02 13:27 ` [PATCH v8 03/10] qspinlock: More optimized code for smaller NR_CPUS Waiman Long
2014-04-02 13:27 ` Waiman Long [this message]
2014-04-02 13:27 ` [PATCH v8 05/10] pvqspinlock, x86: Allow unfair spinlock in a PV guest Waiman Long
2014-04-02 13:27 ` [PATCH v8 06/10] pvqspinlock: Enable lock stealing in queue lock waiters Waiman Long
2014-04-02 13:27 ` [PATCH v8 07/10] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
2014-04-02 13:27 ` [PATCH v8 08/10] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
2014-04-02 13:27 ` [PATCH v8 09/10] pvqspinlock, x86: Enable qspinlock PV support for KVM Waiman Long
2014-04-02 13:27 ` [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN Waiman Long
2014-04-02 14:39   ` Konrad Rzeszutek Wilk
2014-04-02 20:38     ` Waiman Long
2014-04-02 14:32 ` [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support Konrad Rzeszutek Wilk
2014-04-02 20:35   ` Waiman Long
2014-04-03  2:10     ` Waiman Long
2014-04-03 17:23       ` Konrad Rzeszutek Wilk
2014-04-04  2:57         ` Waiman Long
2014-04-04 16:55           ` Konrad Rzeszutek Wilk
2014-04-04 17:13             ` Waiman Long
2014-04-04 17:58               ` Konrad Rzeszutek Wilk
2014-04-04 18:33                 ` Konrad Rzeszutek Wilk
2014-04-04 18:14             ` Marcos E. Matsunaga
2014-04-04 15:25   ` Konrad Rzeszutek Wilk
2014-04-07  6:14 ` Raghavendra K T
2014-04-07 16:38   ` Waiman Long
2014-04-07 17:51     ` Raghavendra K T
2014-04-08 19:15       ` Waiman Long
2014-04-09 12:08         ` Raghavendra K T