- * [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-17 20:41 [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock Waiman Long
@ 2014-02-17 20:41 ` Waiman Long
  2014-02-17 20:41   ` Waiman Long
                     ` (3 more replies)
  2014-02-17 20:41 ` [PATCH v4 2/3] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-17 20:41 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke, Waiman Long
This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. Compared with
the ticket spinlock, this queue spinlock should be almost as fair as
the ticket spinlock. It has about the same speed in single-thread and
it can be much faster in high contention situations. Only in light to
moderate contention where the average queue depth is around 1-3 will
this queue spinlock be potentially a bit slower due to the higher
slowpath overhead.
This queue spinlock is especially suit to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.
The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, interrupt handler can try
to acquire one spinlock while the interrupted user process is in the
process of getting another spinlock. By allocating a set of per-cpu
queue nodes and used them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the 1-byte lock bit, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.
The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)
In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue with warning message.
For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4Ghz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):
  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8
So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that ticket spinlock does a read-modify-write (add)
instruction in unlock whereas queue spinlock only does a simple write
in unlock which can be much faster in a pipelined CPU.
The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere
x86-64 CPUs with XFS filesystem on a ramdisk and HT off to evaluate
the performance impact of this patch on a 3.13 kernel.
  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+
The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:
 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447
There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.
With the patch, the perf profile at 1500 users became:
 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid
On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 ++++++++++++
 include/asm-generic/qspinlock_types.h |   43 ++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  353 +++++++++++++++++++++++++++++++++
 5 files changed, 526 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..d2196ed
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return _QLOCK(lock);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !_QLOCK(&lock);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return _QCODE(lock);
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializier
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..e8cc9ca
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,43 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * The bits assignment are:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QCODE(lock)	(atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
+#define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..df71bfa 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..1a17380 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..fab9819
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,353 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number. We need to add
+ * 1 to it before putting it into the queue code.
+ */
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+#define MAX_QNODES		4
+#define GET_QN_IDX(code)	(((code) >> _QCODE_VAL_OFFSET) & 3)
+#define GET_CPU_NR(code)	(((code) >> (_QCODE_VAL_OFFSET + 2)) - 1)
+#ifndef _SET_QCODE
+#define _SET_QCODE(cpu, idx)	((((cpu) + 1) << (_QCODE_VAL_OFFSET + 2)) |\
+				((idx) << _QCODE_VAL_OFFSET) |\
+				 _QSPINLOCK_LOCKED)
+#endif
+
+struct qnode_set {
+	int		node_idx;	/* Current node to use */
+	struct qnode	nodes[MAX_QNODES];
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
+	= { 0 };
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = GET_CPU_NR(qcode);
+	u8  qn_idx = GET_QN_IDX(qcode);
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+#ifndef queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+#endif /* queue_spin_trylock_unfair */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+	qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+#ifdef queue_spin_trylock_quick
+	/*
+	 * Try the quick spinning code path
+	 */
+	if (queue_spin_trylock_quick(lock, qsval))
+		return;
+#endif
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	if (unlikely(!node)) {
+		/*
+		 * This shouldn't happen, print a warning message
+		 * & busy spinning on the lock.
+		 */
+		printk_sched(
+		  "qspinlock: queue node table exhausted at cpu %d!\n",
+		  cpu_nr);
+		while (!queue_spin_trylock_unfair(lock))
+			arch_mutex_cpu_relax();
+		return;
+	}
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = _SET_QCODE(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+		put_qnode();
+		return;
+	}
+
+#ifdef queue_code_xchg
+	prev_qcode = queue_code_xchg(lock, my_qcode);
+#else
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+	/*
+	 * It is possible that we may accidentally steal the lock. If this is
+	 * the case, we need to either release it if not the head of the queue
+	 * or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue
+			 * Now try to atomically clear the queue code.
+			 */
+			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+					  _QSPINLOCK_LOCKED) == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more tasks
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * Accidentally steal the lock, release the lock and
+		 * let the queue head get it.
+		 */
+		queue_spin_unlock(lock);
+	} else
+		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~_QSPINLOCK_LOCKED;
+#endif /* queue_code_xchg */
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" fields of the that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+		int retval;
+
+		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+		if (retval > 0)
+			;	/* Lock not available yet */
+		else if (retval < 0)
+			/* Lock taken, can release the node & return */
+			goto release_node;
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_trylock_unfair(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in queue set up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 71+ messages in thread
- * [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-17 20:41 ` [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
@ 2014-02-17 20:41   ` Waiman Long
  2014-02-18  7:30   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-17 20:41 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke, Waiman Long
This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. Compared with
the ticket spinlock, this queue spinlock should be almost as fair as
the ticket spinlock. It has about the same speed in single-thread and
it can be much faster in high contention situations. Only in light to
moderate contention where the average queue depth is around 1-3 will
this queue spinlock be potentially a bit slower due to the higher
slowpath overhead.
This queue spinlock is especially suit to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.
The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, interrupt handler can try
to acquire one spinlock while the interrupted user process is in the
process of getting another spinlock. By allocating a set of per-cpu
queue nodes and used them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the 1-byte lock bit, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.
The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)
In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue with warning message.
For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4Ghz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):
  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8
So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that ticket spinlock does a read-modify-write (add)
instruction in unlock whereas queue spinlock only does a simple write
in unlock which can be much faster in a pipelined CPU.
The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere
x86-64 CPUs with XFS filesystem on a ramdisk and HT off to evaluate
the performance impact of this patch on a 3.13 kernel.
  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+
The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:
 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447
There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.
With the patch, the perf profile at 1500 users became:
 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid
On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 ++++++++++++
 include/asm-generic/qspinlock_types.h |   43 ++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  353 +++++++++++++++++++++++++++++++++
 5 files changed, 526 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..d2196ed
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return _QLOCK(lock);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !_QLOCK(&lock);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return _QCODE(lock);
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializier
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..e8cc9ca
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,43 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * The bits assignment are:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QCODE(lock)	(atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
+#define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..df71bfa 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..1a17380 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..fab9819
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,353 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number. We need to add
+ * 1 to it before putting it into the queue code.
+ */
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+#define MAX_QNODES		4
+#define GET_QN_IDX(code)	(((code) >> _QCODE_VAL_OFFSET) & 3)
+#define GET_CPU_NR(code)	(((code) >> (_QCODE_VAL_OFFSET + 2)) - 1)
+#ifndef _SET_QCODE
+#define _SET_QCODE(cpu, idx)	((((cpu) + 1) << (_QCODE_VAL_OFFSET + 2)) |\
+				((idx) << _QCODE_VAL_OFFSET) |\
+				 _QSPINLOCK_LOCKED)
+#endif
+
+struct qnode_set {
+	int		node_idx;	/* Current node to use */
+	struct qnode	nodes[MAX_QNODES];
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
+	= { 0 };
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = GET_CPU_NR(qcode);
+	u8  qn_idx = GET_QN_IDX(qcode);
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+#ifndef queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+#endif /* queue_spin_trylock_unfair */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+	qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+#ifdef queue_spin_trylock_quick
+	/*
+	 * Try the quick spinning code path
+	 */
+	if (queue_spin_trylock_quick(lock, qsval))
+		return;
+#endif
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	if (unlikely(!node)) {
+		/*
+		 * This shouldn't happen, print a warning message
+		 * & busy spinning on the lock.
+		 */
+		printk_sched(
+		  "qspinlock: queue node table exhausted at cpu %d!\n",
+		  cpu_nr);
+		while (!queue_spin_trylock_unfair(lock))
+			arch_mutex_cpu_relax();
+		return;
+	}
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = _SET_QCODE(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+		put_qnode();
+		return;
+	}
+
+#ifdef queue_code_xchg
+	prev_qcode = queue_code_xchg(lock, my_qcode);
+#else
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+	/*
+	 * It is possible that we may accidentally steal the lock. If this is
+	 * the case, we need to either release it if not the head of the queue
+	 * or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue
+			 * Now try to atomically clear the queue code.
+			 */
+			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+					  _QSPINLOCK_LOCKED) == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more tasks
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * Accidentally steal the lock, release the lock and
+		 * let the queue head get it.
+		 */
+		queue_spin_unlock(lock);
+	} else
+		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~_QSPINLOCK_LOCKED;
+#endif /* queue_code_xchg */
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" fields of the that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+		int retval;
+
+		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+		if (retval > 0)
+			;	/* Lock not available yet */
+		else if (retval < 0)
+			/* Lock taken, can release the node & return */
+			goto release_node;
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_trylock_unfair(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in queue set up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-17 20:41 ` [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
  2014-02-17 20:41   ` Waiman Long
@ 2014-02-18  7:30   ` Peter Zijlstra
  2014-02-18 19:29     ` Waiman Long
  2014-02-18  7:33   ` Peter Zijlstra
  2014-02-18  7:39   ` Peter Zijlstra
  3 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18  7:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +/*
> + * The queue node structure
> + *
> + * This structure is essentially the same as the mcs_spinlock structure
> + * in mcs_spinlock.h file. This structure is retained for future extension
> + * where new fields may be added.
> + */
> +struct qnode {
> +	u32		 wait;		/* Waiting flag		*/
> +	struct qnode	*next;		/* Next queue node addr */
> +};
> +
> +struct qnode_set {
> +	int		node_idx;	/* Current node to use */
> +	struct qnode	nodes[MAX_QNODES];
> +};
> +
> +/*
> + * Per-CPU queue node structures
> + */
> +static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
> +	= { 0 };
You really didn't pay attention did you.
That should be DEFINE_PER_CPU_ALIGNED()
Furthermore, your structure is bigger than 1 cacheline; your struct
qnode is 16 bytes, 4*16=64, and then you add that int.
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18  7:30   ` Peter Zijlstra
@ 2014-02-18 19:29     ` Waiman Long
  2014-02-18 19:29       ` Waiman Long
  0 siblings, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 02:30 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +/*
>> + * The queue node structure
>> + *
>> + * This structure is essentially the same as the mcs_spinlock structure
>> + * in mcs_spinlock.h file. This structure is retained for future extension
>> + * where new fields may be added.
>> + */
>> +struct qnode {
>> +	u32		 wait;		/* Waiting flag		*/
>> +	struct qnode	*next;		/* Next queue node addr */
>> +};
>> +
>> +struct qnode_set {
>> +	int		node_idx;	/* Current node to use */
>> +	struct qnode	nodes[MAX_QNODES];
>> +};
>> +
>> +/*
>> + * Per-CPU queue node structures
>> + */
>> +static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
>> +	= { 0 };
> You really didn't pay attention did you.
>
> That should be DEFINE_PER_CPU_ALIGNED()
>
> Furthermore, your structure is bigger than 1 cacheline; your struct
> qnode is 16 bytes, 4*16=64, and then you add that int.
Thank for letting me know about that. This is a minor problem that I 
will fix in the next version.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18 19:29     ` Waiman Long
@ 2014-02-18 19:29       ` Waiman Long
  0 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 02:30 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +/*
>> + * The queue node structure
>> + *
>> + * This structure is essentially the same as the mcs_spinlock structure
>> + * in mcs_spinlock.h file. This structure is retained for future extension
>> + * where new fields may be added.
>> + */
>> +struct qnode {
>> +	u32		 wait;		/* Waiting flag		*/
>> +	struct qnode	*next;		/* Next queue node addr */
>> +};
>> +
>> +struct qnode_set {
>> +	int		node_idx;	/* Current node to use */
>> +	struct qnode	nodes[MAX_QNODES];
>> +};
>> +
>> +/*
>> + * Per-CPU queue node structures
>> + */
>> +static DEFINE_PER_CPU(struct qnode_set, qnset) ____cacheline_aligned
>> +	= { 0 };
> You really didn't pay attention did you.
>
> That should be DEFINE_PER_CPU_ALIGNED()
>
> Furthermore, your structure is bigger than 1 cacheline; your struct
> qnode is 16 bytes, 4*16=64, and then you add that int.
Thank for letting me know about that. This is a minor problem that I 
will fix in the next version.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
 
 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-17 20:41 ` [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
  2014-02-17 20:41   ` Waiman Long
  2014-02-18  7:30   ` Peter Zijlstra
@ 2014-02-18  7:33   ` Peter Zijlstra
  2014-02-18 19:31     ` Waiman Long
  2014-02-18  7:39   ` Peter Zijlstra
  3 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18  7:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +#define	_QCODE(lock)	(atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
> +#define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
> +#define GET_QN_IDX(code)	(((code) >> _QCODE_VAL_OFFSET) & 3)
> +#define GET_CPU_NR(code)	(((code) >> (_QCODE_VAL_OFFSET + 2)) - 1)
> +#ifndef _SET_QCODE
> +#define _SET_QCODE(cpu, idx)	((((cpu) + 1) << (_QCODE_VAL_OFFSET + 2)) |\
> +				((idx) << _QCODE_VAL_OFFSET) |\
Someone please take the CPP away.. dude most this doesn't need to be a
macro. Its used only in 1 or two places and its utterly unreadable.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18  7:33   ` Peter Zijlstra
@ 2014-02-18 19:31     ` Waiman Long
  2014-02-18 19:31       ` Waiman Long
  0 siblings, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 02:33 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +#define	_QCODE(lock)	(atomic_read(&(lock)->qlcode)>>  _QCODE_OFFSET)
>> +#define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode)&  _QSPINLOCK_LOCKED)
>> +#define GET_QN_IDX(code)	(((code)>>  _QCODE_VAL_OFFSET)&  3)
>> +#define GET_CPU_NR(code)	(((code)>>  (_QCODE_VAL_OFFSET + 2)) - 1)
>> +#ifndef _SET_QCODE
>> +#define _SET_QCODE(cpu, idx)	((((cpu) + 1)<<  (_QCODE_VAL_OFFSET + 2)) |\
>> +				((idx)<<  _QCODE_VAL_OFFSET) |\
>
> Someone please take the CPP away.. dude most this doesn't need to be a
> macro. Its used only in 1 or two places and its utterly unreadable.
Yes, I can take them away. It is not a big problem.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18 19:31     ` Waiman Long
@ 2014-02-18 19:31       ` Waiman Long
  0 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 02:33 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +#define	_QCODE(lock)	(atomic_read(&(lock)->qlcode)>>  _QCODE_OFFSET)
>> +#define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode)&  _QSPINLOCK_LOCKED)
>> +#define GET_QN_IDX(code)	(((code)>>  _QCODE_VAL_OFFSET)&  3)
>> +#define GET_CPU_NR(code)	(((code)>>  (_QCODE_VAL_OFFSET + 2)) - 1)
>> +#ifndef _SET_QCODE
>> +#define _SET_QCODE(cpu, idx)	((((cpu) + 1)<<  (_QCODE_VAL_OFFSET + 2)) |\
>> +				((idx)<<  _QCODE_VAL_OFFSET) |\
>
> Someone please take the CPP away.. dude most this doesn't need to be a
> macro. Its used only in 1 or two places and its utterly unreadable.
Yes, I can take them away. It is not a big problem.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-17 20:41 ` [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
                     ` (2 preceding siblings ...)
  2014-02-18  7:33   ` Peter Zijlstra
@ 2014-02-18  7:39   ` Peter Zijlstra
  2014-02-18  7:39     ` Peter Zijlstra
  2014-02-18 19:39     ` Waiman Long
  3 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18  7:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
> +{
> +	unsigned int cpu_nr, qn_idx;
> +	struct qnode *node, *next;
> +	u32 prev_qcode, my_qcode;
> +
> +#ifdef queue_spin_trylock_quick
> +	/*
> +	 * Try the quick spinning code path
> +	 */
> +	if (queue_spin_trylock_quick(lock, qsval))
> +		return;
> +#endif
why oh why?
> +	/*
> +	 * Get the queue node
> +	 */
> +	cpu_nr = smp_processor_id();
> +	node   = get_qnode(&qn_idx);
> +
> +	if (unlikely(!node)) {
> +		/*
> +		 * This shouldn't happen, print a warning message
> +		 * & busy spinning on the lock.
> +		 */
> +		printk_sched(
> +		  "qspinlock: queue node table exhausted at cpu %d!\n",
> +		  cpu_nr);
> +		while (!queue_spin_trylock_unfair(lock))
> +			arch_mutex_cpu_relax();
> +		return;
> +	}
> +
> +	/*
> +	 * Set up the new cpu code to be exchanged
> +	 */
> +	my_qcode = _SET_QCODE(cpu_nr, qn_idx);
> +
> +	/*
> +	 * Initialize the queue node
> +	 */
> +	node->wait = true;
> +	node->next = NULL;
> +
> +	/*
> +	 * The lock may be available at this point, try again if no task was
> +	 * waiting in the queue.
> +	 */
> +	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
> +		put_qnode();
> +		return;
> +	}
> +
> +#ifdef queue_code_xchg
> +	prev_qcode = queue_code_xchg(lock, my_qcode);
> +#else
> +	/*
> +	 * Exchange current copy of the queue node code
> +	 */
> +	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
> +	/*
> +	 * It is possible that we may accidentally steal the lock. If this is
> +	 * the case, we need to either release it if not the head of the queue
> +	 * or get the lock and be done with it.
> +	 */
> +	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
> +		if (prev_qcode == 0) {
> +			/*
> +			 * Got the lock since it is at the head of the queue
> +			 * Now try to atomically clear the queue code.
> +			 */
> +			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
> +					  _QSPINLOCK_LOCKED) == my_qcode)
> +				goto release_node;
> +			/*
> +			 * The cmpxchg fails only if one or more tasks
> +			 * are added to the queue. In this case, we need to
> +			 * notify the next one to be the head of the queue.
> +			 */
> +			goto notify_next;
> +		}
> +		/*
> +		 * Accidentally steal the lock, release the lock and
> +		 * let the queue head get it.
> +		 */
> +		queue_spin_unlock(lock);
> +	} else
> +		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
> +	my_qcode &= ~_QSPINLOCK_LOCKED;
> +#endif /* queue_code_xchg */
WTF is this #ifdef for?
> +	if (prev_qcode) {
> +		/*
> +		 * Not at the queue head, get the address of the previous node
> +		 * and set up the "next" fields of the that node.
> +		 */
> +		struct qnode *prev = xlate_qcode(prev_qcode);
> +
> +		ACCESS_ONCE(prev->next) = node;
> +		/*
> +		 * Wait until the waiting flag is off
> +		 */
> +		while (smp_load_acquire(&node->wait))
> +			arch_mutex_cpu_relax();
> +	}
> +
> +	/*
> +	 * At the head of the wait queue now
> +	 */
> +	while (true) {
> +		u32 qcode;
> +		int retval;
> +
> +		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
> +		if (retval > 0)
> +			;	/* Lock not available yet */
> +		else if (retval < 0)
> +			/* Lock taken, can release the node & return */
> +			goto release_node;
> +		else if (qcode != my_qcode) {
> +			/*
> +			 * Just get the lock with other spinners waiting
> +			 * in the queue.
> +			 */
> +			if (queue_spin_trylock_unfair(lock))
> +				goto notify_next;
Why is this an option at all?
> +		} else {
> +			/*
> +			 * Get the lock & clear the queue code simultaneously
> +			 */
> +			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
> +				/* No need to notify the next one */
> +				goto release_node;
> +		}
> +		arch_mutex_cpu_relax();
> +	}
> +
> +notify_next:
> +	/*
> +	 * Wait, if needed, until the next one in queue set up the next field
> +	 */
> +	while (!(next = ACCESS_ONCE(node->next)))
> +		arch_mutex_cpu_relax();
> +	/*
> +	 * The next one in queue is now at the head
> +	 */
> +	smp_store_release(&next->wait, false);
> +
> +release_node:
> +	put_qnode();
> +}
> +EXPORT_SYMBOL(queue_spin_lock_slowpath);
> -- 
> 1.7.1
> 
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18  7:39   ` Peter Zijlstra
@ 2014-02-18  7:39     ` Peter Zijlstra
  2014-02-18 19:39     ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18  7:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
> +{
> +	unsigned int cpu_nr, qn_idx;
> +	struct qnode *node, *next;
> +	u32 prev_qcode, my_qcode;
> +
> +#ifdef queue_spin_trylock_quick
> +	/*
> +	 * Try the quick spinning code path
> +	 */
> +	if (queue_spin_trylock_quick(lock, qsval))
> +		return;
> +#endif
why oh why?
> +	/*
> +	 * Get the queue node
> +	 */
> +	cpu_nr = smp_processor_id();
> +	node   = get_qnode(&qn_idx);
> +
> +	if (unlikely(!node)) {
> +		/*
> +		 * This shouldn't happen, print a warning message
> +		 * & busy spinning on the lock.
> +		 */
> +		printk_sched(
> +		  "qspinlock: queue node table exhausted at cpu %d!\n",
> +		  cpu_nr);
> +		while (!queue_spin_trylock_unfair(lock))
> +			arch_mutex_cpu_relax();
> +		return;
> +	}
> +
> +	/*
> +	 * Set up the new cpu code to be exchanged
> +	 */
> +	my_qcode = _SET_QCODE(cpu_nr, qn_idx);
> +
> +	/*
> +	 * Initialize the queue node
> +	 */
> +	node->wait = true;
> +	node->next = NULL;
> +
> +	/*
> +	 * The lock may be available at this point, try again if no task was
> +	 * waiting in the queue.
> +	 */
> +	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
> +		put_qnode();
> +		return;
> +	}
> +
> +#ifdef queue_code_xchg
> +	prev_qcode = queue_code_xchg(lock, my_qcode);
> +#else
> +	/*
> +	 * Exchange current copy of the queue node code
> +	 */
> +	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
> +	/*
> +	 * It is possible that we may accidentally steal the lock. If this is
> +	 * the case, we need to either release it if not the head of the queue
> +	 * or get the lock and be done with it.
> +	 */
> +	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
> +		if (prev_qcode == 0) {
> +			/*
> +			 * Got the lock since it is at the head of the queue
> +			 * Now try to atomically clear the queue code.
> +			 */
> +			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
> +					  _QSPINLOCK_LOCKED) == my_qcode)
> +				goto release_node;
> +			/*
> +			 * The cmpxchg fails only if one or more tasks
> +			 * are added to the queue. In this case, we need to
> +			 * notify the next one to be the head of the queue.
> +			 */
> +			goto notify_next;
> +		}
> +		/*
> +		 * Accidentally steal the lock, release the lock and
> +		 * let the queue head get it.
> +		 */
> +		queue_spin_unlock(lock);
> +	} else
> +		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
> +	my_qcode &= ~_QSPINLOCK_LOCKED;
> +#endif /* queue_code_xchg */
WTF is this #ifdef for?
> +	if (prev_qcode) {
> +		/*
> +		 * Not at the queue head, get the address of the previous node
> +		 * and set up the "next" fields of the that node.
> +		 */
> +		struct qnode *prev = xlate_qcode(prev_qcode);
> +
> +		ACCESS_ONCE(prev->next) = node;
> +		/*
> +		 * Wait until the waiting flag is off
> +		 */
> +		while (smp_load_acquire(&node->wait))
> +			arch_mutex_cpu_relax();
> +	}
> +
> +	/*
> +	 * At the head of the wait queue now
> +	 */
> +	while (true) {
> +		u32 qcode;
> +		int retval;
> +
> +		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
> +		if (retval > 0)
> +			;	/* Lock not available yet */
> +		else if (retval < 0)
> +			/* Lock taken, can release the node & return */
> +			goto release_node;
> +		else if (qcode != my_qcode) {
> +			/*
> +			 * Just get the lock with other spinners waiting
> +			 * in the queue.
> +			 */
> +			if (queue_spin_trylock_unfair(lock))
> +				goto notify_next;
Why is this an option at all?
> +		} else {
> +			/*
> +			 * Get the lock & clear the queue code simultaneously
> +			 */
> +			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
> +				/* No need to notify the next one */
> +				goto release_node;
> +		}
> +		arch_mutex_cpu_relax();
> +	}
> +
> +notify_next:
> +	/*
> +	 * Wait, if needed, until the next one in queue set up the next field
> +	 */
> +	while (!(next = ACCESS_ONCE(node->next)))
> +		arch_mutex_cpu_relax();
> +	/*
> +	 * The next one in queue is now at the head
> +	 */
> +	smp_store_release(&next->wait, false);
> +
> +release_node:
> +	put_qnode();
> +}
> +EXPORT_SYMBOL(queue_spin_lock_slowpath);
> -- 
> 1.7.1
> 
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18  7:39   ` Peter Zijlstra
  2014-02-18  7:39     ` Peter Zijlstra
@ 2014-02-18 19:39     ` Waiman Long
  2014-02-18 21:34       ` Peter Zijlstra
  2014-02-18 21:37       ` Peter Zijlstra
  1 sibling, 2 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 02:39 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:22PM -0500, Waiman Long wrote:
>> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>> +{
>> +	unsigned int cpu_nr, qn_idx;
>> +	struct qnode *node, *next;
>> +	u32 prev_qcode, my_qcode;
>> +
>> +#ifdef queue_spin_trylock_quick
>> +	/*
>> +	 * Try the quick spinning code path
>> +	 */
>> +	if (queue_spin_trylock_quick(lock, qsval))
>> +		return;
>> +#endif
> why oh why?
I could take this #ifdef away. I just need to add a default version that 
always return 0.
>> +	/*
>> +	 * Get the queue node
>> +	 */
>> +	cpu_nr = smp_processor_id();
>> +	node   = get_qnode(&qn_idx);
>> +
>> +	if (unlikely(!node)) {
>> +		/*
>> +		 * This shouldn't happen, print a warning message
>> +		 *&  busy spinning on the lock.
>> +		 */
>> +		printk_sched(
>> +		  "qspinlock: queue node table exhausted at cpu %d!\n",
>> +		  cpu_nr);
>> +		while (!queue_spin_trylock_unfair(lock))
>> +			arch_mutex_cpu_relax();
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Set up the new cpu code to be exchanged
>> +	 */
>> +	my_qcode = _SET_QCODE(cpu_nr, qn_idx);
>> +
>> +	/*
>> +	 * Initialize the queue node
>> +	 */
>> +	node->wait = true;
>> +	node->next = NULL;
>> +
>> +	/*
>> +	 * The lock may be available at this point, try again if no task was
>> +	 * waiting in the queue.
>> +	 */
>> +	if (!(qsval>>  _QCODE_OFFSET)&&  queue_spin_trylock(lock)) {
>> +		put_qnode();
>> +		return;
>> +	}
>> +
>> +#ifdef queue_code_xchg
>> +	prev_qcode = queue_code_xchg(lock, my_qcode);
>> +#else
>> +	/*
>> +	 * Exchange current copy of the queue node code
>> +	 */
>> +	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
>> +	/*
>> +	 * It is possible that we may accidentally steal the lock. If this is
>> +	 * the case, we need to either release it if not the head of the queue
>> +	 * or get the lock and be done with it.
>> +	 */
>> +	if (unlikely(!(prev_qcode&  _QSPINLOCK_LOCKED))) {
>> +		if (prev_qcode == 0) {
>> +			/*
>> +			 * Got the lock since it is at the head of the queue
>> +			 * Now try to atomically clear the queue code.
>> +			 */
>> +			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
>> +					  _QSPINLOCK_LOCKED) == my_qcode)
>> +				goto release_node;
>> +			/*
>> +			 * The cmpxchg fails only if one or more tasks
>> +			 * are added to the queue. In this case, we need to
>> +			 * notify the next one to be the head of the queue.
>> +			 */
>> +			goto notify_next;
>> +		}
>> +		/*
>> +		 * Accidentally steal the lock, release the lock and
>> +		 * let the queue head get it.
>> +		 */
>> +		queue_spin_unlock(lock);
>> +	} else
>> +		prev_qcode&= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
>> +	my_qcode&= ~_QSPINLOCK_LOCKED;
>> +#endif /* queue_code_xchg */
> WTF is this #ifdef for?
The #ifdef is harder to take away here. The point is that doing a 32-bit 
exchange may accidentally steal the lock with the additional code to 
handle that. Doing a 16-bit exchange, on the other hand, will never 
steal the lock and so don't need the extra handling code. I could 
construct a function with different return values to handle the 
different cases if you think it will make the code easier to read.
>> +	if (prev_qcode) {
>> +		/*
>> +		 * Not at the queue head, get the address of the previous node
>> +		 * and set up the "next" fields of the that node.
>> +		 */
>> +		struct qnode *prev = xlate_qcode(prev_qcode);
>> +
>> +		ACCESS_ONCE(prev->next) = node;
>> +		/*
>> +		 * Wait until the waiting flag is off
>> +		 */
>> +		while (smp_load_acquire(&node->wait))
>> +			arch_mutex_cpu_relax();
>> +	}
>> +
>> +	/*
>> +	 * At the head of the wait queue now
>> +	 */
>> +	while (true) {
>> +		u32 qcode;
>> +		int retval;
>> +
>> +		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>> +		if (retval>  0)
>> +			;	/* Lock not available yet */
>> +		else if (retval<  0)
>> +			/* Lock taken, can release the node&  return */
>> +			goto release_node;
>> +		else if (qcode != my_qcode) {
>> +			/*
>> +			 * Just get the lock with other spinners waiting
>> +			 * in the queue.
>> +			 */
>> +			if (queue_spin_trylock_unfair(lock))
>> +				goto notify_next;
> Why is this an option at all?
>
>
Are you referring to the case (qcode != my_qcode)? This condition will 
be true if more than one tasks have queued up.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18 19:39     ` Waiman Long
@ 2014-02-18 21:34       ` Peter Zijlstra
  2014-02-19  0:50         ` Waiman Long
  2014-02-18 21:37       ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> The #ifdef is harder to take away here. The point is that doing a 32-bit
> exchange may accidentally steal the lock with the additional code to handle
> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
> and so don't need the extra handling code. I could construct a function with
> different return values to handle the different cases if you think it will
> make the code easier to read.
Does it really pay to use xchg() with all those fixup cases? Why not
have a single cmpxchg() loop that does just the exact atomic op you
want?
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18 21:34       ` Peter Zijlstra
@ 2014-02-19  0:50         ` Waiman Long
  2014-02-19  0:50           ` Waiman Long
  2014-02-19  8:52           ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19  0:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>> The #ifdef is harder to take away here. The point is that doing a 32-bit
>> exchange may accidentally steal the lock with the additional code to handle
>> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
>> and so don't need the extra handling code. I could construct a function with
>> different return values to handle the different cases if you think it will
>> make the code easier to read.
> Does it really pay to use xchg() with all those fixup cases? Why not
> have a single cmpxchg() loop that does just the exact atomic op you
> want?
The main reason for using xchg instead of cmpxchg is its performance 
impact when the lock is heavily contended. Under those circumstances, a 
task may need to do several tries of read+atomic-RMV before getting it 
right. This may cause a lot of cacheline contention. With xchg, we need 
at most 2 atomic ops. Using cmpxchg() does simplify the code a bit at 
the expense of performance with heavy contention.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  0:50         ` Waiman Long
@ 2014-02-19  0:50           ` Waiman Long
  2014-02-19  8:52           ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19  0:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>> The #ifdef is harder to take away here. The point is that doing a 32-bit
>> exchange may accidentally steal the lock with the additional code to handle
>> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
>> and so don't need the extra handling code. I could construct a function with
>> different return values to handle the different cases if you think it will
>> make the code easier to read.
> Does it really pay to use xchg() with all those fixup cases? Why not
> have a single cmpxchg() loop that does just the exact atomic op you
> want?
The main reason for using xchg instead of cmpxchg is its performance 
impact when the lock is heavily contended. Under those circumstances, a 
task may need to do several tries of read+atomic-RMV before getting it 
right. This may cause a lot of cacheline contention. With xchg, we need 
at most 2 atomic ops. Using cmpxchg() does simplify the code a bit at 
the expense of performance with heavy contention.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  0:50         ` Waiman Long
  2014-02-19  0:50           ` Waiman Long
@ 2014-02-19  8:52           ` Peter Zijlstra
  2014-02-19  8:52             ` Peter Zijlstra
  2014-02-19 19:26             ` Waiman Long
  1 sibling, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-19  8:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 07:50:13PM -0500, Waiman Long wrote:
> On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>The #ifdef is harder to take away here. The point is that doing a 32-bit
> >>exchange may accidentally steal the lock with the additional code to handle
> >>that. Doing a 16-bit exchange, on the other hand, will never steal the lock
> >>and so don't need the extra handling code. I could construct a function with
> >>different return values to handle the different cases if you think it will
> >>make the code easier to read.
> >Does it really pay to use xchg() with all those fixup cases? Why not
> >have a single cmpxchg() loop that does just the exact atomic op you
> >want?
> 
> The main reason for using xchg instead of cmpxchg is its performance impact
> when the lock is heavily contended. Under those circumstances, a task may
> need to do several tries of read+atomic-RMV before getting it right. This
> may cause a lot of cacheline contention. With xchg, we need at most 2 atomic
> ops. Using cmpxchg() does simplify the code a bit at the expense of
> performance with heavy contention.
Have you actually measured this?
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  8:52           ` Peter Zijlstra
@ 2014-02-19  8:52             ` Peter Zijlstra
  2014-02-19 19:26             ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-19  8:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 07:50:13PM -0500, Waiman Long wrote:
> On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>The #ifdef is harder to take away here. The point is that doing a 32-bit
> >>exchange may accidentally steal the lock with the additional code to handle
> >>that. Doing a 16-bit exchange, on the other hand, will never steal the lock
> >>and so don't need the extra handling code. I could construct a function with
> >>different return values to handle the different cases if you think it will
> >>make the code easier to read.
> >Does it really pay to use xchg() with all those fixup cases? Why not
> >have a single cmpxchg() loop that does just the exact atomic op you
> >want?
> 
> The main reason for using xchg instead of cmpxchg is its performance impact
> when the lock is heavily contended. Under those circumstances, a task may
> need to do several tries of read+atomic-RMV before getting it right. This
> may cause a lot of cacheline contention. With xchg, we need at most 2 atomic
> ops. Using cmpxchg() does simplify the code a bit at the expense of
> performance with heavy contention.
Have you actually measured this?
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  8:52           ` Peter Zijlstra
  2014-02-19  8:52             ` Peter Zijlstra
@ 2014-02-19 19:26             ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/19/2014 03:52 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:50:13PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:34 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>> The #ifdef is harder to take away here. The point is that doing a 32-bit
>>>> exchange may accidentally steal the lock with the additional code to handle
>>>> that. Doing a 16-bit exchange, on the other hand, will never steal the lock
>>>> and so don't need the extra handling code. I could construct a function with
>>>> different return values to handle the different cases if you think it will
>>>> make the code easier to read.
>>> Does it really pay to use xchg() with all those fixup cases? Why not
>>> have a single cmpxchg() loop that does just the exact atomic op you
>>> want?
>> The main reason for using xchg instead of cmpxchg is its performance impact
>> when the lock is heavily contended. Under those circumstances, a task may
>> need to do several tries of read+atomic-RMV before getting it right. This
>> may cause a lot of cacheline contention. With xchg, we need at most 2 atomic
>> ops. Using cmpxchg() does simplify the code a bit at the expense of
>> performance with heavy contention.
> Have you actually measured this?
I haven't actually measured that myself. It is mostly from my 
experience. I could do some timing experiment with the cmpxchg() change 
and report back to you later.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
 
 
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18 19:39     ` Waiman Long
  2014-02-18 21:34       ` Peter Zijlstra
@ 2014-02-18 21:37       ` Peter Zijlstra
  2014-02-19  0:58         ` Waiman Long
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>+	/*
> >>+	 * At the head of the wait queue now
> >>+	 */
> >>+	while (true) {
> >>+		u32 qcode;
> >>+		int retval;
> >>+
> >>+		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
> >>+		if (retval>  0)
> >>+			;	/* Lock not available yet */
> >>+		else if (retval<  0)
> >>+			/* Lock taken, can release the node&  return */
> >>+			goto release_node;
> >>+		else if (qcode != my_qcode) {
> >>+			/*
> >>+			 * Just get the lock with other spinners waiting
> >>+			 * in the queue.
> >>+			 */
> >>+			if (queue_spin_trylock_unfair(lock))
> >>+				goto notify_next;
> >Why is this an option at all?
> >
> >
> 
> Are you referring to the case (qcode != my_qcode)? This condition will be
> true if more than one tasks have queued up.
But in no case should we revert to unfair spinning or stealing. We
should always respect the queueing order.
If the lock tail no longer points to us, then there's further waiters
and we should wait for ->next and unlock it -- after we've taken the
lock.
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-18 21:37       ` Peter Zijlstra
@ 2014-02-19  0:58         ` Waiman Long
  2014-02-19  0:58           ` Waiman Long
  2014-02-19  8:55           ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19  0:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>> +	/*
>>>> +	 * At the head of the wait queue now
>>>> +	 */
>>>> +	while (true) {
>>>> +		u32 qcode;
>>>> +		int retval;
>>>> +
>>>> +		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>>>> +		if (retval>   0)
>>>> +			;	/* Lock not available yet */
>>>> +		else if (retval<   0)
>>>> +			/* Lock taken, can release the node&   return */
>>>> +			goto release_node;
>>>> +		else if (qcode != my_qcode) {
>>>> +			/*
>>>> +			 * Just get the lock with other spinners waiting
>>>> +			 * in the queue.
>>>> +			 */
>>>> +			if (queue_spin_trylock_unfair(lock))
>>>> +				goto notify_next;
>>> Why is this an option at all?
>>>
>>>
>> Are you referring to the case (qcode != my_qcode)? This condition will be
>> true if more than one tasks have queued up.
> But in no case should we revert to unfair spinning or stealing. We
> should always respect the queueing order.
>
> If the lock tail no longer points to us, then there's further waiters
> and we should wait for ->next and unlock it -- after we've taken the
> lock.
>
A task will be in this loop when it is already the head of a queue and 
is entitled to take the lock. The condition (qcode != my_qcode) is to 
decide whether it should just take the lock or take the lock & clear the 
code simultaneously. I am a bit cautious to use 
queue_spin_trylock_unfair() as there is a possibility that a CPU may run 
out of the queue node and need to do unfair busy spinning.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  0:58         ` Waiman Long
@ 2014-02-19  0:58           ` Waiman Long
  2014-02-19  8:55           ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19  0:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>> +	/*
>>>> +	 * At the head of the wait queue now
>>>> +	 */
>>>> +	while (true) {
>>>> +		u32 qcode;
>>>> +		int retval;
>>>> +
>>>> +		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>>>> +		if (retval>   0)
>>>> +			;	/* Lock not available yet */
>>>> +		else if (retval<   0)
>>>> +			/* Lock taken, can release the node&   return */
>>>> +			goto release_node;
>>>> +		else if (qcode != my_qcode) {
>>>> +			/*
>>>> +			 * Just get the lock with other spinners waiting
>>>> +			 * in the queue.
>>>> +			 */
>>>> +			if (queue_spin_trylock_unfair(lock))
>>>> +				goto notify_next;
>>> Why is this an option at all?
>>>
>>>
>> Are you referring to the case (qcode != my_qcode)? This condition will be
>> true if more than one tasks have queued up.
> But in no case should we revert to unfair spinning or stealing. We
> should always respect the queueing order.
>
> If the lock tail no longer points to us, then there's further waiters
> and we should wait for ->next and unlock it -- after we've taken the
> lock.
>
A task will be in this loop when it is already the head of a queue and 
is entitled to take the lock. The condition (qcode != my_qcode) is to 
decide whether it should just take the lock or take the lock & clear the 
code simultaneously. I am a bit cautious to use 
queue_spin_trylock_unfair() as there is a possibility that a CPU may run 
out of the queue node and need to do unfair busy spinning.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  0:58         ` Waiman Long
  2014-02-19  0:58           ` Waiman Long
@ 2014-02-19  8:55           ` Peter Zijlstra
  2014-02-19 19:30             ` Waiman Long
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-19  8:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 07:58:49PM -0500, Waiman Long wrote:
> On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
> >>>>+	/*
> >>>>+	 * At the head of the wait queue now
> >>>>+	 */
> >>>>+	while (true) {
> >>>>+		u32 qcode;
> >>>>+		int retval;
> >>>>+
> >>>>+		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
> >>>>+		if (retval>   0)
> >>>>+			;	/* Lock not available yet */
> >>>>+		else if (retval<   0)
> >>>>+			/* Lock taken, can release the node&   return */
> >>>>+			goto release_node;
> >>>>+		else if (qcode != my_qcode) {
> >>>>+			/*
> >>>>+			 * Just get the lock with other spinners waiting
> >>>>+			 * in the queue.
> >>>>+			 */
> >>>>+			if (queue_spin_trylock_unfair(lock))
> >>>>+				goto notify_next;
> >>>Why is this an option at all?
> >>>
> >>>
> >>Are you referring to the case (qcode != my_qcode)? This condition will be
> >>true if more than one tasks have queued up.
> >But in no case should we revert to unfair spinning or stealing. We
> >should always respect the queueing order.
> >
> >If the lock tail no longer points to us, then there's further waiters
> >and we should wait for ->next and unlock it -- after we've taken the
> >lock.
> >
> 
> A task will be in this loop when it is already the head of a queue and is
> entitled to take the lock. The condition (qcode != my_qcode) is to decide
> whether it should just take the lock or take the lock & clear the code
> simultaneously. I am a bit cautious to use queue_spin_trylock_unfair() as
> there is a possibility that a CPU may run out of the queue node and need to
> do unfair busy spinning.
No; there is no such possibility. Add BUG_ON(idx>=4) and make sure of
it.
There's simply no more than 4 contexts what can nest at any one time:
  task context
  softirq context
  hardirq context
  nmi context
And someone contending a spinlock from NMI context should be shot
anyway.
Getting more nested spinlocks is an absolute hard fail.
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19  8:55           ` Peter Zijlstra
@ 2014-02-19 19:30             ` Waiman Long
  2014-02-19 19:30               ` Waiman Long
  0 siblings, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-19 19:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/19/2014 03:55 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:58:49PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>>>> +	/*
>>>>>> +	 * At the head of the wait queue now
>>>>>> +	 */
>>>>>> +	while (true) {
>>>>>> +		u32 qcode;
>>>>>> +		int retval;
>>>>>> +
>>>>>> +		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>>>>>> +		if (retval>    0)
>>>>>> +			;	/* Lock not available yet */
>>>>>> +		else if (retval<    0)
>>>>>> +			/* Lock taken, can release the node&    return */
>>>>>> +			goto release_node;
>>>>>> +		else if (qcode != my_qcode) {
>>>>>> +			/*
>>>>>> +			 * Just get the lock with other spinners waiting
>>>>>> +			 * in the queue.
>>>>>> +			 */
>>>>>> +			if (queue_spin_trylock_unfair(lock))
>>>>>> +				goto notify_next;
>>>>> Why is this an option at all?
>>>>>
>>>>>
>>>> Are you referring to the case (qcode != my_qcode)? This condition will be
>>>> true if more than one tasks have queued up.
>>> But in no case should we revert to unfair spinning or stealing. We
>>> should always respect the queueing order.
>>>
>>> If the lock tail no longer points to us, then there's further waiters
>>> and we should wait for ->next and unlock it -- after we've taken the
>>> lock.
>>>
>> A task will be in this loop when it is already the head of a queue and is
>> entitled to take the lock. The condition (qcode != my_qcode) is to decide
>> whether it should just take the lock or take the lock&  clear the code
>> simultaneously. I am a bit cautious to use queue_spin_trylock_unfair() as
>> there is a possibility that a CPU may run out of the queue node and need to
>> do unfair busy spinning.
> No; there is no such possibility. Add BUG_ON(idx>=4) and make sure of
> it.
Yes, I could do that.
However in the generic implementation, I still need some kind of atomic 
cmpxchg to set the lock bit. I could probably just do a simple 
assignment of 1 to the lock byte in x86.
> There's simply no more than 4 contexts what can nest at any one time:
>
>    task context
>    softirq context
>    hardirq context
>    nmi context
>
> And someone contending a spinlock from NMI context should be shot
> anyway.
>
> Getting more nested spinlocks is an absolute hard fail.
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-19 19:30             ` Waiman Long
@ 2014-02-19 19:30               ` Waiman Long
  0 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19 19:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/19/2014 03:55 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:58:49PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:37 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:39:31PM -0500, Waiman Long wrote:
>>>>>> +	/*
>>>>>> +	 * At the head of the wait queue now
>>>>>> +	 */
>>>>>> +	while (true) {
>>>>>> +		u32 qcode;
>>>>>> +		int retval;
>>>>>> +
>>>>>> +		retval = queue_get_lock_qcode(lock,&qcode, my_qcode);
>>>>>> +		if (retval>    0)
>>>>>> +			;	/* Lock not available yet */
>>>>>> +		else if (retval<    0)
>>>>>> +			/* Lock taken, can release the node&    return */
>>>>>> +			goto release_node;
>>>>>> +		else if (qcode != my_qcode) {
>>>>>> +			/*
>>>>>> +			 * Just get the lock with other spinners waiting
>>>>>> +			 * in the queue.
>>>>>> +			 */
>>>>>> +			if (queue_spin_trylock_unfair(lock))
>>>>>> +				goto notify_next;
>>>>> Why is this an option at all?
>>>>>
>>>>>
>>>> Are you referring to the case (qcode != my_qcode)? This condition will be
>>>> true if more than one tasks have queued up.
>>> But in no case should we revert to unfair spinning or stealing. We
>>> should always respect the queueing order.
>>>
>>> If the lock tail no longer points to us, then there's further waiters
>>> and we should wait for ->next and unlock it -- after we've taken the
>>> lock.
>>>
>> A task will be in this loop when it is already the head of a queue and is
>> entitled to take the lock. The condition (qcode != my_qcode) is to decide
>> whether it should just take the lock or take the lock&  clear the code
>> simultaneously. I am a bit cautious to use queue_spin_trylock_unfair() as
>> there is a possibility that a CPU may run out of the queue node and need to
>> do unfair busy spinning.
> No; there is no such possibility. Add BUG_ON(idx>=4) and make sure of
> it.
Yes, I could do that.
However in the generic implementation, I still need some kind of atomic 
cmpxchg to set the lock bit. I could probably just do a simple 
assignment of 1 to the lock byte in x86.
> There's simply no more than 4 contexts what can nest at any one time:
>
>    task context
>    softirq context
>    hardirq context
>    nmi context
>
> And someone contending a spinlock from NMI context should be shot
> anyway.
>
> Getting more nested spinlocks is an absolute hard fail.
^ permalink raw reply	[flat|nested] 71+ messages in thread
 
 
 
 
 
 
 
- * [PATCH v4 2/3] qspinlock, x86: Enable x86-64 to use queue spinlock
  2014-02-17 20:41 [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock Waiman Long
  2014-02-17 20:41 ` [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
@ 2014-02-17 20:41 ` Waiman Long
  2014-02-17 20:41 ` [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
  2014-02-17 22:47 ` [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock H. Peter Anvin
  3 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-17 20:41 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke, Waiman Long
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket. The benefit of queue
spinlock may not be apparent. So queue spinlock is not enabled.
Currently, there is some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.
The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/qspinlock.h      |   55 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/spinlock.h       |    5 +++
 arch/x86/include/asm/spinlock_types.h |    4 ++
 4 files changed, 65 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1b4ff87..5bf70ab 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
 	depends on 64BIT
 	select X86_DEV_DMA_OPS
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUE_SPINLOCK
 
 ### Arch settings
 config X86
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..7316ec4
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,55 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+/*
+ * Queue spinlock union structure to be used within this header file only
+ */
+union qspinlock_x86 {
+	struct qspinlock slock;
+	u8		 lock;	/* Lock bit	*/
+};
+
+#define queue_spin_trylock_unfair queue_spin_trylock_unfair
+/**
+ * queue_spin_trylock_unfair - try to acquire the lock ignoring the qcode
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+	union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+	if (!ACCESS_ONCE(qlock->lock) &&
+	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that the clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+	barrier();
+	ACCESS_ONCE(qlock->lock) = 0;
+	barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..6e6de1f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +185,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 {
 	arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..98da00b 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -11,6 +11,9 @@
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 #if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm/rwlock.h>
 
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 71+ messages in thread
- * [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-17 20:41 [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock Waiman Long
  2014-02-17 20:41 ` [PATCH v4 1/3] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
  2014-02-17 20:41 ` [PATCH v4 2/3] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
@ 2014-02-17 20:41 ` Waiman Long
  2014-02-17 20:41   ` Waiman Long
                     ` (2 more replies)
  2014-02-17 22:47 ` [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock H. Peter Anvin
  3 siblings, 3 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-17 20:41 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke, Waiman Long
A major problem with the queue spinlock patch is its performance at
low contention level (2-4 contending tasks) where it is slower than
the corresponding ticket spinlock code path. The following table shows
the execution time (in ms) of a micro-benchmark where 5M iterations
of the lock/unlock cycles were run on a 10-core Westere-EX CPU with
2 different types loads - standalone (lock and protected data in
different cachelines) and embedded (lock and protected data in the
same cacheline).
		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  135/111	 135/102	  0%/-8%
       2	  732/950	1315/1573	+80%/+66%
       3	 1827/1783	2372/2428	+30%/+36%
       4	 2689/2725	2934/2934	 +9%/+8%
       5	 3736/3748	3658/3652	 -2%/-3%
       6	 4942/4984	4434/4428	-10%/-11%
       7	 6304/6319	5176/5163	-18%/-18%
       8	 7736/7629	5955/5944	-23%/-22%
It can be seen that the performance degradation is particular bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention level, a special x86 specific optimized code path
for 2 contending tasks was added. This special code path will only
be activated with less than 16K of configured CPUs because it uses
a byte in the 32-bit lock word to hold a waiting bit for the 2nd
contending tasks instead of queuing the waiting task in the queue.
With the change, the performance data became:
		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	  732/950	 523/528	-29%/-44%
       3	 1827/1783	2366/2384	+30%/+34%
The queue spinlock code path is now a bit faster with 2 contending
tasks.  There is also a very slight improvement with 3 contending
tasks.
The performance of the optimized code path can vary depending on which
of the several different code paths is taken. It is also not as fair as
the ticket spinlock and some variations on the execution times of the
2 contending tasks.  Testing with different pairs of cores within the
same CPUs shows an execution time that varies from 400ms to 1194ms.
The ticket spinlock code also shows a variation of 718-1146ms which
is probably due to the CPU topology within a socket.
In a multi-socketed server, the optimized code path also seems to
produce a big performance improvement in cross-node contention traffic
at low contention level. The table below show the performance with
1 contending task per node:
		[Standalone]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	   135		 135		  0%
       2	  4452		 528		-88%
       3	 10767		2369		-78%
       4	 20835		2921		-86%
The micro-benchmark was also run on a 4-core Ivy-Bridge PC. The table
below shows the collected performance data:
		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  197/178	  181/150	 -8%/-16%
       2	 1109/928    435-1417/697-2125
       3	 1836/1702  1372-3112/1379-3138
       4	 2717/2429  1842-4158/1846-4170
The performance of the queue lock patch varied from run to run whereas
the performance of the ticket lock was more consistent. The queue
lock figures above were the range of values that were reported.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qspinlock.h      |  183 ++++++++++++++++++++++++++++++++-
 include/asm-generic/qspinlock_types.h |   16 +++-
 2 files changed, 196 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7316ec4..ff205c4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,10 +7,18 @@
 
 /*
  * Queue spinlock union structure to be used within this header file only
+ * Besides the slock and lock fields, the other fields are only valid with
+ * less than 16K CPUs.
  */
 union qspinlock_x86 {
 	struct qspinlock slock;
-	u8		 lock;	/* Lock bit	*/
+	struct {
+		u8  lock;	/* Lock bit	*/
+		u8  wait;	/* Waiting bit	*/
+		u16 qcode;	/* Queue code	*/
+	};
+	u16 lock_wait;		/* Lock and wait bits */
+	int qlcode;
 };
 
 #define queue_spin_trylock_unfair queue_spin_trylock_unfair
@@ -48,6 +56,179 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 	barrier();
 }
 
+#ifndef _Q_MANY_CPUS
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * the x86 architecture:
+ *  1) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ *     for waiting lock acquirer so that it won't need to go through the
+ *     MCS style locking queuing which has a higher overhead.
+ *  2) The 16-bit queue code can be accessed or modified directly as a
+ *     16-bit short value without disturbing the first 2 bytes.
+ */
+#define	_QSPINLOCK_WAITING	0x100U	/* Waiting bit in 2nd byte   */
+#define	_QSPINLOCK_LWMASK	0xffff	/* Mask for lock & wait bits */
+
+/*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET	0
+#define _SET_QCODE(cpu, idx)	(((cpu) + 1) << 2 | (idx))
+
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - fast spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue. This
+ * optimized path is not as fair as the ticket spinlock, but it offers
+ * slightly better performance. The regular MCS locking path for 3 or
+ * more contending tasks, however, is fair.
+ *
+ * Depending on the exact timing, there are several different paths where
+ * a contending task can take. The actual contention performance depends
+ * on which path is taken. So it can be faster or slower than the
+ * corresponding ticket spinlock path. On average, it is probably on par
+ * with ticket spinlock.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+	union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+	u16		     old;
+
+	/*
+	 * Fall into the quick spinning code path only if no one is waiting
+	 * or the lock is available.
+	 */
+	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
+		     (qsval != _QSPINLOCK_WAITING)))
+		return 0;
+
+	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+	if (old == 0) {
+		/*
+		 * Got the lock, can clear the waiting bit now
+		 */
+		ACCESS_ONCE(qlock->wait) = 0;
+		return 1;
+	} else if (old == _QSPINLOCK_LOCKED) {
+try_again:
+		/*
+		 * Wait until the lock byte is cleared to get the lock
+		 */
+		do {
+			cpu_relax();
+		} while (ACCESS_ONCE(qlock->lock));
+		/*
+		 * Set the lock bit & clear the waiting bit
+		 */
+		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+			   _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+			return 1;
+		/*
+		 * Someone has steal the lock, so wait again
+		 */
+		goto try_again;
+	} else if (old == _QSPINLOCK_WAITING) {
+		/*
+		 * Another task is already waiting while it steals the lock.
+		 * A bit of unfairness here won't change the big picture.
+		 * So just take the lock and return.
+		 */
+		return 1;
+	}
+	/*
+	 * Nothing need to be done if the old value is
+	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
+	 */
+	return 0;
+}
+
+#define queue_code_xchg	queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: New queue code to be exchanged
+ * Return: The original qcode value in the queue spinlock
+ */
+static inline u32 queue_code_xchg(struct qspinlock *lock, u32 qcode)
+{
+	union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+	return (u32)xchg(&qlock->qcode, (u16)qcode);
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	qcode <<= _QCODE_OFFSET;
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+
+#define queue_get_lock_qcode queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value
+ * Return : > 0 if lock is not available
+ *	   = 0 if lock is free
+ *	   < 0 if lock is taken & can return after cleanup
+ *
+ * It is considered locked when either the lock bit or the wait bit is set.
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	u32 qlcode;
+
+	qlcode = (u32)atomic_read(&lock->qlcode);
+	/*
+	 * With the special case that qlcode contains only _QSPINLOCK_LOCKED
+	 * and mycode. It will try to transition back to the quick spinning
+	 * code by clearing the qcode and setting the _QSPINLOCK_WAITING
+	 * bit.
+	 */
+	if (qlcode == (_QSPINLOCK_LOCKED | (mycode << _QCODE_OFFSET))) {
+		u32 old = qlcode;
+
+		qlcode = atomic_cmpxchg(&lock->qlcode, old,
+				_QSPINLOCK_LOCKED|_QSPINLOCK_WAITING);
+		if (qlcode == old) {
+			union qspinlock_x86 *slock =
+				(union qspinlock_x86 *)lock;
+try_again:
+			/*
+			 * Wait until the lock byte is cleared
+			 */
+			do {
+				cpu_relax();
+			} while (ACCESS_ONCE(slock->lock));
+			/*
+			 * Set the lock bit & clear the waiting bit
+			 */
+			if (cmpxchg(&slock->lock_wait, _QSPINLOCK_WAITING,
+				    _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+				return -1;	/* Got the lock */
+			goto try_again;
+		}
+	}
+	*qcode = qlcode >> _QCODE_OFFSET;
+	return qlcode & _QSPINLOCK_LWMASK;
+}
+
+#endif /* _Q_MANY_CPUS */
 #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
 
 #include <asm-generic/qspinlock.h>
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index e8cc9ca..486d8fd 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -26,16 +26,28 @@
 /*
  * The queue spinlock data structure - a 32-bit word
  *
- * The bits assignment are:
+ * For NR_CPUS >= 16K, the bits assignment are:
  *   Bit  0   : Set if locked
  *   Bits 1-7 : Not used
  *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bits assignment are:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
  */
 typedef struct qspinlock {
 	atomic_t	qlcode;	/* Lock + queue code */
 } arch_spinlock_t;
 
-#define _QCODE_OFFSET		8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _Q_MANY_CPUS
+# define _QCODE_OFFSET	8
+#else
+# define _QCODE_OFFSET	16
+#endif
+
 #define _QSPINLOCK_LOCKED	1U
 #define	_QCODE(lock)	(atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
 #define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 71+ messages in thread
- * [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-17 20:41 ` [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
@ 2014-02-17 20:41   ` Waiman Long
  2014-02-21 12:12   ` Peter Zijlstra
  2014-02-21 17:28   ` Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-17 20:41 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke, Waiman Long
A major problem with the queue spinlock patch is its performance at
low contention level (2-4 contending tasks) where it is slower than
the corresponding ticket spinlock code path. The following table shows
the execution time (in ms) of a micro-benchmark where 5M iterations
of the lock/unlock cycles were run on a 10-core Westere-EX CPU with
2 different types loads - standalone (lock and protected data in
different cachelines) and embedded (lock and protected data in the
same cacheline).
		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  135/111	 135/102	  0%/-8%
       2	  732/950	1315/1573	+80%/+66%
       3	 1827/1783	2372/2428	+30%/+36%
       4	 2689/2725	2934/2934	 +9%/+8%
       5	 3736/3748	3658/3652	 -2%/-3%
       6	 4942/4984	4434/4428	-10%/-11%
       7	 6304/6319	5176/5163	-18%/-18%
       8	 7736/7629	5955/5944	-23%/-22%
It can be seen that the performance degradation is particular bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention level, a special x86 specific optimized code path
for 2 contending tasks was added. This special code path will only
be activated with less than 16K of configured CPUs because it uses
a byte in the 32-bit lock word to hold a waiting bit for the 2nd
contending tasks instead of queuing the waiting task in the queue.
With the change, the performance data became:
		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	  732/950	 523/528	-29%/-44%
       3	 1827/1783	2366/2384	+30%/+34%
The queue spinlock code path is now a bit faster with 2 contending
tasks.  There is also a very slight improvement with 3 contending
tasks.
The performance of the optimized code path can vary depending on which
of the several different code paths is taken. It is also not as fair as
the ticket spinlock and some variations on the execution times of the
2 contending tasks.  Testing with different pairs of cores within the
same CPUs shows an execution time that varies from 400ms to 1194ms.
The ticket spinlock code also shows a variation of 718-1146ms which
is probably due to the CPU topology within a socket.
In a multi-socketed server, the optimized code path also seems to
produce a big performance improvement in cross-node contention traffic
at low contention level. The table below show the performance with
1 contending task per node:
		[Standalone]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	   135		 135		  0%
       2	  4452		 528		-88%
       3	 10767		2369		-78%
       4	 20835		2921		-86%
The micro-benchmark was also run on a 4-core Ivy-Bridge PC. The table
below shows the collected performance data:
		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  197/178	  181/150	 -8%/-16%
       2	 1109/928    435-1417/697-2125
       3	 1836/1702  1372-3112/1379-3138
       4	 2717/2429  1842-4158/1846-4170
The performance of the queue lock patch varied from run to run whereas
the performance of the ticket lock was more consistent. The queue
lock figures above were the range of values that were reported.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qspinlock.h      |  183 ++++++++++++++++++++++++++++++++-
 include/asm-generic/qspinlock_types.h |   16 +++-
 2 files changed, 196 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7316ec4..ff205c4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,10 +7,18 @@
 
 /*
  * Queue spinlock union structure to be used within this header file only
+ * Besides the slock and lock fields, the other fields are only valid with
+ * less than 16K CPUs.
  */
 union qspinlock_x86 {
 	struct qspinlock slock;
-	u8		 lock;	/* Lock bit	*/
+	struct {
+		u8  lock;	/* Lock bit	*/
+		u8  wait;	/* Waiting bit	*/
+		u16 qcode;	/* Queue code	*/
+	};
+	u16 lock_wait;		/* Lock and wait bits */
+	int qlcode;
 };
 
 #define queue_spin_trylock_unfair queue_spin_trylock_unfair
@@ -48,6 +56,179 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 	barrier();
 }
 
+#ifndef _Q_MANY_CPUS
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * the x86 architecture:
+ *  1) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ *     for waiting lock acquirer so that it won't need to go through the
+ *     MCS style locking queuing which has a higher overhead.
+ *  2) The 16-bit queue code can be accessed or modified directly as a
+ *     16-bit short value without disturbing the first 2 bytes.
+ */
+#define	_QSPINLOCK_WAITING	0x100U	/* Waiting bit in 2nd byte   */
+#define	_QSPINLOCK_LWMASK	0xffff	/* Mask for lock & wait bits */
+
+/*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET	0
+#define _SET_QCODE(cpu, idx)	(((cpu) + 1) << 2 | (idx))
+
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - fast spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue. This
+ * optimized path is not as fair as the ticket spinlock, but it offers
+ * slightly better performance. The regular MCS locking path for 3 or
+ * more contending tasks, however, is fair.
+ *
+ * Depending on the exact timing, there are several different paths where
+ * a contending task can take. The actual contention performance depends
+ * on which path is taken. So it can be faster or slower than the
+ * corresponding ticket spinlock path. On average, it is probably on par
+ * with ticket spinlock.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+	union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+	u16		     old;
+
+	/*
+	 * Fall into the quick spinning code path only if no one is waiting
+	 * or the lock is available.
+	 */
+	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
+		     (qsval != _QSPINLOCK_WAITING)))
+		return 0;
+
+	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+	if (old == 0) {
+		/*
+		 * Got the lock, can clear the waiting bit now
+		 */
+		ACCESS_ONCE(qlock->wait) = 0;
+		return 1;
+	} else if (old == _QSPINLOCK_LOCKED) {
+try_again:
+		/*
+		 * Wait until the lock byte is cleared to get the lock
+		 */
+		do {
+			cpu_relax();
+		} while (ACCESS_ONCE(qlock->lock));
+		/*
+		 * Set the lock bit & clear the waiting bit
+		 */
+		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+			   _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+			return 1;
+		/*
+		 * Someone has steal the lock, so wait again
+		 */
+		goto try_again;
+	} else if (old == _QSPINLOCK_WAITING) {
+		/*
+		 * Another task is already waiting while it steals the lock.
+		 * A bit of unfairness here won't change the big picture.
+		 * So just take the lock and return.
+		 */
+		return 1;
+	}
+	/*
+	 * Nothing need to be done if the old value is
+	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
+	 */
+	return 0;
+}
+
+#define queue_code_xchg	queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: New queue code to be exchanged
+ * Return: The original qcode value in the queue spinlock
+ */
+static inline u32 queue_code_xchg(struct qspinlock *lock, u32 qcode)
+{
+	union qspinlock_x86 *qlock = (union qspinlock_x86 *)lock;
+
+	return (u32)xchg(&qlock->qcode, (u16)qcode);
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	qcode <<= _QCODE_OFFSET;
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+
+#define queue_get_lock_qcode queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value
+ * Return : > 0 if lock is not available
+ *	   = 0 if lock is free
+ *	   < 0 if lock is taken & can return after cleanup
+ *
+ * It is considered locked when either the lock bit or the wait bit is set.
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	u32 qlcode;
+
+	qlcode = (u32)atomic_read(&lock->qlcode);
+	/*
+	 * With the special case that qlcode contains only _QSPINLOCK_LOCKED
+	 * and mycode. It will try to transition back to the quick spinning
+	 * code by clearing the qcode and setting the _QSPINLOCK_WAITING
+	 * bit.
+	 */
+	if (qlcode == (_QSPINLOCK_LOCKED | (mycode << _QCODE_OFFSET))) {
+		u32 old = qlcode;
+
+		qlcode = atomic_cmpxchg(&lock->qlcode, old,
+				_QSPINLOCK_LOCKED|_QSPINLOCK_WAITING);
+		if (qlcode == old) {
+			union qspinlock_x86 *slock =
+				(union qspinlock_x86 *)lock;
+try_again:
+			/*
+			 * Wait until the lock byte is cleared
+			 */
+			do {
+				cpu_relax();
+			} while (ACCESS_ONCE(slock->lock));
+			/*
+			 * Set the lock bit & clear the waiting bit
+			 */
+			if (cmpxchg(&slock->lock_wait, _QSPINLOCK_WAITING,
+				    _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+				return -1;	/* Got the lock */
+			goto try_again;
+		}
+	}
+	*qcode = qlcode >> _QCODE_OFFSET;
+	return qlcode & _QSPINLOCK_LWMASK;
+}
+
+#endif /* _Q_MANY_CPUS */
 #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
 
 #include <asm-generic/qspinlock.h>
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index e8cc9ca..486d8fd 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -26,16 +26,28 @@
 /*
  * The queue spinlock data structure - a 32-bit word
  *
- * The bits assignment are:
+ * For NR_CPUS >= 16K, the bits assignment are:
  *   Bit  0   : Set if locked
  *   Bits 1-7 : Not used
  *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bits assignment are:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
  */
 typedef struct qspinlock {
 	atomic_t	qlcode;	/* Lock + queue code */
 } arch_spinlock_t;
 
-#define _QCODE_OFFSET		8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _Q_MANY_CPUS
+# define _QCODE_OFFSET	8
+#else
+# define _QCODE_OFFSET	16
+#endif
+
 #define _QSPINLOCK_LOCKED	1U
 #define	_QCODE(lock)	(atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
 #define	_QLOCK(lock)	(atomic_read(&(lock)->qlcode) & _QSPINLOCK_LOCKED)
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-17 20:41 ` [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
  2014-02-17 20:41   ` Waiman Long
@ 2014-02-21 12:12   ` Peter Zijlstra
  2014-02-21 17:08     ` Waiman Long
  2014-02-21 17:28   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-21 12:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
Why is this x86 only code?
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-21 12:12   ` Peter Zijlstra
@ 2014-02-21 17:08     ` Waiman Long
  2014-02-21 17:09       ` Waiman Long
  0 siblings, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-21 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
>
> Why is this x86 only code?
The code is making use of the fact that byte write is atomic which is 
true in x86 and probably in a few other architectures. I could pull 
these codes into the generic qspinlock.c file and set a flag in the asm 
header file to activate it if it is what you want.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-21 17:08     ` Waiman Long
@ 2014-02-21 17:09       ` Waiman Long
  2014-02-21 17:26         ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-21 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/21/2014 12:08 PM, Waiman Long wrote:
> On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
>>
>> Why is this x86 only code?
>
> The code is making use of the fact that byte write is atomic which is 
> true in x86 and probably in a few other architectures. I could pull 
> these codes into the generic qspinlock.c file and set a flag in the 
> asm header file to activate it if it is what you want.
>
> -Longman
BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-21 17:09       ` Waiman Long
@ 2014-02-21 17:26         ` Peter Zijlstra
  2014-02-21 17:26           ` Peter Zijlstra
  2014-02-22  1:36           ` Waiman Long
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-21 17:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Fri, Feb 21, 2014 at 12:09:57PM -0500, Waiman Long wrote:
> On 02/21/2014 12:08 PM, Waiman Long wrote:
> >On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
> >>
> >>Why is this x86 only code?
> >
> >The code is making use of the fact that byte write is atomic which is true
> >in x86 and probably in a few other architectures. I could pull these codes
> >into the generic qspinlock.c file and set a flag in the asm header file to
> >activate it if it is what you want.
> >
> >-Longman
> 
> BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.
Right, screw Alpha :-) Just pull it into the generic code; its far too
much code to replicate per arch.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-21 17:26         ` Peter Zijlstra
@ 2014-02-21 17:26           ` Peter Zijlstra
  2014-02-22  1:36           ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-21 17:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Fri, Feb 21, 2014 at 12:09:57PM -0500, Waiman Long wrote:
> On 02/21/2014 12:08 PM, Waiman Long wrote:
> >On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
> >>
> >>Why is this x86 only code?
> >
> >The code is making use of the fact that byte write is atomic which is true
> >in x86 and probably in a few other architectures. I could pull these codes
> >into the generic qspinlock.c file and set a flag in the asm header file to
> >activate it if it is what you want.
> >
> >-Longman
> 
> BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.
Right, screw Alpha :-) Just pull it into the generic code; its far too
much code to replicate per arch.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-21 17:26         ` Peter Zijlstra
  2014-02-21 17:26           ` Peter Zijlstra
@ 2014-02-22  1:36           ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-22  1:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/21/2014 12:26 PM, Peter Zijlstra wrote:
> On Fri, Feb 21, 2014 at 12:09:57PM -0500, Waiman Long wrote:
>> On 02/21/2014 12:08 PM, Waiman Long wrote:
>>> On 02/21/2014 07:12 AM, Peter Zijlstra wrote:
>>>> Why is this x86 only code?
>>> The code is making use of the fact that byte write is atomic which is true
>>> in x86 and probably in a few other architectures. I could pull these codes
>>> into the generic qspinlock.c file and set a flag in the asm header file to
>>> activate it if it is what you want.
>>>
>>> -Longman
>> BTW, I also assume that 8-bit and 16-bit cmpxchg() and xchg() are available.
> Right, screw Alpha :-) Just pull it into the generic code; its far too
> much code to replicate per arch.
OK, I will do that in the next version.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
 
 
 
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-17 20:41 ` [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
  2014-02-17 20:41   ` Waiman Long
  2014-02-21 12:12   ` Peter Zijlstra
@ 2014-02-21 17:28   ` Peter Zijlstra
  2014-02-22  1:39     ` Waiman Long
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-21 17:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 03:41:24PM -0500, Waiman Long wrote:
> +	struct {
> +		u8  lock;	/* Lock bit	*/
> +		u8  wait;	/* Waiting bit	*/
> +		u16 qcode;	/* Queue code	*/
> +	};
16 bit code would result in 14 bits for the CPU number, that's only 16k,
I think SGI actually had a machine with that many CPUs in.
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-21 17:28   ` Peter Zijlstra
@ 2014-02-22  1:39     ` Waiman Long
  0 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-22  1:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/21/2014 12:28 PM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 03:41:24PM -0500, Waiman Long wrote:
>> +	struct {
>> +		u8  lock;	/* Lock bit	*/
>> +		u8  wait;	/* Waiting bit	*/
>> +		u16 qcode;	/* Queue code	*/
>> +	};
> 16 bit code would result in 14 bits for the CPU number, that's only 16k,
> I think SGI actually had a machine with that many CPUs in.
>
If NR_CPUS >= 16k, the 2-task optimized code path will be disabled. 
However, with that many CPUs, it is high spinlock contention behavior 
that is more important than a bit better performance at low contention 
level.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-17 20:41 [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock Waiman Long
                   ` (2 preceding siblings ...)
  2014-02-17 20:41 ` [PATCH v4 3/3] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
@ 2014-02-17 22:47 ` H. Peter Anvin
  2014-02-18  7:31   ` Peter Zijlstra
                     ` (2 more replies)
  3 siblings, 3 replies; 71+ messages in thread
From: H. Peter Anvin @ 2014-02-17 22:47 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner, Ingo Molnar, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On 02/17/2014 12:41 PM, Waiman Long wrote:
> v3->v4:
>  - Remove debugging code and fix a configuration error
>  - Simplify the qspinlock structure and streamline the code to make it
>    perform a bit better
>  - Add an x86 version of asm/qspinlock.h for holding x86 specific
>    optimization.
>  - Add an optimized x86 code path for 2 contending tasks to improve
>    low contention performance.
> 
> v2->v3:
>  - Simplify the code by using numerous mode only without an unfair option.
>  - Use the latest smp_load_acquire()/smp_store_release() barriers.
>  - Move the queue spinlock code to kernel/locking.
>  - Make the use of queue spinlock the default for x86-64 without user
>    configuration.
>  - Additional performance tuning.
> 
> v1->v2:
>  - Add some more comments to document what the code does.
>  - Add a numerous CPU mode to support >= 16K CPUs
>  - Add a configuration option to allow lock stealing which can further
>    improve performance in many cases.
>  - Enable wakeup of queue head CPU at unlock time for non-numerous
>    CPU mode.
> 
> This patch set introduces a queue-based spinlock implementation that
> can replace the default ticket spinlock without increasing the size
> of the spinlock data structure. As a result, critical kernel data
> structures that embed spinlock won't increase in size and breaking
> data alignments.
> 
This is starting to look good, so I have pulled it into
tip:x86/spinlocks to start give it some testing mileage.
	-hpa
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-17 22:47 ` [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock H. Peter Anvin
@ 2014-02-18  7:31   ` Peter Zijlstra
  2014-02-18  7:39     ` H. Peter Anvin
  2014-02-18 19:30     ` Waiman Long
  2014-02-18  7:38   ` Peter Zijlstra
  2014-02-18 19:27   ` Waiman Long
  2 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18  7:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
> On 02/17/2014 12:41 PM, Waiman Long wrote:
> > v3->v4:
> >  - Remove debugging code and fix a configuration error
> >  - Simplify the qspinlock structure and streamline the code to make it
> >    perform a bit better
> >  - Add an x86 version of asm/qspinlock.h for holding x86 specific
> >    optimization.
> >  - Add an optimized x86 code path for 2 contending tasks to improve
> >    low contention performance.
> > 
> > v2->v3:
> >  - Simplify the code by using numerous mode only without an unfair option.
> >  - Use the latest smp_load_acquire()/smp_store_release() barriers.
> >  - Move the queue spinlock code to kernel/locking.
> >  - Make the use of queue spinlock the default for x86-64 without user
> >    configuration.
> >  - Additional performance tuning.
> > 
> > v1->v2:
> >  - Add some more comments to document what the code does.
> >  - Add a numerous CPU mode to support >= 16K CPUs
> >  - Add a configuration option to allow lock stealing which can further
> >    improve performance in many cases.
> >  - Enable wakeup of queue head CPU at unlock time for non-numerous
> >    CPU mode.
> > 
> > This patch set introduces a queue-based spinlock implementation that
> > can replace the default ticket spinlock without increasing the size
> > of the spinlock data structure. As a result, critical kernel data
> > structures that embed spinlock won't increase in size and breaking
> > data alignments.
> > 
> 
> This is starting to look good, so I have pulled it into
> tip:x86/spinlocks to start give it some testing mileage.
It very much needs paravirt muck before we can even consider it.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18  7:31   ` Peter Zijlstra
@ 2014-02-18  7:39     ` H. Peter Anvin
  2014-02-18  7:39       ` H. Peter Anvin
  2014-02-18 19:30     ` Waiman Long
  1 sibling, 1 reply; 71+ messages in thread
From: H. Peter Anvin @ 2014-02-18  7:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/17/2014 11:31 PM, Peter Zijlstra wrote:
>>
>> This is starting to look good, so I have pulled it into
>> tip:x86/spinlocks to start give it some testing mileage.
> 
> It very much needs paravirt muck before we can even consider it.
> 
Ah yes.  Joy ... I so love PV.
	-hpa
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18  7:39     ` H. Peter Anvin
@ 2014-02-18  7:39       ` H. Peter Anvin
  0 siblings, 0 replies; 71+ messages in thread
From: H. Peter Anvin @ 2014-02-18  7:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/17/2014 11:31 PM, Peter Zijlstra wrote:
>>
>> This is starting to look good, so I have pulled it into
>> tip:x86/spinlocks to start give it some testing mileage.
> 
> It very much needs paravirt muck before we can even consider it.
> 
Ah yes.  Joy ... I so love PV.
	-hpa
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18  7:31   ` Peter Zijlstra
  2014-02-18  7:39     ` H. Peter Anvin
@ 2014-02-18 19:30     ` Waiman Long
  2014-02-18 21:28       ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 02:31 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
>> On 02/17/2014 12:41 PM, Waiman Long wrote:
>>> v3->v4:
>>>   - Remove debugging code and fix a configuration error
>>>   - Simplify the qspinlock structure and streamline the code to make it
>>>     perform a bit better
>>>   - Add an x86 version of asm/qspinlock.h for holding x86 specific
>>>     optimization.
>>>   - Add an optimized x86 code path for 2 contending tasks to improve
>>>     low contention performance.
>>>
>>> v2->v3:
>>>   - Simplify the code by using numerous mode only without an unfair option.
>>>   - Use the latest smp_load_acquire()/smp_store_release() barriers.
>>>   - Move the queue spinlock code to kernel/locking.
>>>   - Make the use of queue spinlock the default for x86-64 without user
>>>     configuration.
>>>   - Additional performance tuning.
>>>
>>> v1->v2:
>>>   - Add some more comments to document what the code does.
>>>   - Add a numerous CPU mode to support>= 16K CPUs
>>>   - Add a configuration option to allow lock stealing which can further
>>>     improve performance in many cases.
>>>   - Enable wakeup of queue head CPU at unlock time for non-numerous
>>>     CPU mode.
>>>
>>> This patch set introduces a queue-based spinlock implementation that
>>> can replace the default ticket spinlock without increasing the size
>>> of the spinlock data structure. As a result, critical kernel data
>>> structures that embed spinlock won't increase in size and breaking
>>> data alignments.
>>>
>> This is starting to look good, so I have pulled it into
>> tip:x86/spinlocks to start give it some testing mileage.
> It very much needs paravirt muck before we can even consider it.
I will start looking at how to make it work with paravirt. Hopefully, it 
won't take too long.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18 19:30     ` Waiman Long
@ 2014-02-18 21:28       ` Peter Zijlstra
  2014-02-18 21:28         ` Peter Zijlstra
  2014-02-19  0:42         ` Waiman Long
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> I will start looking at how to make it work with paravirt. Hopefully, it
> won't take too long.
The cheap way out is to simply switch to the test-and-set spinlock on
whatever X86_FEATURE_ indicates a guest I suppose.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18 21:28       ` Peter Zijlstra
@ 2014-02-18 21:28         ` Peter Zijlstra
  2014-02-19  0:42         ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18 21:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> I will start looking at how to make it work with paravirt. Hopefully, it
> won't take too long.
The cheap way out is to simply switch to the test-and-set spinlock on
whatever X86_FEATURE_ indicates a guest I suppose.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18 21:28       ` Peter Zijlstra
  2014-02-18 21:28         ` Peter Zijlstra
@ 2014-02-19  0:42         ` Waiman Long
  2014-02-19  7:09           ` Raghavendra K T
  2014-02-19  8:51           ` Peter Zijlstra
  1 sibling, 2 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19  0:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>> I will start looking at how to make it work with paravirt. Hopefully, it
>> won't take too long.
> The cheap way out is to simply switch to the test-and-set spinlock on
> whatever X86_FEATURE_ indicates a guest I suppose.
I don't think there is X86_FEATURE flag that indicates running in a 
guest. In fact, a guest should never find out if it is running virtualized.
After reading the current PV ticketlock implementation, I have a rough 
idea of what I need to do to implement PV support in qspinlock. A large 
portion of PV ticketlock code is find out the CPU number of the next one 
to get the lock. The current qspinlock implementation has already 
included CPU number of the previous member in the queue and it should be 
pretty easy to store CPU number of the next one in the queue node 
structure. These CPU numbers can then be supplied to the kick_cpu() 
function to schedule in the require the CPU to make sure that progress 
can be made.
I will try to implement this idea to see how thing work out.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19  0:42         ` Waiman Long
@ 2014-02-19  7:09           ` Raghavendra K T
  2014-02-19  8:51           ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Raghavendra K T @ 2014-02-19  7:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/19/2014 06:12 AM, Waiman Long wrote:
> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>> I will start looking at how to make it work with paravirt. Hopefully, it
>>> won't take too long.
>> The cheap way out is to simply switch to the test-and-set spinlock on
>> whatever X86_FEATURE_ indicates a guest I suppose.
>
> I don't think there is X86_FEATURE flag that indicates running in a
> guest. In fact, a guest should never find out if it is running virtualized.
>
> After reading the current PV ticketlock implementation, I have a rough
> idea of what I need to do to implement PV support in qspinlock. A large
> portion of PV ticketlock code is find out the CPU number of the next one
> to get the lock. The current qspinlock implementation has already
> included CPU number of the previous member in the queue and it should be
> pretty easy to store CPU number of the next one in the queue node
> structure. These CPU numbers can then be supplied to the kick_cpu()
> function to schedule in the require the CPU to make sure that progress
> can be made.
That is correct.
Strict serialization of the lock is usually a headache for virtualized
guest (especially when overcommitted). I am eager to test the next
version.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19  0:42         ` Waiman Long
  2014-02-19  7:09           ` Raghavendra K T
@ 2014-02-19  8:51           ` Peter Zijlstra
  2014-02-19 19:24             ` Waiman Long
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-19  8:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> >>I will start looking at how to make it work with paravirt. Hopefully, it
> >>won't take too long.
> >The cheap way out is to simply switch to the test-and-set spinlock on
> >whatever X86_FEATURE_ indicates a guest I suppose.
> 
> I don't think there is X86_FEATURE flag that indicates running in a guest.
> In fact, a guest should never find out if it is running virtualized.
No it very much should; how else is paravirt ever going to work?
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19  8:51           ` Peter Zijlstra
@ 2014-02-19 19:24             ` Waiman Long
  2014-02-19 19:48               ` Peter Zijlstra
                                 ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-19 19:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
>> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>>> I will start looking at how to make it work with paravirt. Hopefully, it
>>>> won't take too long.
>>> The cheap way out is to simply switch to the test-and-set spinlock on
>>> whatever X86_FEATURE_ indicates a guest I suppose.
>> I don't think there is X86_FEATURE flag that indicates running in a guest.
>> In fact, a guest should never find out if it is running virtualized.
> No it very much should; how else is paravirt ever going to work?
We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The 
queue spinlock can be easily changed into an unfair lock which allows 
lock stealing. We could have a config option to make it unfair in the 
PARAVIRT environment, but I don't think Linus like the idea of an unfair 
lock.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19 19:24             ` Waiman Long
@ 2014-02-19 19:48               ` Peter Zijlstra
  2014-02-20 17:37                 ` Waiman Long
  2014-02-19 20:17               ` Linus Torvalds
  2014-02-19 21:29               ` H. Peter Anvin
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-19 19:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Wed, Feb 19, 2014 at 02:24:49PM -0500, Waiman Long wrote:
> On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
> >On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
> >>On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
> >>>On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
> >>>>I will start looking at how to make it work with paravirt. Hopefully, it
> >>>>won't take too long.
> >>>The cheap way out is to simply switch to the test-and-set spinlock on
> >>>whatever X86_FEATURE_ indicates a guest I suppose.
> >>I don't think there is X86_FEATURE flag that indicates running in a guest.
> >>In fact, a guest should never find out if it is running virtualized.
> >No it very much should; how else is paravirt ever going to work?
> 
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows lock
> stealing. We could have a config option to make it unfair in the PARAVIRT
> environment, but I don't think Linus like the idea of an unfair lock.
No; a guest is very much aware of paravirt. See for example the
static_key_false(¶virt_ticketlocks_enabled). It would be impossible
to set that branch if you never knew you were a guest.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19 19:48               ` Peter Zijlstra
@ 2014-02-20 17:37                 ` Waiman Long
  2014-02-20 17:44                   ` Linus Torvalds
  2014-02-20 18:15                   ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-20 17:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/19/2014 02:48 PM, Peter Zijlstra wrote:
> On Wed, Feb 19, 2014 at 02:24:49PM -0500, Waiman Long wrote:
>> On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
>>>> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>>>>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>>>>> I will start looking at how to make it work with paravirt. Hopefully, it
>>>>>> won't take too long.
>>>>> The cheap way out is to simply switch to the test-and-set spinlock on
>>>>> whatever X86_FEATURE_ indicates a guest I suppose.
>>>> I don't think there is X86_FEATURE flag that indicates running in a guest.
>>>> In fact, a guest should never find out if it is running virtualized.
>>> No it very much should; how else is paravirt ever going to work?
>> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
>> queue spinlock can be easily changed into an unfair lock which allows lock
>> stealing. We could have a config option to make it unfair in the PARAVIRT
>> environment, but I don't think Linus like the idea of an unfair lock.
> No; a guest is very much aware of paravirt. See for example the
> static_key_false(¶virt_ticketlocks_enabled). It would be impossible
> to set that branch if you never knew you were a guest.
Yes, that is true for paravirt, but I was talking about virtualization 
in general.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 17:37                 ` Waiman Long
@ 2014-02-20 17:44                   ` Linus Torvalds
  2014-02-20 17:44                     ` Linus Torvalds
  2014-02-20 18:15                   ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Linus Torvalds @ 2014-02-20 17:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton, Thavatchai
On Thu, Feb 20, 2014 at 9:37 AM, Waiman Long <waiman.long@hp.com> wrote:
>
> Yes, that is true for paravirt, but I was talking about virtualization in
> general.
For non-paravirtualized environments, we should basically optimize for
the actual hardware, and assume that the full virtualization is good
enough that we don't care.
Obviously, sometimes we might take "this is particularly expensive to
virtualize" into account, and if it doesn't hurt hardware, try to help
the virtual environment along. So you might have cases where certain
hardware (typically interrupt controllers etc) should be accessed
certain particular ways, because those not only work on real hardware,
but behave well in virtualized environments too.
But for spinlocks, just make the raw hardware work well when not
*actively* paravirtualized. If some hypervisor deals badly with it,
it's a hypervisor or management problem.
            Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 17:44                   ` Linus Torvalds
@ 2014-02-20 17:44                     ` Linus Torvalds
  0 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2014-02-20 17:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On Thu, Feb 20, 2014 at 9:37 AM, Waiman Long <waiman.long@hp.com> wrote:
>
> Yes, that is true for paravirt, but I was talking about virtualization in
> general.
For non-paravirtualized environments, we should basically optimize for
the actual hardware, and assume that the full virtualization is good
enough that we don't care.
Obviously, sometimes we might take "this is particularly expensive to
virtualize" into account, and if it doesn't hurt hardware, try to help
the virtual environment along. So you might have cases where certain
hardware (typically interrupt controllers etc) should be accessed
certain particular ways, because those not only work on real hardware,
but behave well in virtualized environments too.
But for spinlocks, just make the raw hardware work well when not
*actively* paravirtualized. If some hypervisor deals badly with it,
it's a hypervisor or management problem.
            Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 17:37                 ` Waiman Long
  2014-02-20 17:44                   ` Linus Torvalds
@ 2014-02-20 18:15                   ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-20 18:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Thu, Feb 20, 2014 at 12:37:14PM -0500, Waiman Long wrote:
> Yes, that is true for paravirt, but I was talking about virtualization in
> general.
Them we really don't care about :-)
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19 19:24             ` Waiman Long
  2014-02-19 19:48               ` Peter Zijlstra
@ 2014-02-19 20:17               ` Linus Torvalds
  2014-02-19 20:17                 ` Linus Torvalds
  2014-02-20 17:54                 ` Waiman Long
  2014-02-19 21:29               ` H. Peter Anvin
  2 siblings, 2 replies; 71+ messages in thread
From: Linus Torvalds @ 2014-02-19 20:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton, Thavatchai
On Wed, Feb 19, 2014 at 11:24 AM, Waiman Long <waiman.long@hp.com> wrote:
>
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows lock
> stealing. We could have a config option to make it unfair in the PARAVIRT
> environment, but I don't think Linus like the idea of an unfair lock.
I could care less for the paravirt case. As long as the native case
ends up being sane (even when CONFIG_PARAVIRT is set at compile time),
I'm fine.
When actually running in a paravirtualized environment, locks are
always going to have problems.
                  Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19 20:17               ` Linus Torvalds
@ 2014-02-19 20:17                 ` Linus Torvalds
  2014-02-20 17:54                 ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2014-02-19 20:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On Wed, Feb 19, 2014 at 11:24 AM, Waiman Long <waiman.long@hp.com> wrote:
>
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows lock
> stealing. We could have a config option to make it unfair in the PARAVIRT
> environment, but I don't think Linus like the idea of an unfair lock.
I could care less for the paravirt case. As long as the native case
ends up being sane (even when CONFIG_PARAVIRT is set at compile time),
I'm fine.
When actually running in a paravirtualized environment, locks are
always going to have problems.
                  Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19 20:17               ` Linus Torvalds
  2014-02-19 20:17                 ` Linus Torvalds
@ 2014-02-20 17:54                 ` Waiman Long
  2014-02-20 17:54                   ` Waiman Long
                                     ` (2 more replies)
  1 sibling, 3 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-20 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton, Thavatchai
On 02/19/2014 03:17 PM, Linus Torvalds wrote:
> On Wed, Feb 19, 2014 at 11:24 AM, Waiman Long<waiman.long@hp.com>  wrote:
>> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
>> queue spinlock can be easily changed into an unfair lock which allows lock
>> stealing. We could have a config option to make it unfair in the PARAVIRT
>> environment, but I don't think Linus like the idea of an unfair lock.
> I could care less for the paravirt case. As long as the native case
> ends up being sane (even when CONFIG_PARAVIRT is set at compile time),
> I'm fine.
>
> When actually running in a paravirtualized environment, locks are
> always going to have problems.
>
>                    Linus
I think we could implement 2 versions of _raw_spin_lock. The primary one 
is fair and the other one is unfair with a jump label that jump from the 
fair version to the unfair version. At boot time, the kernel can check 
to see it is really running in a PV environment and then activate the 
jump label to use the unfair version.
Now the key is how to detect if a kernel is really running in a PV 
environment. I need to ask some virtualization experts on that.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 17:54                 ` Waiman Long
@ 2014-02-20 17:54                   ` Waiman Long
  2014-02-20 18:42                   ` Linus Torvalds
  2014-02-20 19:32                   ` Raghavendra K T
  2 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-20 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On 02/19/2014 03:17 PM, Linus Torvalds wrote:
> On Wed, Feb 19, 2014 at 11:24 AM, Waiman Long<waiman.long@hp.com>  wrote:
>> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
>> queue spinlock can be easily changed into an unfair lock which allows lock
>> stealing. We could have a config option to make it unfair in the PARAVIRT
>> environment, but I don't think Linus like the idea of an unfair lock.
> I could care less for the paravirt case. As long as the native case
> ends up being sane (even when CONFIG_PARAVIRT is set at compile time),
> I'm fine.
>
> When actually running in a paravirtualized environment, locks are
> always going to have problems.
>
>                    Linus
I think we could implement 2 versions of _raw_spin_lock. The primary one 
is fair and the other one is unfair with a jump label that jump from the 
fair version to the unfair version. At boot time, the kernel can check 
to see it is really running in a PV environment and then activate the 
jump label to use the unfair version.
Now the key is how to detect if a kernel is really running in a PV 
environment. I need to ask some virtualization experts on that.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 17:54                 ` Waiman Long
  2014-02-20 17:54                   ` Waiman Long
@ 2014-02-20 18:42                   ` Linus Torvalds
  2014-02-20 18:42                     ` Linus Torvalds
  2014-02-20 19:21                     ` Waiman Long
  2014-02-20 19:32                   ` Raghavendra K T
  2 siblings, 2 replies; 71+ messages in thread
From: Linus Torvalds @ 2014-02-20 18:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton, Thavatchai
On Thu, Feb 20, 2014 at 9:54 AM, Waiman Long <waiman.long@hp.com> wrote:
>
> I think we could implement 2 versions of _raw_spin_lock.
Yup. Or rather,  I'd suggest implement just one version of
arch_spin_lock(), but at the top of it you do something like
   #if CONFIG_PARAVIRT_SPINLOCK
        if (static_key_false(&unfair_spinlocks)) {
           .. do paravirt unfair lock version ..
        }
   #endif
which should basically generate almost-perfect code:  it's one extra
no-op for the native case if CONFIG_PARAVIRT_SPINLOCK is on, which
turns into a branch for the unfair version for paravirtualization.
Or something like that.
              Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 18:42                   ` Linus Torvalds
@ 2014-02-20 18:42                     ` Linus Torvalds
  2014-02-20 19:21                     ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2014-02-20 18:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On Thu, Feb 20, 2014 at 9:54 AM, Waiman Long <waiman.long@hp.com> wrote:
>
> I think we could implement 2 versions of _raw_spin_lock.
Yup. Or rather,  I'd suggest implement just one version of
arch_spin_lock(), but at the top of it you do something like
   #if CONFIG_PARAVIRT_SPINLOCK
        if (static_key_false(&unfair_spinlocks)) {
           .. do paravirt unfair lock version ..
        }
   #endif
which should basically generate almost-perfect code:  it's one extra
no-op for the native case if CONFIG_PARAVIRT_SPINLOCK is on, which
turns into a branch for the unfair version for paravirtualization.
Or something like that.
              Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 18:42                   ` Linus Torvalds
  2014-02-20 18:42                     ` Linus Torvalds
@ 2014-02-20 19:21                     ` Waiman Long
  2014-02-20 19:21                       ` Waiman Long
  1 sibling, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-20 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton, Thavatchai
On 02/20/2014 01:42 PM, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 9:54 AM, Waiman Long<waiman.long@hp.com>  wrote:
>> I think we could implement 2 versions of _raw_spin_lock.
> Yup. Or rather,  I'd suggest implement just one version of
> arch_spin_lock(), but at the top of it you do something like
>
>     #if CONFIG_PARAVIRT_SPINLOCK
>          if (static_key_false(&unfair_spinlocks)) {
>             .. do paravirt unfair lock version ..
>          }
>     #endif
>
> which should basically generate almost-perfect code:  it's one extra
> no-op for the native case if CONFIG_PARAVIRT_SPINLOCK is on, which
> turns into a branch for the unfair version for paravirtualization.
>
> Or something like that.
>
>                Linus
Yes, this is actually what I meant. The only difference is that I am 
thinking about using a different config variable as PARAVIRT_SPINLOCKS 
actually mean something else.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 19:21                     ` Waiman Long
@ 2014-02-20 19:21                       ` Waiman Long
  0 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-20 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On 02/20/2014 01:42 PM, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 9:54 AM, Waiman Long<waiman.long@hp.com>  wrote:
>> I think we could implement 2 versions of _raw_spin_lock.
> Yup. Or rather,  I'd suggest implement just one version of
> arch_spin_lock(), but at the top of it you do something like
>
>     #if CONFIG_PARAVIRT_SPINLOCK
>          if (static_key_false(&unfair_spinlocks)) {
>             .. do paravirt unfair lock version ..
>          }
>     #endif
>
> which should basically generate almost-perfect code:  it's one extra
> no-op for the native case if CONFIG_PARAVIRT_SPINLOCK is on, which
> turns into a branch for the unfair version for paravirtualization.
>
> Or something like that.
>
>                Linus
Yes, this is actually what I meant. The only difference is that I am 
thinking about using a different config variable as PARAVIRT_SPINLOCKS 
actually mean something else.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread
 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 17:54                 ` Waiman Long
  2014-02-20 17:54                   ` Waiman Long
  2014-02-20 18:42                   ` Linus Torvalds
@ 2014-02-20 19:32                   ` Raghavendra K T
  2014-02-20 19:32                     ` Raghavendra K T
  2014-02-21 17:02                     ` Waiman Long
  2 siblings, 2 replies; 71+ messages in thread
From: Raghavendra K T @ 2014-02-20 19:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai
On 02/20/2014 11:24 PM, Waiman Long wrote:
> Now the key is how to detect if a kernel is really running in a PV
> environment. I need to ask some virtualization experts on that.
>
For kvm you could just follow,
kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
part __do_cpuid_ent().
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 19:32                   ` Raghavendra K T
@ 2014-02-20 19:32                     ` Raghavendra K T
  2014-02-21 17:02                     ` Waiman Long
  1 sibling, 0 replies; 71+ messages in thread
From: Raghavendra K T @ 2014-02-20 19:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/20/2014 11:24 PM, Waiman Long wrote:
> Now the key is how to detect if a kernel is really running in a PV
> environment. I need to ask some virtualization experts on that.
>
For kvm you could just follow,
kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
part __do_cpuid_ent().
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-20 19:32                   ` Raghavendra K T
  2014-02-20 19:32                     ` Raghavendra K T
@ 2014-02-21 17:02                     ` Waiman Long
  2014-02-21 17:02                       ` Waiman Long
  2014-02-21 17:05                       ` Linus Torvalds
  1 sibling, 2 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-21 17:02 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Linus Torvalds, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai
On 02/20/2014 02:32 PM, Raghavendra K T wrote:
> On 02/20/2014 11:24 PM, Waiman Long wrote:
>
>> Now the key is how to detect if a kernel is really running in a PV
>> environment. I need to ask some virtualization experts on that.
>>
>
> For kvm you could just follow,
> kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
> part __do_cpuid_ent().
>
I saw that in the arch/x86/kernel/kvm.c file. So I think I can use the 
returned value of kvm_para_available() to decide if it is running in a 
KVM guest. However, I am not so sure of what to use in xen.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-21 17:02                     ` Waiman Long
@ 2014-02-21 17:02                       ` Waiman Long
  2014-02-21 17:05                       ` Linus Torvalds
  1 sibling, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-21 17:02 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Linus Torvalds, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/20/2014 02:32 PM, Raghavendra K T wrote:
> On 02/20/2014 11:24 PM, Waiman Long wrote:
>
>> Now the key is how to detect if a kernel is really running in a PV
>> environment. I need to ask some virtualization experts on that.
>>
>
> For kvm you could just follow,
> kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
> part __do_cpuid_ent().
>
I saw that in the arch/x86/kernel/kvm.c file. So I think I can use the 
returned value of kvm_para_available() to decide if it is running in a 
KVM guest. However, I am not so sure of what to use in xen.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-21 17:02                     ` Waiman Long
  2014-02-21 17:02                       ` Waiman Long
@ 2014-02-21 17:05                       ` Linus Torvalds
  2014-02-21 17:05                         ` Linus Torvalds
  1 sibling, 1 reply; 71+ messages in thread
From: Linus Torvalds @ 2014-02-21 17:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Raghavendra K T, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai
On Fri, Feb 21, 2014 at 9:02 AM, Waiman Long <waiman.long@hp.com> wrote:
> On 02/20/2014 02:32 PM, Raghavendra K T wrote:
>>
>> On 02/20/2014 11:24 PM, Waiman Long wrote:
>>
>>> Now the key is how to detect if a kernel is really running in a PV
>>> environment. I need to ask some virtualization experts on that.
>>>
>>
>> For kvm you could just follow,
>> kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
>> part __do_cpuid_ent().
>>
>
> I saw that in the arch/x86/kernel/kvm.c file. So I think I can use the
> returned value of kvm_para_available() to decide if it is running in a KVM
> guest. However, I am not so sure of what to use in xen.
Just use that 'paravirt_ticketlocks_enabled' static key. Xen enables it too.
            Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-21 17:05                       ` Linus Torvalds
@ 2014-02-21 17:05                         ` Linus Torvalds
  0 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2014-02-21 17:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Raghavendra K T, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Steven Rostedt, Andrew Morton, Michel Lespinasse, Andi Kleen,
	Rik van Riel, Paul E. McKenney, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Fri, Feb 21, 2014 at 9:02 AM, Waiman Long <waiman.long@hp.com> wrote:
> On 02/20/2014 02:32 PM, Raghavendra K T wrote:
>>
>> On 02/20/2014 11:24 PM, Waiman Long wrote:
>>
>>> Now the key is how to detect if a kernel is really running in a PV
>>> environment. I need to ask some virtualization experts on that.
>>>
>>
>> For kvm you could just follow,
>> kvm_spinlock_init_jump() and KVM_FEATURE_* encoding in cpuid
>> part __do_cpuid_ent().
>>
>
> I saw that in the arch/x86/kernel/kvm.c file. So I think I can use the
> returned value of kvm_para_available() to decide if it is running in a KVM
> guest. However, I am not so sure of what to use in xen.
Just use that 'paravirt_ticketlocks_enabled' static key. Xen enables it too.
            Linus
^ permalink raw reply	[flat|nested] 71+ messages in thread
 
 
 
 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-19 19:24             ` Waiman Long
  2014-02-19 19:48               ` Peter Zijlstra
  2014-02-19 20:17               ` Linus Torvalds
@ 2014-02-19 21:29               ` H. Peter Anvin
  2 siblings, 0 replies; 71+ messages in thread
From: H. Peter Anvin @ 2014-02-19 21:29 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Steven Rostedt, Andrew Morton, Michel Lespinasse,
	Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds,
	Raghavendra K T, George Spelvin, Tim Chen, Daniel J Blueman,
	Alexander Fyodorov, Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On 02/19/2014 11:24 AM, Waiman Long wrote:
> On 02/19/2014 03:51 AM, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2014 at 07:42:20PM -0500, Waiman Long wrote:
>>> On 02/18/2014 04:28 PM, Peter Zijlstra wrote:
>>>> On Tue, Feb 18, 2014 at 02:30:12PM -0500, Waiman Long wrote:
>>>>> I will start looking at how to make it work with paravirt.
>>>>> Hopefully, it
>>>>> won't take too long.
>>>> The cheap way out is to simply switch to the test-and-set spinlock on
>>>> whatever X86_FEATURE_ indicates a guest I suppose.
>>> I don't think there is X86_FEATURE flag that indicates running in a
>>> guest.
>>> In fact, a guest should never find out if it is running virtualized.
>> No it very much should; how else is paravirt ever going to work?
> 
> We do have a CONFIG_PARAVIRT macro that turns on or off PV support. The
> queue spinlock can be easily changed into an unfair lock which allows
> lock stealing. We could have a config option to make it unfair in the
> PARAVIRT environment, but I don't think Linus like the idea of an unfair
> lock.
> 
The case where we run native on a system with CONFIG_PARAVIRT enabled
DOES matter.
	-hpa
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
 
 
 
 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-17 22:47 ` [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock H. Peter Anvin
  2014-02-18  7:31   ` Peter Zijlstra
@ 2014-02-18  7:38   ` Peter Zijlstra
  2014-02-22 16:56     ` Ingo Molnar
  2014-02-18 19:27   ` Waiman Long
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-18  7:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
> This is starting to look good, so I have pulled it into
> tip:x86/spinlocks to start give it some testing mileage.
Its still atrociously ugly code please drop it.
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-18  7:38   ` Peter Zijlstra
@ 2014-02-22 16:56     ` Ingo Molnar
  2014-02-25  3:37       ` Waiman Long
  0 siblings, 1 reply; 71+ messages in thread
From: Ingo Molnar @ 2014-02-22 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Waiman Long, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
> > This is starting to look good, so I have pulled it into
> > tip:x86/spinlocks to start give it some testing mileage.
> 
> Its still atrociously ugly code please drop it.
Fair enough - I've excluded it from tip:master for now, until the 
uglies get resolved.
Waiman: please address PeterZ's feedback.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-22 16:56     ` Ingo Molnar
@ 2014-02-25  3:37       ` Waiman Long
  2014-02-25  3:37         ` Waiman Long
  0 siblings, 1 reply; 71+ messages in thread
From: Waiman Long @ 2014-02-25  3:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On 02/22/2014 11:56 AM, Ingo Molnar wrote:
> * Peter Zijlstra<peterz@infradead.org>  wrote:
>
>> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
>>> This is starting to look good, so I have pulled it into
>>> tip:x86/spinlocks to start give it some testing mileage.
>> Its still atrociously ugly code please drop it.
> Fair enough - I've excluded it from tip:master for now, until the
> uglies get resolved.
>
> Waiman: please address PeterZ's feedback.
>
> Thanks,
>
> 	Ingo
I am trying to address PeterZ's feedback while adding 
para-virtualization support right now. I will have a new version ready 
sometime this week.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-25  3:37       ` Waiman Long
@ 2014-02-25  3:37         ` Waiman Long
  0 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-25  3:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Daniel J Blueman, Alexander Fyodorov,
	Aswin Chandramouleeswaran, Scott J Norton,
	Thavatchai Makphaibulchoke
On 02/22/2014 11:56 AM, Ingo Molnar wrote:
> * Peter Zijlstra<peterz@infradead.org>  wrote:
>
>> On Mon, Feb 17, 2014 at 02:47:03PM -0800, H. Peter Anvin wrote:
>>> This is starting to look good, so I have pulled it into
>>> tip:x86/spinlocks to start give it some testing mileage.
>> Its still atrociously ugly code please drop it.
> Fair enough - I've excluded it from tip:master for now, until the
> uglies get resolved.
>
> Waiman: please address PeterZ's feedback.
>
> Thanks,
>
> 	Ingo
I am trying to address PeterZ's feedback while adding 
para-virtualization support right now. I will have a new version ready 
sometime this week.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread 
 
 
 
- * Re: [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock
  2014-02-17 22:47 ` [PATCH v4 0/3] qspinlock: Introducing a 4-byte queue spinlock H. Peter Anvin
  2014-02-18  7:31   ` Peter Zijlstra
  2014-02-18  7:38   ` Peter Zijlstra
@ 2014-02-18 19:27   ` Waiman Long
  2 siblings, 0 replies; 71+ messages in thread
From: Waiman Long @ 2014-02-18 19:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Ingo Molnar, Arnd Bergmann, linux-arch, x86,
	linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Daniel J Blueman, Alexander Fyodorov, Aswin Chandramouleeswaran,
	Scott J Norton, Thavatchai Makphaibulchoke
On 02/17/2014 05:47 PM, H. Peter Anvin wrote:
> On 02/17/2014 12:41 PM, Waiman Long wrote:
>> v3->v4:
>>   - Remove debugging code and fix a configuration error
>>   - Simplify the qspinlock structure and streamline the code to make it
>>     perform a bit better
>>   - Add an x86 version of asm/qspinlock.h for holding x86 specific
>>     optimization.
>>   - Add an optimized x86 code path for 2 contending tasks to improve
>>     low contention performance.
>>
>> v2->v3:
>>   - Simplify the code by using numerous mode only without an unfair option.
>>   - Use the latest smp_load_acquire()/smp_store_release() barriers.
>>   - Move the queue spinlock code to kernel/locking.
>>   - Make the use of queue spinlock the default for x86-64 without user
>>     configuration.
>>   - Additional performance tuning.
>>
>> v1->v2:
>>   - Add some more comments to document what the code does.
>>   - Add a numerous CPU mode to support>= 16K CPUs
>>   - Add a configuration option to allow lock stealing which can further
>>     improve performance in many cases.
>>   - Enable wakeup of queue head CPU at unlock time for non-numerous
>>     CPU mode.
>>
>> This patch set introduces a queue-based spinlock implementation that
>> can replace the default ticket spinlock without increasing the size
>> of the spinlock data structure. As a result, critical kernel data
>> structures that embed spinlock won't increase in size and breaking
>> data alignments.
>>
> This is starting to look good, so I have pulled it into
> tip:x86/spinlocks to start give it some testing mileage.
>
> 	-hpa
>
>
Thank for the additional testing. Please let me know if you find 
anything wrong.
-Longman
^ permalink raw reply	[flat|nested] 71+ messages in thread