* [RFC][PATCH 0/7] locking: qspinlock
@ 2014-03-10 15:42 Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation Peter Zijlstra
` (8 more replies)
0 siblings, 9 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Peter Zijlstra
Hi Waiman,
I promised you this series a number of days ago; sorry for the delay, I've
been somewhat unwell :/
That said, these few patches start with a (hopefully) simple and correct form
of the queue spinlock, and then gradually build upon it, explaining each
optimization as we go.
Having these optimizations as separate patches helps in two ways; firstly it
makes one aware of which exact optimizations were done, and secondly it allows
one to prove or disprove any one step, seeing how they should be mostly
identity transforms.
The resulting code is near to what you posted, I think; however it has one
fewer atomic op in the pending wait-acquire case for NR_CPUS != huge. It also
doesn't do lock stealing; it's still perfectly fair afaict.
Have I missed any tricks from your code?
The patches apply to tip/master + lkml.kernel.org/r/20140210195820.834693028@infradead.org
I've yet to look at the paravirt stuff.
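For readers who haven't seen an MCS lock before, here is a minimal userspace
sketch of the classic structure the series starts from. It is not part of the
series; it uses C11 atomics instead of kernel primitives and made-up names,
and only shows the 8-byte tail pointer plus the per-node spinning that the
patches below compress into a single 32-bit word:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;			/* set by our predecessor on hand-off */
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;	/* 8 bytes on 64-bit; the cost qspinlock avoids */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, false, memory_order_relaxed);

	/* Enqueue; the previous tail, if any, is our predecessor. */
	prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev) {
		atomic_store_explicit(&prev->next, node, memory_order_release);
		while (!atomic_load_explicit(&node->locked, memory_order_acquire))
			;			/* spin on our own node, not the lock word */
	}
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next = atomic_load_explicit(&node->next, memory_order_acquire);

	if (!next) {
		/* No known successor; try to reset the tail to empty. */
		struct mcs_node *expected = node;

		if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected, NULL,
							    memory_order_release,
							    memory_order_relaxed))
			return;
		/* A successor is enqueueing; wait until it links itself. */
		while (!(next = atomic_load_explicit(&node->next, memory_order_acquire)))
			;
	}
	atomic_store_explicit(&next->locked, true, memory_order_release);
}

int main(void)
{
	struct mcs_lock lock = { .tail = NULL };
	struct mcs_node node;

	mcs_lock_acquire(&lock, &node);
	/* critical section */
	mcs_lock_release(&lock, &node);
	return 0;
}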
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-13 13:07 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 2/7] qspinlock, x86: Enable x86 to use queue spinlock Peter Zijlstra
` (7 subsequent siblings)
8 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Waiman Long, Peter Zijlstra
[-- Attachment #1: waiman_long-qspinlock-introducing_a_4-byte_queue_spinlock_implementation.patch --]
[-- Type: text/plain, Size: 12799 bytes --]
Simple 4-byte MCS-like spinlock implementation.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/asm-generic/qspinlock.h | 113 +++++++++++++++++++
include/asm-generic/qspinlock_types.h | 59 ++++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1
kernel/locking/mcs_spinlock.h | 1
kernel/locking/qspinlock.c | 196 ++++++++++++++++++++++++++++++++++
6 files changed, 377 insertions(+)
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,113 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return atomic_read(&lock->val) & _Q_LOCKED_VAL;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !(atomic_read(&lock.val) & _Q_LOCKED_VAL);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->val) &&
+ (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+ return 1;
+ return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ u32 val;
+
+ val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+ if (likely(val == 0))
+ return;
+ queue_spin_lock_slowpath(lock, val);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * smp_mb__before_atomic_dec() in order to guarantee release semantics
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_Q_LOCKED_VAL, &lock->val);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,59 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here.
+ */
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt_types.h>
+#endif
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <linux/bug.h>
+
+typedef struct qspinlock {
+ atomic_t val;
+} arch_spinlock_t;
+
+/*
+ * Bitfields in the atomic value:
+ *
+ * 0- 7: locked byte
+ * 8- 9: tail index
+ * 10-31: tail cpu (+1)
+ */
+
+#define _Q_LOCKED_OFFSET 0
+#define _Q_LOCKED_BITS 8
+#define _Q_LOCKED_MASK (((1U << _Q_LOCKED_BITS) - 1) << _Q_LOCKED_OFFSET)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_TAIL_IDX_BITS 2
+#define _Q_TAIL_IDX_MASK (((1U << _Q_TAIL_IDX_BITS) - 1) << _Q_TAIL_IDX_OFFSET)
+
+#define _Q_TAIL_CPU_OFFSET (_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
+#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK (((1U << _Q_TAIL_CPU_BITS) - 1) << _Q_TAIL_CPU_OFFSET)
+
+#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -224,6 +224,13 @@ config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT_SPINLOCKS
+
config ARCH_USE_QUEUE_RWLOCK
bool
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,6 +17,7 @@
struct mcs_spinlock {
struct mcs_spinlock *next;
int locked; /* 1 if lock acquired */
+ int count;
};
#ifndef arch_mcs_spin_lock_contended
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,196 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ * Peter Zijlstra <pzijlstr@redhat.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <asm/qspinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock, however to make
+ * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
+ * API, we must modify it some.
+ *
+ * In particular; where the traditional MCS lock consists of a tail pointer
+ * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
+ * unlock the next pending (next->locked), we compress both these: {tail,
+ * next->locked} into a single u32 value.
+ *
+ * Since a spinlock disables recursion of its own context and there is a limit
+ * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
+ * encode the tail as an index indicating this context and a cpu number.
+ *
+ * We can further change the first spinner to spin on a bit in the lock word
+ * instead of its node; thereby avoiding the need to carry a node from lock to
+ * unlock, and preserving API.
+ */
+
+#include "mcs_spinlock.h"
+
+/*
+ * Per-CPU queue node structures; we can never have more than 4 nested
+ * contexts: task, softirq, hardirq, nmi.
+ *
+ * Exactly fits one cacheline.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+
+/*
+ * We must be able to distinguish between no-tail and the tail at 0:0,
+ * therefore increment the cpu number by one.
+ */
+
+static inline u32 encode_tail(int cpu, int idx)
+{
+ u32 tail;
+
+ tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+ tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
+
+ return tail;
+}
+
+static inline struct mcs_spinlock *decode_tail(u32 tail)
+{
+ int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+ int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+ return per_cpu_ptr(&mcs_nodes[idx], cpu);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ * @val: Current value of the queue spinlock 32-bit word
+ *
+ *
+ * fast : slow : unlock
+ * : :
+ * uncontended (0,0) --:--> (0,1) --------------------------------:--> (*,0)
+ * : | ^--------. / :
+ * : v \ | :
+ * uncontended : (n,x) --+--> (n,0) | :
+ * queue : | ^--' | :
+ * : v | :
+ * contended : (*,x) --+--> (*,0) -----> (*,1) ---' :
+ * queue : ^--' :
+ *
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ struct mcs_spinlock *prev, *next, *node;
+ u32 new, old, tail;
+ int idx;
+
+ BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+ node = this_cpu_ptr(&mcs_nodes[0]);
+ idx = node->count++;
+ tail = encode_tail(smp_processor_id(), idx);
+
+ node += idx;
+ node->locked = 0;
+ node->next = NULL;
+
+ /*
+ * trylock || xchg(lock, node)
+ *
+ * 0,0 -> 0,1 ; trylock
+ * p,x -> n,x ; prev = xchg(lock, node)
+ */
+ for (;;) {
+ new = _Q_LOCKED_VAL;
+ if (val)
+ new = tail | (val & _Q_LOCKED_MASK);
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * we won the trylock; forget about queueing.
+ */
+ if (new == _Q_LOCKED_VAL)
+ goto release;
+
+ /*
+ * if there was a previous node; link it and wait.
+ */
+ if (old & ~_Q_LOCKED_MASK) {
+ prev = decode_tail(old);
+ ACCESS_ONCE(prev->next) = node;
+
+ arch_mcs_spin_lock_contended(&node->locked);
+ }
+
+ /*
+ * we're at the head of the waitqueue, wait for the owner to go away.
+ *
+ * *,x -> *,0
+ */
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ cpu_relax();
+
+ /*
+ * claim the lock:
+ *
+ * n,0 -> 0,1 : lock, uncontended
+ * *,0 -> *,1 : lock, contended
+ */
+ for (;;) {
+ new = _Q_LOCKED_VAL;
+ if (val != tail)
+ new |= val;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * contended path; wait for next, release.
+ */
+ if (new != _Q_LOCKED_VAL) {
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+
+ arch_mcs_spin_unlock_contended(&next->locked);
+ }
+
+release:
+ /*
+ * release the node
+ */
+ this_cpu_dec(mcs_nodes[0].count);
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
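As a sanity check of the word layout above (bits 0-7 locked, 8-9 tail index,
10-31 tail cpu + 1), here is a standalone userspace sketch of the
encode/decode round trip; it is an illustration of mine, not kernel code, and
the constants merely mirror the _Q_* defines:

#include <stdio.h>
#include <stdint.h>

#define Q_LOCKED_BITS		8
#define Q_TAIL_IDX_OFFSET	Q_LOCKED_BITS
#define Q_TAIL_IDX_BITS		2
#define Q_TAIL_CPU_OFFSET	(Q_TAIL_IDX_OFFSET + Q_TAIL_IDX_BITS)

/* Mirror of encode_tail(): cpu is stored +1 so "no tail" (0) stays distinguishable. */
static uint32_t encode_tail(int cpu, int idx)
{
	return ((uint32_t)(cpu + 1) << Q_TAIL_CPU_OFFSET) |
	       ((uint32_t)idx << Q_TAIL_IDX_OFFSET);
}

/* Mirror of decode_tail(): mask the index field, then shift it down. */
static void decode_tail(uint32_t tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> Q_TAIL_CPU_OFFSET) - 1;
	*idx = (int)((tail >> Q_TAIL_IDX_OFFSET) & ((1u << Q_TAIL_IDX_BITS) - 1));
}

int main(void)
{
	uint32_t tail = encode_tail(5, 2);	/* cpu 5, hardirq-level node */
	int cpu, idx;

	decode_tail(tail, &cpu, &idx);
	printf("tail=0x%08x -> cpu=%d idx=%d\n", (unsigned)tail, cpu, idx);
	/* prints: tail=0x00001a00 -> cpu=5 idx=2 */
	return 0;
}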
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 2/7] qspinlock, x86: Enable x86 to use queue spinlock
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 3/7] qspinlock: Add pending bit Peter Zijlstra
` (6 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Waiman Long, Peter Zijlstra
[-- Attachment #1: waiman_long-qspinlock_x86-enable_x86-64_to_use_queue_spinlock.patch --]
[-- Type: text/plain, Size: 2865 bytes --]
This patch makes the necessary changes at the x86 architecture-specific
layer to enable the use of the queue spinlock for x86.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/qspinlock.h | 29 +++++++++++++++++++++++++++++
arch/x86/include/asm/spinlock.h | 5 +++++
arch/x86/include/asm/spinlock_types.h | 4 ++++
4 files changed, 39 insertions(+)
create mode 100644 arch/x86/include/asm/qspinlock.h
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -128,6 +128,7 @@ config X86
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
select HAVE_CC_STACKPROTECTOR
+ select ARCH_USE_QUEUE_SPINLOCK
config INSTRUCTION_DECODER
def_bool y
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,29 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ barrier();
+ ACCESS_ONCE(*(u8 *)lock) = 0;
+ barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,10 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS
static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +185,7 @@ static __always_inline void arch_spin_lo
{
arch_spin_lock(lock);
}
+#endif /* CONFIG_QUEUE_SPINLOCK */
static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
#define TICKET_SHIFT (sizeof(__ticket_t) * 8)
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
} arch_spinlock_t;
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
#ifdef CONFIG_QUEUE_RWLOCK
#include <asm/qrwlock.h>
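As a side note on the byte store above: the locked field occupies bits 0-7,
which on little-endian x86 is the first byte of the 32-bit word, so storing 0
there clears only the locked bit and leaves the tail intact, and x86's strong
store ordering makes the plain store sufficient as a release. A tiny
standalone sketch of mine (not kernel code, little-endian host assumed) shows
the effect:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t val = (6u << 10) | (2u << 8) | 1u;	/* tail cpu 5, idx 2, locked */

	printf("before unlock: 0x%08x\n", (unsigned)val);
	*(uint8_t *)&val = 0;				/* the byte-store "unlock" */
	printf("after  unlock: 0x%08x\n", (unsigned)val);	/* 0x00001a00: tail untouched */
	return 0;
}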
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 3/7] qspinlock: Add pending bit
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 2/7] qspinlock, x86: Enable x86 to use queue spinlock Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 4/7] x86: Add atomic_test_and_set_bit() Peter Zijlstra
` (5 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Peter Zijlstra
[-- Attachment #1: peterz-qspinlock-pending.patch --]
[-- Type: text/plain, Size: 5896 bytes --]
Because the qspinlock needs to touch a second cacheline, add a pending
bit and allow a single in-word spinner before we punt to the second
cacheline.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/asm-generic/qspinlock_types.h | 12 +++-
kernel/locking/qspinlock.c | 92 ++++++++++++++++++++++++++++------
2 files changed, 87 insertions(+), 17 deletions(-)
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -38,15 +38,20 @@ typedef struct qspinlock {
* Bitfields in the atomic value:
*
* 0- 7: locked byte
- * 8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ * 9-10: tail index
+ * 11-31: tail cpu (+1)
*/
#define _Q_LOCKED_OFFSET 0
#define _Q_LOCKED_BITS 8
#define _Q_LOCKED_MASK (((1U << _Q_LOCKED_BITS) - 1) << _Q_LOCKED_OFFSET)
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS 1
+#define _Q_PENDING_MASK (((1U << _Q_PENDING_BITS) - 1) << _Q_PENDING_OFFSET)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
#define _Q_TAIL_IDX_BITS 2
#define _Q_TAIL_IDX_MASK (((1U << _Q_TAIL_IDX_BITS) - 1) << _Q_TAIL_IDX_OFFSET)
@@ -55,5 +60,6 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_MASK (((1U << _Q_TAIL_CPU_BITS) - 1) << _Q_TAIL_CPU_OFFSET)
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -83,6 +83,8 @@ static inline struct mcs_spinlock *decod
return per_cpu_ptr(&mcs_nodes[idx], cpu);
}
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
/**
* queue_spin_lock_slowpath - acquire the queue spinlock
* @lock: Pointer to queue spinlock structure
@@ -91,13 +93,16 @@ static inline struct mcs_spinlock *decod
*
* fast : slow : unlock
* : :
- * uncontended (0,0) --:--> (0,1) --------------------------------:--> (*,0)
- * : | ^--------. / :
- * : v \ | :
- * uncontended : (n,x) --+--> (n,0) | :
+ * uncontended (0,0,0) --:--> (0,0,1) ------------------------------:--> (*,*,0)
+ * : | ^--------.------. / :
+ * : v \ \ | :
+ * pending : (0,1,1) +--> (0,1,0) \ | :
+ * : | ^--' | | :
+ * : v | | :
+ * uncontended : (n,x,y) +--> (n,0,0) --' | :
* queue : | ^--' | :
* : v | :
- * contended : (*,x) --+--> (*,0) -----> (*,1) ---' :
+ * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*
*/
@@ -109,6 +114,61 @@ void queue_spin_lock_slowpath(struct qsp
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+ /*
+ * trylock || pending
+ *
+ * 0,0,0 -> 0,0,1 ; trylock
+ * 0,0,1 -> 0,1,1 ; pending
+ */
+ for (;;) {
+ /*
+ * If we observe any contention; queue.
+ */
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+
+ new = _Q_LOCKED_VAL;
+ if (val == new)
+ new |= _Q_PENDING_VAL;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * we won the trylock
+ */
+ if (new == _Q_LOCKED_VAL)
+ return;
+
+ /*
+ * we're pending, wait for the owner to go away.
+ *
+ * *,1,1 -> *,1,0
+ */
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ cpu_relax();
+
+ /*
+ * take ownership and clear the pending bit.
+ *
+ * *,1,0 -> *,0,1
+ */
+ for (;;) {
+ new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+ return;
+
+queue:
node = this_cpu_ptr(&mcs_nodes[0]);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -118,15 +178,18 @@ void queue_spin_lock_slowpath(struct qsp
node->next = NULL;
/*
+ * we already touched the queueing cacheline; don't bother with pending
+ * stuff.
+ *
* trylock || xchg(lock, node)
*
- * 0,0 -> 0,1 ; trylock
- * p,x -> n,x ; prev = xchg(lock, node)
+ * 0,0,0 -> 0,0,1 ; trylock
+ * p,y,x -> n,y,x ; prev = xchg(lock, node)
*/
for (;;) {
new = _Q_LOCKED_VAL;
if (val)
- new = tail | (val & _Q_LOCKED_MASK);
+ new = tail | (val & _Q_LOCKED_PENDING_MASK);
old = atomic_cmpxchg(&lock->val, val, new);
if (old == val)
@@ -144,7 +207,7 @@ void queue_spin_lock_slowpath(struct qsp
/*
* if there was a previous node; link it and wait.
*/
- if (old & ~_Q_LOCKED_MASK) {
+ if (old & ~_Q_LOCKED_PENDING_MASK) {
prev = decode_tail(old);
ACCESS_ONCE(prev->next) = node;
@@ -152,18 +215,19 @@ void queue_spin_lock_slowpath(struct qsp
}
/*
- * we're at the head of the waitqueue, wait for the owner to go away.
+ * we're at the head of the waitqueue, wait for the owner & pending to
+ * go away.
*
- * *,x -> *,0
+ * *,x,y -> *,0,0
*/
- while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
/*
* claim the lock:
*
- * n,0 -> 0,1 : lock, uncontended
- * *,0 -> *,1 : lock, contended
+ * n,0,0 -> 0,0,1 : lock, uncontended
+ * *,0,0 -> *,0,1 : lock, contended
*/
for (;;) {
new = _Q_LOCKED_VAL;
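A userspace model of the trylock-or-pending step added above, using C11
atomics on a 32-bit word with the same (tail, pending, locked) layout; this is
an illustration of mine, not the kernel code, and it leaves out the queueing
path it falls back to:

#include <stdatomic.h>
#include <stdint.h>

#define LOCKED		(1u << 0)
#define PENDING		(1u << 8)
#define LOCKED_MASK	0xffu

/* Returns 1 if the lock was taken outright, 0 if we became the single
 * in-word (pending) spinner, -1 if we have to queue. */
int trylock_or_pend(_Atomic uint32_t *lock)
{
	uint32_t val = atomic_load(lock);

	for (;;) {
		uint32_t new;

		/* Any pending or tail bits set: real contention, go queue. */
		if (val & ~LOCKED_MASK)
			return -1;

		new = LOCKED;			/* 0,0,0 -> 0,0,1 : trylock */
		if (val == new)
			new |= PENDING;		/* 0,0,1 -> 0,1,1 : pending */

		if (atomic_compare_exchange_weak(lock, &val, new))
			return new == LOCKED ? 1 : 0;
		/* CAS failure reloaded val; retry with the fresh value. */
	}
}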
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 4/7] x86: Add atomic_test_and_set_bit()
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
` (2 preceding siblings ...)
2014-03-10 15:42 ` [RFC][PATCH 3/7] qspinlock: Add pending bit Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 5/7] qspinlock: Optimize the pending case Peter Zijlstra
` (4 subsequent siblings)
8 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Peter Zijlstra
[-- Attachment #1: peterz-atomic_set_and_test_bit.patch --]
[-- Type: text/plain, Size: 817 bytes --]
For use in qspinlock because unconditional atomic ops scale better
than cmpxchg loops.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/include/asm/atomic.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -218,6 +218,19 @@ static inline short int atomic_inc_short
return *v;
}
+/**
+ * atomic_test_and_set_bit - Set a bit and return its old value
+ * @nr: Bit to set
+ * @v: Pointer to the atomic_t in which to set the bit
+ *
+ * This operation is atomic and cannot be reordered.
+ * It also implies a memory barrier.
+ */
+static inline int atomic_test_and_set_bit(int nr, atomic_t *v)
+{
+ GEN_BINARY_RMWcc(LOCK_PREFIX "bts", v->counter, "Ir", nr, "%0", "c");
+}
+
#ifdef CONFIG_X86_64
/**
* atomic_or_long - OR of two long integers
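The x86 version above compiles down to a single lock bts. For illustration, a
portable fallback could be built on an atomic fetch-or; the sketch below is
mine (plain C11 atomics rather than the kernel's atomic_t) and only shows the
shape of the operation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Set bit @nr in *v and return its previous value; one unconditional RMW,
 * no cmpxchg retry loop. */
bool test_and_set_bit32(int nr, _Atomic uint32_t *v)
{
	uint32_t mask = 1u << nr;

	return (atomic_fetch_or(v, mask) & mask) != 0;
}

The next patch uses this operation to set the pending bit (bit 8)
unconditionally instead of looping on cmpxchg.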
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 5/7] qspinlock: Optimize the pending case
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
` (3 preceding siblings ...)
2014-03-10 15:42 ` [RFC][PATCH 4/7] x86: Add atomic_test_and_set_bit() Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 6/7] qspinlock: Optimize xchg_tail Peter Zijlstra
` (3 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Peter Zijlstra
[-- Attachment #1: peterz-qspinlock-pending-test_and_set_bit.patch --]
[-- Type: text/plain, Size: 2958 bytes --]
Replace the initial set-pending cmpxchg() loop with an unconditional
test-and-set bit (x86: bts) instruction.
It loses the direct trylock state transition; however, since that should
be very unlikely (we've just done a trylock) that shouldn't be a
problem.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/asm-generic/qspinlock_types.h | 2 +
kernel/locking/qspinlock.c | 60 +++++++++++++++++++---------------
2 files changed, 36 insertions(+), 26 deletions(-)
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -59,6 +59,8 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK (((1U << _Q_TAIL_CPU_BITS) - 1) << _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -83,6 +83,37 @@ static inline struct mcs_spinlock *decod
return per_cpu_ptr(&mcs_nodes[idx], cpu);
}
+/*
+ * 0,0,1 -> 0,1,* ; pending
+ *
+ * Ignore the locked bit; if we set pending and locked happens to be clear
+ * we'll fall through on the subsequent wait.
+ */
+static int __always_inline
+try_set_pending(struct qspinlock *lock, u32 val)
+{
+ if (val & ~_Q_LOCKED_MASK)
+ return 0; /* fail; queue */
+
+ /*
+ * If we find the pending bit was already set; fail and queue.
+ */
+ if (atomic_test_and_set_bit(_Q_PENDING_OFFSET, &lock->val))
+ return 0;
+
+ /*
+ * If we raced and someone concurrently set the tail; no problem. He
+ * need not have observed our pending bit and can have claimed the
+ * lock.
+ *
+ * The next node in line, however, will wait for the pending bit to go away
+ * again, so in effect we've just flipped the order between two
+ * contenders which already had undetermined order as per the race.
+ */
+
+ return 1;
+}
+
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
/**
@@ -115,34 +146,10 @@ void queue_spin_lock_slowpath(struct qsp
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
/*
- * trylock || pending
- *
- * 0,0,0 -> 0,0,1 ; trylock
* 0,0,1 -> 0,1,1 ; pending
*/
- for (;;) {
- /*
- * If we observe any contention; queue.
- */
- if (val & ~_Q_LOCKED_MASK)
- goto queue;
-
- new = _Q_LOCKED_VAL;
- if (val == new)
- new |= _Q_PENDING_VAL;
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
-
- val = old;
- }
-
- /*
- * we won the trylock
- */
- if (new == _Q_LOCKED_VAL)
- return;
+ if (!try_set_pending(lock, val))
+ goto queue;
/*
* we're pending, wait for the owner to go away.
@@ -186,6 +193,7 @@ void queue_spin_lock_slowpath(struct qsp
* 0,0,0 -> 0,0,1 ; trylock
* p,y,x -> n,y,x ; prev = xchg(lock, node)
*/
+ val = atomic_read(&lock->val);
for (;;) {
new = _Q_LOCKED_VAL;
if (val)
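To make the race argument concrete, here is a userspace model of
try_set_pending() built on a fetch-or style test-and-set; it is an
illustration of mine, not the kernel code:

#include <stdatomic.h>
#include <stdint.h>

#define LOCKED_MASK	0xffu
#define PENDING_BIT	8

/* 0,0,1 -> 0,1,* ; returns 1 if we are now the single in-word spinner,
 * 0 if we should queue instead. */
int try_set_pending(_Atomic uint32_t *lock, uint32_t val)
{
	if (val & ~LOCKED_MASK)
		return 0;	/* tail or pending already set: queue */

	/* Unconditional RMW ("lock bts" on x86); if the pending bit was
	 * already set by a racing waiter, back off to the queue. */
	if (atomic_fetch_or(lock, 1u << PENDING_BIT) & (1u << PENDING_BIT))
		return 0;

	return 1;
}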
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 6/7] qspinlock: Optimize xchg_tail
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
` (4 preceding siblings ...)
2014-03-10 15:42 ` [RFC][PATCH 5/7] qspinlock: Optimize the pending case Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 7/7] qspinlock: Optimize for smaller NR_CPUS Peter Zijlstra
` (2 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Peter Zijlstra
[-- Attachment #1: peterz-qspinlock-queue-xchg.patch --]
[-- Type: text/plain, Size: 4272 bytes --]
Replace the xchg(lock, tail) cmpxchg loop with an actual xchg().
This is a little tricky in that we need to preserve the
(locked,pending) state that is also in this word.
By making sure all wait-acquire loops re-start when they observe the lock
wasn't in fact free after all, we can have the initial xchg() op set
the lock bit unconditionally.
This means that even if we raced with an unlock and the wait loop has
already terminated, the acquire will stall and we can fix up any state
we wrecked with the xchg().
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
kernel/locking/qspinlock.c | 88 +++++++++++++++++++++++++++++++++------------
1 file changed, 65 insertions(+), 23 deletions(-)
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -115,6 +115,50 @@ try_set_pending(struct qspinlock *lock,
return 1;
}
+/*
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static u32 __always_inline
+xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ u32 old, new, val = atomic_read(&lock->val);
+
+ /*
+ * guess the pending bit; if we race no harm done because we'll
+ * unconditionally set the locked bit, so we can always fix up.
+ *
+ * we must always assume the lock bit is set; accidentally clearing it
+ * would release pending waiters and screw up our ordering.
+ */
+ new = tail | (val & _Q_PENDING_MASK) | _Q_LOCKED_VAL;
+ old = atomic_xchg(&lock->val, new);
+
+ /*
+ * fixup the pending bit; since we now have a tail the pending bit is
+ * stable, see try_set_pending(), and since we have the locked bit set,
+ * nothing can claim the lock and make progress.
+ */
+ if (unlikely((old & _Q_PENDING_MASK) != (new & _Q_PENDING_MASK))) {
+ if (old & _Q_PENDING_MASK)
+ atomic_set_mask(_Q_PENDING_VAL, &lock->val);
+ else
+ atomic_clear_mask(_Q_PENDING_MASK, &lock->val);
+ }
+
+ /*
+ * fixup the locked bit; having accidentally set the locked bit
+ * we must make sure our wait-acquire loops are robust and not
+ * set the locked bit when it's already set, otherwise we cannot
+ * release here.
+ */
+ if (unlikely(!(old & _Q_LOCKED_MASK)))
+ queue_spin_unlock(lock);
+
+ return old; /* tail bits are still fine */
+}
+
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
/**
@@ -155,6 +199,7 @@ void queue_spin_lock_slowpath(struct qsp
*
* *,1,1 -> *,1,0
*/
+retry_pending_wait:
while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
cpu_relax();
@@ -170,6 +215,9 @@ void queue_spin_lock_slowpath(struct qsp
if (old == val)
break;
+ if (unlikely(old & _Q_LOCKED_MASK))
+ goto retry_pending_wait;
+
val = old;
}
return;
@@ -184,37 +232,24 @@ void queue_spin_lock_slowpath(struct qsp
node->next = NULL;
/*
- * we already touched the queueing cacheline; don't bother with pending
- * stuff.
- *
- * trylock || xchg(lock, node)
- *
- * 0,0,0 -> 0,0,1 ; trylock
- * p,y,x -> n,y,x ; prev = xchg(lock, node)
+ * we touched a (possibly) cold cacheline; attempt the trylock once
+ * more in the hope someone let go while we weren't watching.
*/
- val = atomic_read(&lock->val);
- for (;;) {
- new = _Q_LOCKED_VAL;
- if (val)
- new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
-
- val = old;
- }
+ if (queue_spin_trylock(lock))
+ goto release;
/*
- * we won the trylock; forget about queueing.
+ * we already touched the queueing cacheline; don't bother with pending
+ * stuff.
+ *
+ * p,*,* -> n,*,*
*/
- if (new == _Q_LOCKED_VAL)
- goto release;
+ old = xchg_tail(lock, tail);
/*
* if there was a previous node; link it and wait.
*/
- if (old & ~_Q_LOCKED_PENDING_MASK) {
+ if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
ACCESS_ONCE(prev->next) = node;
@@ -227,6 +262,7 @@ void queue_spin_lock_slowpath(struct qsp
*
* *,x,y -> *,0,0
*/
+retry_queue_wait:
while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
@@ -245,6 +281,12 @@ void queue_spin_lock_slowpath(struct qsp
if (old == val)
break;
+ /*
+ * ensure to never claim a locked value; see xchg_tail().
+ */
+ if (unlikely(old & _Q_LOCKED_PENDING_MASK))
+ goto retry_queue_wait;
+
val = old;
}
^ permalink raw reply [flat|nested] 34+ messages in thread
* [RFC][PATCH 7/7] qspinlock: Optimize for smaller NR_CPUS
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
` (5 preceding siblings ...)
2014-03-10 15:42 ` [RFC][PATCH 6/7] qspinlock: Optimize xchg_tail Peter Zijlstra
@ 2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-11 10:45 ` [RFC][PATCH 0/7] locking: qspinlock Ingo Molnar
2014-03-12 2:31 ` Dave Chinner
8 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-10 15:42 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg, Peter Zijlstra
[-- Attachment #1: peterz-qspinlock-pending-byte.patch --]
[-- Type: text/plain, Size: 4633 bytes --]
When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.
By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the trickiness of having to fix up the (pending,locked) state.
This in turn allows us to unconditionally acquire; the locked state as
observed by the wait loops cannot change. And because both locked and
pending are now a full byte we can use simple ordered stores for the
state transition, obviating one atomic operation entirely.
All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).
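For reference, a sketch of the resulting bitfield layout (as implied by the
_Q_* definitions in the hunk below), in the style of the layout comment in
qspinlock_types.h:
/*
 *  0- 7: locked byte
 *  8-15: pending byte
 * 16-17: tail index
 * 18-31: tail cpu (+1)
 */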
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/asm-generic/qspinlock_types.h | 5 +
kernel/locking/qspinlock.c | 105 ++++++++++++++++++++++++++++++----
2 files changed, 98 insertions(+), 12 deletions(-)
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -48,7 +48,11 @@ typedef struct qspinlock {
#define _Q_LOCKED_MASK (((1U << _Q_LOCKED_BITS) - 1) << _Q_LOCKED_OFFSET)
#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS 8
+#else
#define _Q_PENDING_BITS 1
+#endif
#define _Q_PENDING_MASK (((1U << _Q_PENDING_BITS) - 1) << _Q_PENDING_OFFSET)
#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -59,6 +63,7 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK (((1U << _Q_TAIL_CPU_BITS) - 1) << _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -114,6 +114,89 @@ try_set_pending(struct qspinlock *lock,
return 1;
}
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+ union {
+ atomic_t val;
+ struct {
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ u8 locked;
+ u8 pending;
+ u16 tail;
+#else
+ u16 tail;
+ u8 pending;
+ u8 locked;
+#endif
+ };
+ };
+};
+
+/*
+ * take ownership and clear the pending bit.
+ *
+ * *,1,0 -> *,0,1
+ */
+static int __always_inline
+try_clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ ACCESS_ONCE(l->locked) = 1;
+ /*
+ * we must order the stores of locked and pending such that the
+ * (locked,pending) tuple never observably becomes 0.
+ *
+ * 'matched' by the queue wait loop.
+ */
+ smp_wmb();
+ ACCESS_ONCE(l->pending) = 0;
+
+ return 1;
+}
+
+/*
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static u32 __always_inline
+xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else
+
+/*
+ * take ownership and clear the pending bit.
+ *
+ * *,1,0 -> *,0,1
+ */
+static int __always_inline
+try_clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+ u32 new, old;
+
+ for (;;) {
+ new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ if (unlikely(old & _Q_LOCKED_MASK))
+ return 0;
+
+ val = old;
+ }
+
+ return 1;
+}
+
/*
* xchg(lock, tail)
*
@@ -158,6 +241,8 @@ xchg_tail(struct qspinlock *lock, u32 ta
return old; /* tail bits are still fine */
}
+#endif /* _Q_PENDING_BITS == 8 */
+
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
/**
@@ -199,9 +284,14 @@ void queue_spin_lock_slowpath(struct qsp
* we're pending, wait for the owner to go away.
*
* *,1,1 -> *,1,0
+ *
+ * this wait loop must be a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because not all try_clear_pending_set_locked()
+ * implementations imply full barriers.
*/
retry_pending_wait:
- while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
cpu_relax();
/*
@@ -209,18 +299,9 @@ void queue_spin_lock_slowpath(struct qsp
*
* *,1,0 -> *,0,1
*/
- for (;;) {
- new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
+ if (!try_clear_pending_set_locked(lock, val))
+ goto retry_pending_wait;
- if (unlikely(old & _Q_LOCKED_MASK))
- goto retry_pending_wait;
-
- val = old;
- }
return;
queue:
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
` (6 preceding siblings ...)
2014-03-10 15:42 ` [RFC][PATCH 7/7] qspinlock: Optimize for smaller NR_CPUS Peter Zijlstra
@ 2014-03-11 10:45 ` Ingo Molnar
2014-03-11 11:02 ` Peter Zijlstra
2014-03-12 3:17 ` Waiman Long
2014-03-12 2:31 ` Dave Chinner
8 siblings, 2 replies; 34+ messages in thread
From: Ingo Molnar @ 2014-03-11 10:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Waiman Long, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
* Peter Zijlstra <peterz@infradead.org> wrote:
> Hi Waiman,
>
> I promised you this series a number of days ago; sorry for the delay
> I've been somewhat unwell :/
>
> That said, these few patches start with a (hopefully) simple and
> correct form of the queue spinlock, and then gradually build upon
> it, explaining each optimization as we go.
>
> Having these optimizations as separate patches helps twofold;
> firstly it makes one aware of which exact optimizations were done,
> and secondly it allows one to proove or disprove any one step;
> seeing how they should be mostly identity transforms.
>
> The resulting code is near to what you posted I think; however it
> has one atomic op less in the pending wait-acquire case for NR_CPUS
> != huge. It also doesn't do lock stealing; its still perfectly fair
> afaict.
>
> Have I missed any tricks from your code?
Waiman, you indicated in the other thread that these look good to you,
right? If so then I can queue them up so that they form a base for
further work.
It would be nice to have per patch performance measurements though ...
this split-up structure really enables that rather nicely.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-11 10:45 ` [RFC][PATCH 0/7] locking: qspinlock Ingo Molnar
@ 2014-03-11 11:02 ` Peter Zijlstra
2014-03-11 11:04 ` Ingo Molnar
2014-03-12 3:17 ` Waiman Long
1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-11 11:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Waiman Long, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On Tue, Mar 11, 2014 at 11:45:03AM +0100, Ingo Molnar wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Hi Waiman,
> >
> > I promised you this series a number of days ago; sorry for the delay
> > I've been somewhat unwell :/
> >
> > That said, these few patches start with a (hopefully) simple and
> > correct form of the queue spinlock, and then gradually build upon
> > it, explaining each optimization as we go.
> >
> > Having these optimizations as separate patches helps twofold;
> > firstly it makes one aware of which exact optimizations were done,
> > and secondly it allows one to proove or disprove any one step;
> > seeing how they should be mostly identity transforms.
> >
> > The resulting code is near to what you posted I think; however it
> > has one atomic op less in the pending wait-acquire case for NR_CPUS
> > != huge. It also doesn't do lock stealing; its still perfectly fair
> > afaict.
> >
> > Have I missed any tricks from your code?
>
> Waiman, you indicated in the other thread that these look good to you,
> right? If so then I can queue them up so that they form a base for
> further work.
Ah, no that was on the qrwlock; I think we managed to cross wires
somewhere.
I've got this entire pile waiting for something:
lkml.kernel.org/r/20140210195820.834693028@infradead.org
That's 5 mutex patches and the 2 qrwlock patches. Not sure what to do
with them. To merge or not, that is the question.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-11 11:02 ` Peter Zijlstra
@ 2014-03-11 11:04 ` Ingo Molnar
0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2014-03-11 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Waiman Long, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 11, 2014 at 11:45:03AM +0100, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Hi Waiman,
> > >
> > > I promised you this series a number of days ago; sorry for the delay
> > > I've been somewhat unwell :/
> > >
> > > That said, these few patches start with a (hopefully) simple and
> > > correct form of the queue spinlock, and then gradually build upon
> > > it, explaining each optimization as we go.
> > >
> > > Having these optimizations as separate patches helps twofold;
> > > firstly it makes one aware of which exact optimizations were done,
> > > and secondly it allows one to proove or disprove any one step;
> > > seeing how they should be mostly identity transforms.
> > >
> > > The resulting code is near to what you posted I think; however it
> > > has one atomic op less in the pending wait-acquire case for NR_CPUS
> > > != huge. It also doesn't do lock stealing; its still perfectly fair
> > > afaict.
> > >
> > > Have I missed any tricks from your code?
> >
> > Waiman, you indicated in the other thread that these look good to
> > you, right? If so then I can queue them up so that they form a
> > base for further work.
>
> Ah, no that was on the qrwlock; I think we managed to cross wires
> somewhere.
Oops, too many q-locks ;-)
> I've got this entire pile waiting for something:
>
> lkml.kernel.org/r/20140210195820.834693028@infradead.org
>
> That's 5 mutex patches and the 2 qrwlock patches. Not sure what to
> do with them. To merge or not, that is the question.
Can merge them in tip:core/locking if there's no objections.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
` (7 preceding siblings ...)
2014-03-11 10:45 ` [RFC][PATCH 0/7] locking: qspinlock Ingo Molnar
@ 2014-03-12 2:31 ` Dave Chinner
2014-03-12 3:11 ` Steven Rostedt
2014-03-12 6:15 ` Peter Zijlstra
8 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2014-03-12 2:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Waiman Long, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On Mon, Mar 10, 2014 at 04:42:36PM +0100, Peter Zijlstra wrote:
> Hi Waiman,
>
> I promised you this series a number of days ago; sorry for the delay I've been
> somewhat unwell :/
>
> That said, these few patches start with a (hopefully) simple and correct form
> of the queue spinlock, and then gradually build upon it, explaining each
> optimization as we go.
>
> Having these optimizations as separate patches helps twofold; firstly it makes
> one aware of which exact optimizations were done, and secondly it allows one to
> proove or disprove any one step; seeing how they should be mostly identity
> transforms.
>
> The resulting code is near to what you posted I think; however it has one
> atomic op less in the pending wait-acquire case for NR_CPUS != huge. It also
> doesn't do lock stealing; its still perfectly fair afaict.
>
> Have I missed any tricks from your code?
>
> The patches apply to tip/master + lkml.kernel.org/r/20140210195820.834693028@infradead.org
FWIW, I saw this patchset referenced on LWN, and noted that the lock
contention that AIM7 saw was in the VFS and XFS in particular. So,
I pulled out my trusty "smash the VFS locks" test on a 16p VM. The
multithreaded bulkstat test *hammers* the inode cache. I've got a
filesystem with 51.2 million inodes in it, and it pegs all 16 CPUs
with this profile:
$ sudo time src/bstat -t -q /mnt/scratch
0.01user 2300.09system 2:28.13elapsed 1552%CPU (0avgtext+0avgdata 56268maxresident)k
0inputs+0outputs (12major+133515minor)pagefaults 0swaps
$
.....
73.95% [kernel] [k] _raw_spin_lock
+ 48.65% evict
+ 48.21% inode_sb_list_add
+ 1.31% inode_wait_for_writeback
0.51% __remove_inode_hash
evict
iput
xfs_bulkstat_one_int
xfs_bulkstat_one
xfs_bulkstat
xfs_ioc_bulkstat
xfs_file_ioctl
do_vfs_ioctl
sys_ioctl
system_call_fastpath
__GI___ioctl
+ 4.87% [kernel] [k] native_read_tsc
+ 2.72% [kernel] [k] do_raw_spin_unlock
+ 2.36% [kernel] [k] _xfs_buf_find
+ 1.35% [kernel] [k] delay_tsc
+ 1.29% [kernel] [k] __crc32c_le
+ 1.19% [kernel] [k] __slab_alloc
+ 0.93% [kernel] [k] xfs_setup_inode
+ 0.73% [kernel] [k] _raw_spin_unlock_irqrestore
It's almost entirely spinlock contention on the inode list lock,
just like the AIM7 disk test that was referenced here:
http://lwn.net/Articles/590268/
With the queuing spinlock, I expected to see somewhat better
results, but I didn't at first. Turns out if you have any sort of
lock debugging turned on, then the code doesn't ever go into the
lock slow path and hence does not ever enter the "lock failed" slow
path where all the contention fixes are supposed to be.
Anyway, with all lock debugging turned off, the system hangs
the instant I start the multithreaded bulkstat workload. Even the
console is unresponsive. I did get a partial task trace, indicating
everything was locked up on the contended lock:
[ 64.992006] bstat R running task 5128 4488 4397 0x0000000a
[ 64.992006] ffff88041b385a18 ffffffff81ce6c81 ffff88041b385a38 ffffffff811bee8d
[ 64.992006] ffff88041b385a58 ffff8800b8527000 ffff88041b385a58 ffffffff814a0854
[ 64.992006] 0000002c80004321 ffff8800b8527000 ffff88041b385b08 ffffffff8149998c
[ 64.992006] Call Trace:
[ 64.992006] [<ffffffff81ce6c81>] ? _raw_spin_lock+0x21/0x30
[ 64.992006] [<ffffffff811bee8d>] inode_sb_list_add+0x1d/0x60
[ 64.992006] [<ffffffff814a0854>] xfs_setup_inode+0x34/0x310
[ 64.992006] [<ffffffff8149998c>] xfs_iget+0x4ec/0x750
[ 64.992006] [<ffffffff814a0b30>] ? xfs_setup_inode+0x310/0x310
[ 64.992006] [<ffffffff814a0cb7>] xfs_bulkstat_one_int+0xa7/0x340
[ 64.992006] [<ffffffff814a0f70>] xfs_bulkstat_one+0x20/0x30
[ 64.992006] [<ffffffff814a143c>] xfs_bulkstat+0x4bc/0xa10
[ 64.992006] [<ffffffff814a0f50>] ? xfs_bulkstat_one_int+0x340/0x340
[ 64.992006] [<ffffffff8149bdb0>] xfs_ioc_bulkstat+0xd0/0x1b0
[ 64.992006] [<ffffffff8149d114>] xfs_file_ioctl+0x3e4/0xae0
[ 64.992006] [<ffffffff81cea6fc>] ? __do_page_fault+0x1fc/0x4f0
[ 64.992006] [<ffffffff811b79e3>] do_vfs_ioctl+0x83/0x500
[ 64.992006] [<ffffffff811b7ef1>] SyS_ioctl+0x91/0xb0
[ 64.992006] [<ffffffff81cef652>] system_call_fastpath+0x16/0x1b
And the rest of the threads were in a cond_resched() right after the
code where the lock was taken in the evict path:
[ 64.992006] bstat R running task 4576 4493 4397 0x00000000
[ 64.992006] ffff880419c95a48 0000000000000082 ffff880419ec1820 0000000000012e00
[ 64.992006] ffff880419c95fd8 0000000000012e00 ffff880419509820 ffff880419ec1820
[ 64.992006] ffff880419c95a88 ffff8800b8a8a1a8 ffff8800b8a8a000 ffff8800b8a8a1a8
[ 64.992006] Call Trace:
[ 64.992006] [<ffffffff81ce3099>] _cond_resched+0x29/0x40
[ 64.992006] [<ffffffff811be526>] clear_inode+0x16/0x70
[ 64.992006] [<ffffffff814a59e9>] xfs_fs_evict_inode+0x59/0xf0
[ 64.992006] [<ffffffff811bf96f>] evict+0xaf/0x1b0
[ 64.992006] [<ffffffff811c00d3>] iput+0x103/0x190
[ 64.992006] [<ffffffff814a0b30>] ? xfs_setup_inode+0x310/0x310
[ 64.992006] [<ffffffff814a0e97>] xfs_bulkstat_one_int+0x287/0x340
[ 64.992006] [<ffffffff814a0f70>] xfs_bulkstat_one+0x20/0x30
[ 64.992006] [<ffffffff814a143c>] xfs_bulkstat+0x4bc/0xa10
[ 64.992006] [<ffffffff814a0f50>] ? xfs_bulkstat_one_int+0x340/0x340
[ 64.992006] [<ffffffff8149bdb0>] xfs_ioc_bulkstat+0xd0/0x1b0
[ 64.992006] [<ffffffff8149d114>] xfs_file_ioctl+0x3e4/0xae0
[ 64.992006] [<ffffffff81cea6fc>] ? __do_page_fault+0x1fc/0x4f0
[ 64.992006] [<ffffffff811b79e3>] do_vfs_ioctl+0x83/0x500
[ 64.992006] [<ffffffff811b7ef1>] SyS_ioctl+0x91/0xb0
[ 64.992006] [<ffffffff81cef652>] system_call_fastpath+0x16/0x1b
So either this patchset doesn't work right now, or there's something
else broken in the tip/master tree...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 2:31 ` Dave Chinner
@ 2014-03-12 3:11 ` Steven Rostedt
2014-03-12 4:26 ` Dave Chinner
2014-03-12 6:15 ` Peter Zijlstra
1 sibling, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2014-03-12 3:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Peter Zijlstra, Waiman Long, arnd, linux-arch, x86, linux-kernel,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On Wed, 12 Mar 2014 13:31:53 +1100
Dave Chinner <david@fromorbit.com> wrote:
> So either this patchset doesn't work right now, or there's something
> else broken in the tip/master tree...
I've found a few things broken with the tip master tree. Anyway, I'm
not sure the patches are even there. The last I saw, the patches are
still in RFC stage. Might want to take a known working kernel and apply
the patches manually.
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-11 10:45 ` [RFC][PATCH 0/7] locking: qspinlock Ingo Molnar
2014-03-11 11:02 ` Peter Zijlstra
@ 2014-03-12 3:17 ` Waiman Long
2014-03-12 6:24 ` Peter Zijlstra
1 sibling, 1 reply; 34+ messages in thread
From: Waiman Long @ 2014-03-12 3:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, arnd, linux-arch, x86, linux-kernel, rostedt,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On 03/11/2014 06:45 AM, Ingo Molnar wrote:
> * Peter Zijlstra<peterz@infradead.org> wrote:
>
>> Hi Waiman,
>>
>> I promised you this series a number of days ago; sorry for the delay
>> I've been somewhat unwell :/
>>
>> That said, these few patches start with a (hopefully) simple and
>> correct form of the queue spinlock, and then gradually build upon
>> it, explaining each optimization as we go.
>>
>> Having these optimizations as separate patches helps twofold;
>> firstly it makes one aware of which exact optimizations were done,
>> and secondly it allows one to proove or disprove any one step;
>> seeing how they should be mostly identity transforms.
>>
>> The resulting code is near to what you posted I think; however it
>> has one atomic op less in the pending wait-acquire case for NR_CPUS
>> != huge. It also doesn't do lock stealing; its still perfectly fair
>> afaict.
>>
>> Have I missed any tricks from your code?
> Waiman, you indicated in the other thread that these look good to you,
> right? If so then I can queue them up so that they form a base for
> further work.
>
> It would be nice to have per patch performance measurements though ...
> this split-up structure really enables that rather nicely.
>
> Thanks,
>
> Ingo
As said by Peter, I haven't reviewed his change yet. The patch I am
working on has an optimization that is similar to PeterZ's small NR_CPUS
change. Except that I do a single atomic short integer write to switch
the bits instead of 2 byte writes. However, this code seems to have some
problems working with the lockref code and I had a panic happening in
fs/dcache.c. So I am investigating that issue.
I am also trying to revise the PV support to be similar to what is
currently done in the PV ticketlock code. That is why I have been kind
of silent this past week.
-Longman
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 3:11 ` Steven Rostedt
@ 2014-03-12 4:26 ` Dave Chinner
2014-03-12 10:07 ` Steven Rostedt
0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2014-03-12 4:26 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Waiman Long, arnd, linux-arch, x86, linux-kernel,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On Tue, Mar 11, 2014 at 11:11:59PM -0400, Steven Rostedt wrote:
> On Wed, 12 Mar 2014 13:31:53 +1100
> Dave Chinner <david@fromorbit.com> wrote:
>
>
> > So either this patchset doesn't work right now, or there's something
> > else broken in the tip/master tree...
>
> I've found a few things broken with the tip master tree. Anyway, I'm
> not sure the patches are even there. The last I saw, the patches are
> still in RFC stage. Might want to take a known working kernel and apply
> the patches manually.
I did apply them manually - but the patchset is based on the
tip/master, plus some patches from a previous patch set that aren't
in tip/locking to be able to apply this patchset cleanly in the
first place. So there's no "start from a known working kernel" case
that applies here :/
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 2:31 ` Dave Chinner
2014-03-12 3:11 ` Steven Rostedt
@ 2014-03-12 6:15 ` Peter Zijlstra
2014-03-12 23:48 ` Dave Chinner
1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-12 6:15 UTC (permalink / raw)
To: Dave Chinner
Cc: Waiman Long, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On Wed, Mar 12, 2014 at 01:31:53PM +1100, Dave Chinner wrote:
> With the queuing spinlock, I expected to see somewhat better
> results, but I didn't at first. Turns out if you have any sort of
> lock debugging turned on, then the code doesn't ever go into the
> lock slow path and hence does not ever enter the "lock failed" slow
> path where all the contention fixes are supposed to be.
Yeah; it's a 'feature' of the spinlock debugging to turn all spinlocks
into test-and-set thingies.
> Anyway, with all lock debugging turned off, the system hangs
> the instant I start the multithreaded bulkstat workload. Even the
> console is unrepsonsive.
Oops, I only briefly tested this series in userspace and that seemed to
work. I'll go prod at it. Thanks for having a look though.
Is that bstat test any easier/faster to setup/run than the aim7 crap?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 3:17 ` Waiman Long
@ 2014-03-12 6:24 ` Peter Zijlstra
2014-03-12 15:32 ` Peter Zijlstra
2014-03-12 19:00 ` Waiman Long
0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-12 6:24 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On Tue, Mar 11, 2014 at 11:17:46PM -0400, Waiman Long wrote:
> Except that I do a single atomic short integer write to switch the bits
> instead of 2 byte write.
D'0h why didn't I think of that. A single short write is much better
than the 2 byte writes.
In any case; can you try and keep the structure in this patch-set if you
post another version? It takes a lot of effort to unravel your
all-in-one patch.
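Concretely (a sketch only, ignoring the endianness handling; the full
replacement patch follows later in the thread), the *,1,0 -> *,0,1
transition goes from two ordered byte stores to one halfword store over
the (locked,pending) pair:

	/* 8-bit pending as in patch 7/7: two stores plus a barrier */
	ACCESS_ONCE(l->locked) = 1;
	smp_wmb();
	ACCESS_ONCE(l->pending) = 0;

	/* single 16-bit store: sets locked and clears pending at once */
	ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;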
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 4:26 ` Dave Chinner
@ 2014-03-12 10:07 ` Steven Rostedt
2014-03-12 15:57 ` Peter Zijlstra
0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2014-03-12 10:07 UTC (permalink / raw)
To: Dave Chinner
Cc: Peter Zijlstra, Waiman Long, arnd, linux-arch, x86, linux-kernel,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On Wed, 12 Mar 2014 15:26:16 +1100
Dave Chinner <david@fromorbit.com> wrote:
> I did apply them manually - but the patchset is based on the
> tip/master, plus some patches from a previous patch set that aren't
> in tip/locking to be able to apply this patchset cleanly in the
> first place. So there's no "start froma known working kernel" case
> that applies here :/
Peter,
Which kernel should we start testing your code against?
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 6:24 ` Peter Zijlstra
@ 2014-03-12 15:32 ` Peter Zijlstra
2014-03-12 19:00 ` Waiman Long
1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-12 15:32 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On Wed, Mar 12, 2014 at 07:24:36AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 11, 2014 at 11:17:46PM -0400, Waiman Long wrote:
> > Except that I do a single atomic short integer write to switch the bits
> > instead of 2 byte write.
>
> D'0h why didn't I think of that. A single short write is much better
> than the 2 byte writes.
Here's a replacement patch that does the one short store.
---
Subject: qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Mar 10 14:05:36 CET 2014
When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.
By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the trickiness of having to fix up the (pending,locked) state.
This in turn allows us to unconditionally acquire; the locked state as
observed by the wait loops cannot change. And because both locked and
pending are now a full byte we can use a simple store for the state
transition, obviating one atomic operation entirely.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/asm-generic/qspinlock_types.h | 5 +
kernel/locking/qspinlock.c | 96 +++++++++++++++++++++++++++++-----
2 files changed, 89 insertions(+), 12 deletions(-)
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -48,7 +48,11 @@ typedef struct qspinlock {
#define _Q_LOCKED_MASK (((1U << _Q_LOCKED_BITS) - 1) << _Q_LOCKED_OFFSET)
#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS 8
+#else
#define _Q_PENDING_BITS 1
+#endif
#define _Q_PENDING_MASK (((1U << _Q_PENDING_BITS) - 1) << _Q_PENDING_OFFSET)
#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -59,6 +63,7 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK (((1U << _Q_TAIL_CPU_BITS) - 1) << _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -22,6 +22,7 @@
#include <linux/percpu.h>
#include <linux/hardirq.h>
#include <linux/mutex.h>
+#include <asm/byteorder.h>
#include <asm/qspinlock.h>
/*
@@ -114,6 +115,79 @@ try_set_pending(struct qspinlock *lock,
return 1;
}
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+ union {
+ atomic_t val;
+ struct {
+#ifdef __LITTLE_ENDIAN
+ u16 locked_pending;
+ u16 tail;
+#else
+ u16 tail;
+ u16 locked_pending;
+#endif
+ };
+ };
+};
+
+/*
+ * take ownership and clear the pending bit.
+ *
+ * *,1,0 -> *,0,1
+ */
+static int __always_inline
+try_clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ ACCESS_ONCE(l->locked_pending) = __constant_le16_to_cpu(_Q_LOCKED_VAL);
+
+ return 1;
+}
+
+/*
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static u32 __always_inline
+xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else
+
+/*
+ * take ownership and clear the pending bit.
+ *
+ * *,1,0 -> *,0,1
+ */
+static int __always_inline
+try_clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+ u32 new, old;
+
+ for (;;) {
+ new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ if (unlikely(old & _Q_LOCKED_MASK))
+ return 0;
+
+ val = old;
+ }
+
+ return 1;
+}
+
/*
* xchg(lock, tail)
*
@@ -158,6 +232,8 @@ xchg_tail(struct qspinlock *lock, u32 ta
return old; /* tail bits are still fine */
}
+#endif /* _Q_PENDING_BITS == 8 */
+
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
/**
@@ -199,9 +275,14 @@ void queue_spin_lock_slowpath(struct qsp
* we're pending, wait for the owner to go away.
*
* *,1,1 -> *,1,0
+ *
+ * this wait loop must be a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because not all try_clear_pending_set_locked()
+ * implementations imply full barriers.
*/
retry_pending_wait:
- while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
cpu_relax();
/*
@@ -209,18 +290,9 @@ void queue_spin_lock_slowpath(struct qsp
*
* *,1,0 -> *,0,1
*/
- for (;;) {
- new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
+ if (!try_clear_pending_set_locked(lock, val))
+ goto retry_pending_wait;
- if (unlikely(old & _Q_LOCKED_MASK))
- goto retry_pending_wait;
-
- val = old;
- }
return;
queue:
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 10:07 ` Steven Rostedt
@ 2014-03-12 15:57 ` Peter Zijlstra
2014-03-12 16:06 ` Linus Torvalds
2014-03-12 16:19 ` Steven Rostedt
0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-12 15:57 UTC (permalink / raw)
To: Steven Rostedt
Cc: Dave Chinner, Waiman Long, arnd, linux-arch, x86, linux-kernel,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On Wed, Mar 12, 2014 at 06:07:30AM -0400, Steven Rostedt wrote:
> On Wed, 12 Mar 2014 15:26:16 +1100
> Dave Chinner <david@fromorbit.com> wrote:
>
>
> > I did apply them manually - but the patchset is based on the
> > tip/master, plus some patches from a previous patch set that aren't
> > in tip/locking to be able to apply this patchset cleanly in the
> > first place. So there's no "start froma known working kernel" case
> > that applies here :/
>
> Peter,
>
> Which kernel should we start testing your code against?
I built and booted these patches against tip/master as of this morning.
It built ~50 defconfig kernels on my WSM-EP.
I've not managed to reproduce the hang as reported by Dave, but then I
doubt building kernels has anywhere near the lock contention he
generated.
I did run my userspace on both the WSM-EP as well as the Interlagos box,
topping out at 24 and 32 contending cpus resp.
No weird lockups :/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 15:57 ` Peter Zijlstra
@ 2014-03-12 16:06 ` Linus Torvalds
2014-03-12 16:19 ` Steven Rostedt
1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2014-03-12 16:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Dave Chinner, Waiman Long, Arnd Bergmann,
linux-arch@vger.kernel.org, the arch/x86 maintainers,
Linux Kernel Mailing List, Andrew Morton, Michel Lespinasse,
Andi Kleen, Rik van Riel, Paul McKenney, Oleg Nesterov
On Wed, Mar 12, 2014 at 8:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> I've not managed to reproduce the hang as reported by Dave, but then I
> doubt building kernels has anywhere near the lock contention he
> generated.
Building kernels has basically no lock contention at all.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 15:57 ` Peter Zijlstra
2014-03-12 16:06 ` Linus Torvalds
@ 2014-03-12 16:19 ` Steven Rostedt
2014-03-12 16:19 ` Steven Rostedt
2014-03-12 16:23 ` Peter Zijlstra
1 sibling, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2014-03-12 16:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dave Chinner, Waiman Long, arnd, linux-arch, x86, linux-kernel,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On Wed, 12 Mar 2014 16:57:54 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> I build and booted these patches against tip/master as of this morning.
> It build ~50 defconfig kernels on my WSM-EP.
>
> I've not managed to reproduce the hang as reported by Dave, but then I
> doubt building kernels has anywhere near the lock contention he
> generated.
>
Have you tried hackbench? That usually gives rq spinlocks a bit of
contention.
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 16:19 ` Steven Rostedt
2014-03-12 16:19 ` Steven Rostedt
@ 2014-03-12 16:23 ` Peter Zijlstra
2014-03-12 16:23 ` Peter Zijlstra
1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-12 16:23 UTC (permalink / raw)
To: Steven Rostedt
Cc: Dave Chinner, Waiman Long, arnd, linux-arch, x86, linux-kernel,
akpm, walken, andi, riel, paulmck, torvalds, oleg
On Wed, Mar 12, 2014 at 12:19:33PM -0400, Steven Rostedt wrote:
> On Wed, 12 Mar 2014 16:57:54 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> > I build and booted these patches against tip/master as of this morning.
> > It build ~50 defconfig kernels on my WSM-EP.
> >
> > I've not managed to reproduce the hang as reported by Dave, but then I
> > doubt building kernels has anywhere near the lock contention he
> > generated.
> >
>
> Have you tried hackbench? That usually gives rq spinlocks a bit of
> contention.
Ha! Bang it goes. I suppose my userspace contention is too nice or
something to trigger this.
Lemme try and put it together again :-)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 6:24 ` Peter Zijlstra
2014-03-12 15:32 ` Peter Zijlstra
@ 2014-03-12 19:00 ` Waiman Long
1 sibling, 0 replies; 34+ messages in thread
From: Waiman Long @ 2014-03-12 19:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On 03/12/2014 02:24 AM, Peter Zijlstra wrote:
> On Tue, Mar 11, 2014 at 11:17:46PM -0400, Waiman Long wrote:
>> Except that I do a single atomic short integer write to switch the bits
>> instead of 2 byte write.
> D'0h why didn't I think of that. A single short write is much better
> than the 2 byte writes.
>
> In any case; can you try and keep the structure in this patch-set if you
> post another version? It takes a lot of effort to unravel your
> all-in-one patch.
I haven't had time to fully review and test your changes yet, so I just
posted a new series of what I currently have right now. I will certainly
do the testing on your patches.
-Longman
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 0/7] locking: qspinlock
2014-03-12 6:15 ` Peter Zijlstra
@ 2014-03-12 23:48 ` Dave Chinner
0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2014-03-12 23:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Waiman Long, arnd, linux-arch, x86, linux-kernel, rostedt, akpm,
walken, andi, riel, paulmck, torvalds, oleg
On Wed, Mar 12, 2014 at 07:15:03AM +0100, Peter Zijlstra wrote:
> On Wed, Mar 12, 2014 at 01:31:53PM +1100, Dave Chinner wrote:
> > With the queuing spinlock, I expected to see somewhat better
> > results, but I didn't at first. Turns out if you have any sort of
> > lock debugging turned on, then the code doesn't ever go into the
> > lock slow path and hence does not ever enter the "lock failed" slow
> > path where all the contention fixes are supposed to be.
>
> Yeah; its a 'feature' of the spinlock debugging to turn all spinlocks
> into test-and-set thingies.
>
> > Anyway, with all lock debugging turned off, the system hangs
> > the instant I start the multithreaded bulkstat workload. Even the
> > console is unrepsonsive.
>
> Oops, I only briefly tested this series in userspace and that seemed to
> work. I'll go prod at it. Thanks for having a look though.
>
> Is that bstat test any easier/faster to setup/run than the aim7 crap?
Depends. I've got a VM setup with a sparse 100TB block device hosted
on SSDs where I can create 50M inodes using fsmark in about 3 and a
half minutes. I also have a hacked, multithreaded xfstests::src/bstat.c
that I then run, and it triggers it straight away.
Quite frankly, you don't need bulkstat to produce this lock
contention - you'll see it running this on a wide directory
structure on XFS and an SSD:
$ cat ~/tests/walk-scratch.sh
#!/bin/bash
echo Walking via find -ctime
echo 3 > /proc/sys/vm/drop_caches
time (
for d in /mnt/scratch/[0-9]* ; do
for i in $d/*; do
(
echo $i
find $i -ctime 1 > /dev/null
) > /dev/null 2>&1
done &
done
wait
)
echo Walking via ls -R
echo 3 > /proc/sys/vm/drop_caches
time (
for d in /mnt/scratch/[0-9]* ; do
for i in $d/*; do
(
echo $i
ls -R $i
) > /dev/null 2>&1
done &
done
wait
)
$
The directory structure I create has 16 top level directories (0-15)
each with 30-40 subdirectories containing 100,000 files each.
There's a thread per top level directory, and running it on a 16p VM
and an SSD that can do 30,000 IOPS will generate sufficient inode
cache pressure to trigger severe lock contention.
My usual test script for this workload runs mkfs, fsmark,
xfs_repair, bstat, walk-scratch, and finally a multi-threaded rm to
clean up. Usual inode numbers are in the 50-100 million range for zero
length file workloads, 10-20 million for single block (small) files,
and 100,000-1 million for larger files. It's great for stressing VFS,
FS and IO level scalability...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation
2014-03-10 15:42 ` [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
@ 2014-03-13 13:07 ` Peter Zijlstra
1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-03-13 13:07 UTC (permalink / raw)
To: Waiman Long
Cc: arnd, linux-arch, x86, linux-kernel, rostedt, akpm, walken, andi,
riel, paulmck, torvalds, oleg
On Mon, Mar 10, 2014 at 04:42:37PM +0100, Peter Zijlstra wrote:
A rather embarrassing bug indeed; and no wonder my userspace didn't
trigger this; it doesn't have nested contexts.
> +static inline struct mcs_spinlock *decode_tail(u32 tail)
> +{
> + int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
> + int idx = (tail >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;
+ int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
> +
> + return per_cpu_ptr(&mcs_nodes[idx], cpu);
> +}
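To make the breakage concrete, a small standalone demo (hypothetical, not
part of the patch; it just replays the decode with the constants from this
patch: _Q_TAIL_IDX_OFFSET = 8, _Q_TAIL_IDX_MASK = 0x300, _Q_TAIL_CPU_OFFSET
= 10):

#include <stdio.h>

#define _Q_TAIL_IDX_OFFSET	8
#define _Q_TAIL_IDX_MASK	(3U << _Q_TAIL_IDX_OFFSET)
#define _Q_TAIL_CPU_OFFSET	10

int main(void)
{
	unsigned int cpu = 63, idx = 0;
	unsigned int tail = ((cpu + 1) << _Q_TAIL_CPU_OFFSET) |
			    (idx << _Q_TAIL_IDX_OFFSET);

	/* broken: shifts first, then applies the already-shifted mask */
	printf("broken idx = %u\n",
	       (tail >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK);	/* 256 */
	/* fixed: masks first, then shifts the idx field down */
	printf("fixed  idx = %u\n",
	       (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET);	/* 0 */
	return 0;
}

With only 2 tail-idx bits there are at most 4 per-cpu MCS nodes, so an idx
of 256 walks far off the end of the per-cpu array.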
Fixed patch below. With this the series seems to survive hackbench.
Onwards to find more interesting loads.
---
Subject: qspinlock: Introducing a 4-byte queue spinlock implementation
From: Waiman Long <Waiman.Long@hp.com>
Date: Wed, 26 Feb 2014 10:14:21 -0500
Simple 4 byte MCS like spinlock implementation.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393427668-60228-2-git-send-email-Waiman.Long@hp.com
---
include/asm-generic/qspinlock.h | 113 +++++++++++++++++++
include/asm-generic/qspinlock_types.h | 59 ++++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1
kernel/locking/mcs_spinlock.h | 1
kernel/locking/qspinlock.c | 196 ++++++++++++++++++++++++++++++++++
6 files changed, 377 insertions(+)
create mode 100644 include/asm-generic/qspinlock.h
create mode 100644 include/asm-generic/qspinlock_types.h
create mode 100644 kernel/locking/qspinlock.c
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,113 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return atomic_read(&lock->val) & _Q_LOCKED_VAL;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !(atomic_read(&lock.val) & _Q_LOCKED_VAL);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->val) &&
+ (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+ return 1;
+ return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ u32 val;
+
+ val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+ if (likely(val == 0))
+ return;
+ queue_spin_lock_slowpath(lock, val);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * smp_mb__before_atomic_dec() in order to guarantee release semantics
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_Q_LOCKED_VAL, &lock->val);
+}
+#endif
+
+/*
+ * Initializier
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,59 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here.
+ */
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt_types.h>
+#endif
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <linux/bug.h>
+
+typedef struct qspinlock {
+ atomic_t val;
+} arch_spinlock_t;
+
+/*
+ * Bitfields in the atomic value:
+ *
+ * 0- 7: locked byte
+ * 8- 9: tail index
+ * 10-31: tail cpu (+1)
+ */
+
+#define _Q_LOCKED_OFFSET 0
+#define _Q_LOCKED_BITS 8
+#define _Q_LOCKED_MASK (((1U << _Q_LOCKED_BITS) - 1) << _Q_LOCKED_OFFSET)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_TAIL_IDX_BITS 2
+#define _Q_TAIL_IDX_MASK (((1U << _Q_TAIL_IDX_BITS) - 1) << _Q_TAIL_IDX_OFFSET)
+
+#define _Q_TAIL_CPU_OFFSET (_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
+#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
+#define _Q_TAIL_CPU_MASK (((1U << _Q_TAIL_CPU_BITS) - 1) << _Q_TAIL_CPU_OFFSET)
+
+#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -224,6 +224,13 @@ config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT_SPINLOCKS
+
config ARCH_USE_QUEUE_RWLOCK
bool
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,6 +17,7 @@
struct mcs_spinlock {
struct mcs_spinlock *next;
int locked; /* 1 if lock acquired */
+ int count;
};
#ifndef arch_mcs_spin_lock_contended
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,196 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ * Peter Zijlstra <pzijlstr@redhat.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <asm/qspinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock, however to make
+ * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
+ * API, we must modify it some.
+ *
+ * In particular; where the traditional MCS lock consists of a tail pointer
+ * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
+ * unlock the next pending (next->locked), we compress both these: {tail,
+ * next->locked} into a single u32 value.
+ *
+ * Since a spinlock disables recursion of its own context and there is a limit
+ * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
+ * encode the tail as an index indicating this context and a cpu number.
+ *
+ * We can further change the first spinner to spin on a bit in the lock word
+ * instead of its node; thereby avoiding the need to carry a node from lock to
+ * unlock, and preserving API.
+ */
+
+#include "mcs_spinlock.h"
+
+/*
+ * Per-CPU queue node structures; we can never have more than 4 nested
+ * contexts: task, softirq, hardirq, nmi.
+ *
+ * Exactly fits one cacheline.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+
+/*
+ * We must be able to distinguish between no-tail and the tail at 0:0,
+ * therefore increment the cpu number by one.
+ */
+
+static inline u32 encode_tail(int cpu, int idx)
+{
+ u32 tail;
+
+ tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+ tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
+
+ return tail;
+}
+
+static inline struct mcs_spinlock *decode_tail(u32 tail)
+{
+ int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+ int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+ return per_cpu_ptr(&mcs_nodes[idx], cpu);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ * @val: Current value of the queue spinlock 32-bit word
+ *
+ *
+ * fast : slow : unlock
+ * : :
+ * uncontended (0,0) --:--> (0,1) --------------------------------:--> (*,0)
+ * : | ^--------. / :
+ * : v \ | :
+ * uncontended : (n,x) --+--> (n,0) | :
+ * queue : | ^--' | :
+ * : v | :
+ * contended : (*,x) --+--> (*,0) -----> (*,1) ---' :
+ * queue : ^--' :
+ *
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ struct mcs_spinlock *prev, *next, *node;
+ u32 new, old, tail;
+ int idx;
+
+ BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+ node = this_cpu_ptr(&mcs_nodes[0]);
+ idx = node->count++;
+ tail = encode_tail(smp_processor_id(), idx);
+
+ node += idx;
+ node->locked = 0;
+ node->next = NULL;
+
+ /*
+ * trylock || xchg(lock, node)
+ *
+ * 0,0 -> 0,1 ; trylock
+ * p,x -> n,x ; prev = xchg(lock, node)
+ */
+ for (;;) {
+ new = _Q_LOCKED_VAL;
+ if (val)
+ new = tail | (val & _Q_LOCKED_MASK);
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * we won the trylock; forget about queueing.
+ */
+ if (new == _Q_LOCKED_VAL)
+ goto release;
+
+ /*
+ * if there was a previous node; link it and wait.
+ */
+ if (old & ~_Q_LOCKED_MASK) {
+ prev = decode_tail(old);
+ ACCESS_ONCE(prev->next) = node;
+
+ arch_mcs_spin_lock_contended(&node->locked);
+ }
+
+ /*
+ * we're at the head of the waitqueue, wait for the owner to go away.
+ *
+ * *,x -> *,0
+ */
+ while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+ cpu_relax();
+
+ /*
+ * claim the lock:
+ *
+ * n,0 -> 0,1 : lock, uncontended
+ * *,0 -> *,1 : lock, contended
+ */
+ for (;;) {
+ new = _Q_LOCKED_VAL;
+ if (val != tail)
+ new |= val;
+
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+
+ /*
+ * contended path; wait for next, release.
+ */
+ if (new != _Q_LOCKED_VAL) {
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+
+ arch_mcs_spin_unlock_contended(&next->locked);
+ }
+
+release:
+ /*
+ * release the node
+ */
+ this_cpu_dec(mcs_nodes[0].count);
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
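(Not part of the patch, but for anyone who wants to poke at the lock word outside the kernel: below is a minimal user-space approximation of just the uncontended path, using C11 atomics and made-up toy_* names; the queued slow path, which is the whole point of the series, is deliberately left out. The release-ordered fetch_sub stands in for the smp_mb__before_atomic_dec() + atomic_sub() pair used above.)

#include <stdatomic.h>
#include <stdio.h>

#define _Q_LOCKED_VAL	1U

struct qspinlock { atomic_uint val; };

static int toy_trylock(struct qspinlock *lock)
{
	unsigned int expected = 0;

	/* 0,0 -> 0,1 : only take the lock when the word is completely idle */
	return atomic_compare_exchange_strong(&lock->val, &expected, _Q_LOCKED_VAL);
}

static void toy_unlock(struct qspinlock *lock)
{
	/* clear the locked byte; release ordering replaces the explicit barrier */
	atomic_fetch_sub_explicit(&lock->val, _Q_LOCKED_VAL, memory_order_release);
}

int main(void)
{
	struct qspinlock lock = { ATOMIC_VAR_INIT(0) };

	printf("trylock #1: %d\n", toy_trylock(&lock));	/* 1: acquired     */
	printf("trylock #2: %d\n", toy_trylock(&lock));	/* 0: already held */
	toy_unlock(&lock);
	printf("trylock #3: %d\n", toy_trylock(&lock));	/* 1: free again   */
	return 0;
}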
Thread overview: 34+ messages
2014-03-10 15:42 [RFC][PATCH 0/7] locking: qspinlock Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 1/7] qspinlock: Introducing a 4-byte queue spinlock implementation Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-13 13:07 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 2/7] qspinlock, x86: Enable x86 to use queue spinlock Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 3/7] qspinlock: Add pending bit Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 4/7] x86: Add atomic_test_and_set_bit() Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 5/7] qspinlock: Optimize the pending case Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 6/7] qspinlock: Optimize xchg_tail Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-10 15:42 ` [RFC][PATCH 7/7] qspinlock: Optimize for smaller NR_CPUS Peter Zijlstra
2014-03-10 15:42 ` Peter Zijlstra
2014-03-11 10:45 ` [RFC][PATCH 0/7] locking: qspinlock Ingo Molnar
2014-03-11 11:02 ` Peter Zijlstra
2014-03-11 11:04 ` Ingo Molnar
2014-03-12 3:17 ` Waiman Long
2014-03-12 6:24 ` Peter Zijlstra
2014-03-12 15:32 ` Peter Zijlstra
2014-03-12 19:00 ` Waiman Long
2014-03-12 2:31 ` Dave Chinner
2014-03-12 3:11 ` Steven Rostedt
2014-03-12 4:26 ` Dave Chinner
2014-03-12 10:07 ` Steven Rostedt
2014-03-12 15:57 ` Peter Zijlstra
2014-03-12 16:06 ` Linus Torvalds
2014-03-12 16:19 ` Steven Rostedt
2014-03-12 16:19 ` Steven Rostedt
2014-03-12 16:23 ` Peter Zijlstra
2014-03-12 16:23 ` Peter Zijlstra
2014-03-12 6:15 ` Peter Zijlstra
2014-03-12 23:48 ` Dave Chinner