* [PATCH v7 0/4] qrwlock: Introducing a queue read/write lock implementation
@ 2013-11-22 19:04 Waiman Long
  2013-11-22 19:04 ` [PATCH v7 1/4] qrwlock: A " Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Waiman Long @ 2013-11-22 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton,
	Waiman Long

v6->v7:
 - Remove support for unfair lock, so only fair qrwlock will be provided.
 - Move qrwlock.c to the kernel/locking directory.

v5->v6:
 - Modify queue_read_can_lock() to avoid false positive result.
 - Move the two slowpath functions' performance tuning change from
   patch 4 to patch 1.
 - Add a new optional patch to use the new smp_store_release() function 
   if that is merged.

v4->v5:
 - Fix wrong definitions for QW_MASK_FAIR & QW_MASK_UNFAIR macros.
 - Add an optional patch 4 which should only be applied after the
   mcs_spinlock.h header file is merged.

v3->v4:
 - Optimize the fast path with better cold cache behavior and
   performance.
 - Remove some testing code.
 - Make x86 use queue rwlock with no user configuration.

v2->v3:
 - Make read lock stealing the default and fair rwlock an option with
   a different initializer.
 - In queue_read_lock_slowpath(), check irq_count() and force spinning
   and lock stealing in interrupt context.
 - Unify the fair and classic read-side code path, and make write-side
   to use cmpxchg with 2 different writer states. This slows down the
   write lock fastpath to make the read side more efficient, but is
   still slightly faster than a spinlock.

v1->v2:
 - Improve lock fastpath performance.
 - Optionally provide classic read/write lock behavior for backward
   compatibility.
 - Use xadd instead of cmpxchg for the fair reader code path to make it
   immune to reader contention.
 - Run more performance testing.

As mentioned in the LWN article http://lwn.net/Articles/364583/,
the read/write lock suffers from an unfairness problem: it is
possible for a stream of incoming readers to block a waiting writer
from getting the lock for a long time. Also, a waiting reader/writer
contending for a rwlock in local memory has a higher chance of
acquiring the lock than a reader/writer on a remote node.

This patch set introduces a queue-based read/write lock implementation
that can largely solve this unfairness problem.

The read lock slowpath will check if the reader is in an interrupt
context. If so, it will force lock spinning and stealing without
waiting in a queue. This is to ensure the read lock will be granted
as soon as possible.
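
A minimal sketch of that interrupt-context check, condensed from
queue_read_lock_slowpath() in patch 1, looks like this:

void queue_read_lock_slowpath(struct qrwlock *lock)
{
	union qrwcnts cnts;

	if (unlikely(irq_count())) {
		/*
		 * A reader in interrupt context just spins until the
		 * writer goes away instead of joining the wait queue,
		 * so recursive read locking from irq context is safe.
		 */
		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
		rspin_until_writer_unlock(lock, cnts);
		return;
	}
	/* ... otherwise queue up and wait as a normal (fair) reader ... */
}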

The queue write lock can also be used as a replacement for highly
contended ticket spinlocks, provided the increase in lock size is not
an issue.
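
For illustration only (the lock and function names below are made up),
a rwlock that is only ever write-locked already behaves like such a
fair queued spinlock once CONFIG_QUEUE_RWLOCK remaps the arch hooks:

#include <linux/spinlock.h>

static DEFINE_RWLOCK(demo_lock);	/* hypothetical example lock */

static void demo_update(void)
{
	write_lock(&demo_lock);		/* single cmpxchg when uncontended */
	/* ... update the shared data ... */
	write_unlock(&demo_lock);
}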

This patch set currently provides queue read/write lock support on
x86 architecture only. Support for other architectures can be added
later on once architecture specific support infrastructure is added
and proper testing is done.

The optional patch 3 has a dependency on the mcs_spinlock.h header
file which has not been merged yet. So this patch should only be
applied after the mcs_spinlock.h header file is merged.

The optional patch 4 uses the new smp_store_release() and
smp_load_acquire() functions which are still under review and not
merged yet.

Waiman Long (4):
  qrwlock: A queue read/write lock implementation
  qrwlock x86: Enable x86 to use queue read/write lock
  qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  qrwlock: Use smp_store_release() in write_unlock()

 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/spinlock.h       |    2 +
 arch/x86/include/asm/spinlock_types.h |    4 +
 include/asm-generic/qrwlock.h         |  200 +++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qrwlock.c              |  196 ++++++++++++++++++++++++++++++++
 7 files changed, 411 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 kernel/locking/qrwlock.c

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-11-22 19:04 [PATCH v7 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
@ 2013-11-22 19:04 ` Waiman Long
  2013-11-22 19:14   ` Linus Torvalds
  2013-12-17 19:21   ` Paul E. McKenney
  2013-11-22 19:04 ` [PATCH v7 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Waiman Long @ 2013-11-22 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton,
	Waiman Long

This patch introduces a new read/write lock implementation that puts
waiting readers and writers into a queue instead of actively contending
for the lock like the current read/write lock implementation. This
improves performance in highly contended situations by reducing the
cache line bouncing effect.

The queue read/write lock (qrwlock) is a fair lock even though there
is still a slight chance of lock stealing if a reader or writer comes
at the right moment.  Other than that, lock granting is done in a
FIFO manner.  As a result, it is possible to determine a maximum time
period after which the waiting is over and the lock can be acquired.

Internally, however, there is a second type of reader which tries to
steal the lock aggressively: it simply increments the reader count and
waits until the writer releases the lock. The transition to an
aggressive reader happens in the read lock slowpath when the reader is

 1. in an interrupt context, or
 2. at the head of the wait queue and sees the release of a write lock.

The queue read lock is safe to use in an interrupt context (softirq
or hardirq) as it will switch to become an aggressive reader in such
an environment, allowing recursive read locks.

The only downside of queue rwlock is the size increase in the lock
structure by 4 bytes for 32-bit systems and by 12 bytes for 64-bit
systems.

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on 2.4GHz and 2.93GHz Westmere x86-64
CPUs. The following table shows the average time (in ns) for a single
lock/unlock sequence (including the looping and timing overhead):

Lock Type                   2.4GHz      2.93GHz
---------                   ------      -------
Ticket spinlock              14.9        12.3
Read lock                    17.0        13.5
Write lock                   17.0        13.5
Queue read lock              16.0        13.4
Queue write lock              9.2         7.8

The queue read lock is slightly slower than the ticket spinlock, but
is slightly faster than the read lock. The queue write lock, however,
is the fastest of all. It is almost twice as fast as the write lock
and about 1.5X as fast as the ticket spinlock.
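
The timing harness itself is not part of this posting; a rough sketch
of the kind of loop used, assuming a simple in-kernel measurement
based on sched_clock(), might look like:

#include <linux/spinlock.h>
#include <linux/sched.h>	/* sched_clock() */
#include <linux/math64.h>

#define NR_LOOPS	(256 * 1024)

static DEFINE_RWLOCK(test_lock);	/* hypothetical test lock */

static u64 avg_write_lock_ns(void)
{
	u64 start, delta;
	int i;

	start = sched_clock();
	for (i = 0; i < NR_LOOPS; i++) {
		write_lock(&test_lock);
		write_unlock(&test_lock);
	}
	delta = sched_clock() - start;

	/* average cost of one lock/unlock pair, in ns */
	return div_u64(delta, NR_LOOPS);
}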

With lock contention, the speed of each individual lock/unlock function
is less important than the amount of contention-induced delays.

To investigate the performance characteristics of the queue rwlock
compared with the regular rwlock, Ingo's anon_vmas patch that converts
rwsem to rwlock was applied to a 3.12 kernel. This kernel was then
tested under the following 3 conditions:

 1) Plain 3.12
 2) Ingo's patch
 3) Ingo's patch + qrwlock

Each of the 3 kernels was booted up twice, with and without the
"idle=poll" kernel parameter which keeps the CPUs in the C0 state while
idling instead of a more energy-saving sleep state.  The jobs per
minute (JPM) results of the AIM7 high_systime workload at 1500
users on an 8-socket 80-core DL980 (HT off) were:

 Kernel	    JPMs	%Change from (1)
 ------	    ----	----------------
   1	145704/227295		-
   2	229750/236066	    +58%/+3.8%
   3	240062/248606	    +65%/+9.4%

The first JPM number is without the "idle=poll" kernel parameter;
the second number is with that parameter. It can be seen that most
of the performance benefit of converting rwsem to rwlock actually
comes from the latency improvement of not needing to wake up a CPU
from a deep sleep state when work is available.

The use of non-sleeping locks did improve performance by eliminating
the context switching cost. Using the queue rwlock gave almost a
tripling of the performance gain. The performance gain was reduced
somewhat with a fair lock, which was to be expected.

Looking at the perf profiles (with idle=poll) below, we can clearly see
that other bottlenecks were constraining the performance improvement.

Perf profile of kernel (2):

 18.65%    reaim  [kernel.kallsyms]  [k] __write_lock_failed
  9.00%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  5.21%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  3.08%    reaim  [kernel.kallsyms]  [k] mspin_lock
  2.50%    reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.00%       ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.29%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  1.21%    reaim  [kernel.kallsyms]  [k] __read_lock_failed
  1.12%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] perf_event_aux
  1.09%     true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Perf profile of kernel (3):

 20.14%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  7.94%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  5.41%    reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
  5.01%    reaim  [kernel.kallsyms]  [k] mspin_lock
  2.12%    reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.07%       ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.58%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  1.25%    reaim  [kernel.kallsyms]  [k] queue_write_3step_lock
  1.18%    reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
  1.14%     true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  0.95%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner

The spinlock bottlenecks are shown below.

  7.94%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
              |--59.72%-- release_pages
              |--37.41%-- pagevec_lru_move_fn
              |--0.82%-- get_page_from_freelist
              |--0.73%-- __page_cache_release
               --1.32%-- [...]

For both the release_pages() and pagevec_lru_move_fn() functions, the
spinlock contention was on zone->lru_lock. With the queue spinlock
patch, however, the contention went away, a lot more idle time became
available, and the JPM number went up to 265532, which was an
additional performance improvement.

 28.40%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  6.89%    reaim  [kernel.kallsyms]  [k] mspin_lock
  4.17%    reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
  2.10%    reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  1.82%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  1.34%    reaim  [kernel.kallsyms]  [k] entity_tick
  1.17%    reaim  [kernel.kallsyms]  [k] queue_write_3step_lock
  1.06%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  0.86%    reaim  [kernel.kallsyms]  [k] perf_event_aux
  0.83%       ls  [kernel.kallsyms]  [k] mspin_lock
   :
  0.53%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
  0.14%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
machine.  It was found that the performance improvement of 11% was
the same with either the regular rwlock or the queue rwlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |  205 +++++++++++++++++++++++++++++++
 kernel/Kconfig.locks          |    7 +
 kernel/locking/Makefile       |    1 +
 kernel/locking/qrwlock.c      |  265 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 478 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 kernel/locking/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 0000000..9d085cb
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,205 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <asm/bitops.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+#include <asm/byteorder.h>
+
+#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
+#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
+#endif
+
+/*
+ * The queue read/write lock data structure
+ *
+ * The layout of the structure is endian-sensitive to make sure that adding
+ * _QR_BIAS to the rw field to increment the reader count won't disturb
+ * the writer field. The least significant 8 bits is the writer field
+ * whereas the remaining 24 bits is the reader count.
+ */
+struct qrwnode {
+	struct qrwnode *next;
+	bool		wait;	/* Waiting flag */
+};
+
+typedef struct qrwlock {
+	union qrwcnts {
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u8  writer;	/* Writer state		*/
+#else
+			u16 r16;	/* Reader count - msb	*/
+			u8  r8;		/* Reader count - lsb	*/
+			u8  writer;	/* Writer state		*/
+#endif
+		};
+		u32	rw;		/* Reader/writer number pair */
+	} cnts;
+	struct qrwnode *waitq;		/* Tail of waiting queue */
+} arch_rwlock_t;
+
+/*
+ * Writer states & reader shift and bias
+ */
+#define	_QW_WAITING	1		/* A writer is waiting	   */
+#define	_QW_LOCKED	0xff		/* A writer holds the lock */
+#define	_QR_SHIFT	8		/* Reader count shift	   */
+#define _QR_BIAS	(1U << _QR_SHIFT)
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+	return !ACCESS_ONCE(lock->cnts.writer);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+	return !ACCESS_ONCE(lock->cnts.rw);
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+	union qrwcnts cnts;
+
+	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	if (likely(!cnts.writer)) {
+		cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
+		if (likely(!cnts.writer))
+			return 1;
+		add_smp(&lock->cnts.rw, -_QR_BIAS);
+	}
+	return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_write_trylock(struct qrwlock *lock)
+{
+	union qrwcnts old, new;
+
+	old.rw = ACCESS_ONCE(lock->cnts.rw);
+	if (likely(!old.rw)) {
+		new.rw = old.rw;
+		new.writer = _QW_LOCKED;
+		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+			return 1;
+	}
+	return 0;
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline void queue_read_lock(struct qrwlock *lock)
+{
+	union qrwcnts cnts;
+
+	cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
+	if (likely(!cnts.writer))
+		return;
+	/*
+	 * Slowpath will decrement the reader count, if necessary
+	 */
+	queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_lock(struct qrwlock *lock)
+{
+	/*
+	 * Optimize for the uncontended case where the lock is completely free.
+	 */
+	if (cmpxchg(&lock->cnts.rw, 0, _QW_LOCKED) == 0)
+		return;
+	queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+	/*
+	 * Atomically decrement the reader count
+	 */
+	add_smp(&lock->cnts.rw, -_QR_BIAS);
+}
+
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+	/*
+	 * Make sure that none of the critical section will be leaked out.
+	 */
+	smp_mb__before_clear_bit();
+	ACCESS_ONCE(lock->cnts.writer) = 0;
+	smp_mb__after_clear_bit();
+}
+
+/*
+ * Initializer
+ */
+#define	__ARCH_RW_LOCK_UNLOCKED	{ .cnts = { .rw = 0 }, .waitq = NULL }
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l)	queue_read_can_lock(l)
+#define arch_write_can_lock(l)	queue_write_can_lock(l)
+#define arch_read_lock(l)	queue_read_lock(l)
+#define arch_write_lock(l)	queue_write_lock(l)
+#define arch_read_trylock(l)	queue_read_trylock(l)
+#define arch_write_trylock(l)	queue_write_trylock(l)
+#define arch_read_unlock(l)	queue_read_unlock(l)
+#define arch_write_unlock(l)	queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..b665478 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_QUEUE_RWLOCK
+	bool
+
+config QUEUE_RWLOCK
+	def_bool y if ARCH_QUEUE_RWLOCK
+	depends on SMP
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..3e7bab1 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
new file mode 100644
index 0000000..ea5553d
--- /dev/null
+++ b/kernel/locking/qrwlock.c
@@ -0,0 +1,265 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <asm-generic/qrwlock.h>
+
+/*
+ * Compared with regular rwlock, the queue rwlock has the following
+ * advantages:
+ * 1. Even though there is a slight chance of stealing the lock if one comes
+ *    at the right moment, the granting of the lock is mostly in FIFO order.
+ * 2. It is usually faster in high contention situations.
+ *
+ * The only downside is that the lock is 4 bytes larger in 32-bit systems
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot wait queue. The writers that follow will have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with the x86 ticket spinlock, the queue rwlock is faster in high
+ * contention situations. The writer lock is also faster in single-thread
+ * operations. Therefore, queue rwlock can be considered as a replacement
+ * for those spinlocks that are highly contended as long as an increase
+ * in lock size is not an issue.
+ */
+
+#ifndef arch_mutex_cpu_relax
+# define arch_mutex_cpu_relax() cpu_relax()
+#endif
+
+#ifndef smp_mb__load_acquire
+# ifdef CONFIG_X86
+#   define smp_mb__load_acquire()	barrier()
+# else
+#   define smp_mb__load_acquire()	smp_mb()
+# endif
+#endif
+
+#ifndef smp_mb__store_release
+# ifdef CONFIG_X86
+#   define smp_mb__store_release()	barrier()
+# else
+#   define smp_mb__store_release()	smp_mb()
+# endif
+#endif
+
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to be added to the queue
+ */
+static __always_inline void
+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+	struct qrwnode *prev;
+
+	node->next = NULL;
+	node->wait = true;
+	prev = xchg(&lock->waitq, node);
+	if (prev) {
+		prev->next = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (ACCESS_ONCE(node->wait))
+			arch_mutex_cpu_relax();
+		smp_mb__load_acquire();
+	}
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to the current head of queue
+ */
+static __always_inline void
+signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+	struct qrwnode *next;
+
+	/*
+	 * Try to notify the next node first without disturbing the cacheline
+	 * of the lock. If that fails, check to see if it is the last node
+	 * and so should clear the wait queue.
+	 */
+	next = ACCESS_ONCE(node->next);
+	if (likely(next))
+		goto notify_next;
+
+	/*
+	 * Clear the wait queue if it is the last node
+	 */
+	if ((ACCESS_ONCE(lock->waitq) == node) &&
+	    (cmpxchg(&lock->waitq, node, NULL) == node))
+			return;
+	/*
+	 * Wait until the next one in queue sets up the next field
+	 */
+	while (likely(!(next = ACCESS_ONCE(node->next))))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+notify_next:
+	smp_mb__store_release();
+	ACCESS_ONCE(next->wait) = false;
+}
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock: Pointer to queue rwlock structure
+ * @cnts: Current queue rwlock counts structure
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
+{
+	while (cnts.writer == _QW_LOCKED) {
+		arch_mutex_cpu_relax();
+		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+	}
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+	struct qrwnode node;
+	union qrwcnts cnts;
+
+	/*
+	 * Readers come here when they cannot get the lock without waiting
+	 */
+	if (unlikely(irq_count())) {
+		/*
+		 * Readers in interrupt context will spin until the lock is
+		 * available without waiting in the queue.
+		 */
+		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
+		rspin_until_writer_unlock(lock, cnts);
+		return;
+	}
+	add_smp(&lock->cnts.rw, -_QR_BIAS);
+
+	/*
+	 * Put the reader into the wait queue
+	 */
+	wait_in_queue(lock, &node);
+
+	/*
+	 * At the head of the wait queue now, wait until the writer state
+	 * goes to 0 and then try to increment the reader count and get
+	 * the lock.
+	 */
+	while (ACCESS_ONCE(lock->cnts.writer))
+		arch_mutex_cpu_relax();
+	cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
+	rspin_until_writer_unlock(lock, cnts);
+	/*
+	 * Need to have a barrier with read-acquire semantics
+	 */
+	smp_mb__load_acquire();
+	signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * qwrite_trylock - Try to acquire the write lock
+ * @lock : Pointer to queue rwlock structure
+ * @old  : The current queue rwlock count structure
+ * Return: 1 if lock acquired, 0 otherwise
+ */
+static __always_inline int
+qwrite_trylock(struct qrwlock *lock, union qrwcnts old)
+{
+	register union qrwcnts new;
+
+	new.rw     = old.rw;
+	new.writer = _QW_LOCKED;
+	if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_write_3step_lock - acquire write lock in 3 steps
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 otherwise
+ *
+ * Step 1 - Try to acquire the lock directly if no reader is present
+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
+ * Step 3 - When the readers field goes to 0, set the locked flag
+ *
+ * On x86, the use of noinline generates slightly better optimized code
+ * with fewer memory accesses.
+ */
+static noinline int queue_write_3step_lock(struct qrwlock *lock)
+{
+	register union qrwcnts old;
+
+	old.rw = ACCESS_ONCE(lock->cnts.rw);
+
+	/* Step 1 */
+	if (!old.rw && qwrite_trylock(lock, old))
+		return 1;
+
+	/* Step 2 */
+	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, _QW_WAITING) != 0))
+		return 0;
+
+	/* Step 3 */
+	arch_mutex_cpu_relax();
+	old.rw = ACCESS_ONCE(lock->cnts.rw);
+	while ((old.rw > _QW_WAITING) || !qwrite_trylock(lock, old)) {
+		arch_mutex_cpu_relax();
+		old.rw = ACCESS_ONCE(lock->cnts.rw);
+	}
+	return 1;
+}
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+	struct qrwnode node;
+
+	/*
+	 * Put the writer into the wait queue
+	 */
+	wait_in_queue(lock, &node);
+
+	/*
+	 * At the head of the wait queue now, call queue_write_3step_lock()
+	 * repeatedly until the lock is acquired.
+	 */
+	while (!queue_write_3step_lock(lock))
+		arch_mutex_cpu_relax();
+	signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 2/4] qrwlock x86: Enable x86 to use queue read/write lock
  2013-11-22 19:04 [PATCH v7 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
  2013-11-22 19:04 ` [PATCH v7 1/4] qrwlock: A " Waiman Long
@ 2013-11-22 19:04 ` Waiman Long
  2013-11-22 19:04   ` Waiman Long
  2013-12-17 19:22   ` Paul E. McKenney
  2013-11-22 19:04 ` [PATCH v7 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
  2013-11-22 19:04 ` [PATCH v7 4/4] qrwlock: Use smp_store_release() in write_unlock() Waiman Long
  3 siblings, 2 replies; 26+ messages in thread
From: Waiman Long @ 2013-11-22 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton,
	Waiman Long

This patch makes the necessary changes at the x86 architecture-specific
layer so that the CONFIG_QUEUE_RWLOCK kernel option can replace the
regular read/write lock with the queue read/write lock.

It also enables the CONFIG_QUEUE_RWLOCK option by default for x86,
which forces the use of the queue read/write lock. That greatly
improves the fairness of the read/write lock and eliminates the
live-lock situation where one task may not get the lock for an
indefinite period of time.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/spinlock.h       |    2 ++
 arch/x86/include/asm/spinlock_types.h |    4 ++++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e903c71..c4a9c54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select RTC_LIB
 	select HAVE_DEBUG_STACKOVERFLOW
 	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
+	select QUEUE_RWLOCK
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..8fb88c5 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 		cpu_relax();
 }
 
+#ifndef CONFIG_QUEUE_RWLOCK
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 	asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
 		     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
 }
+#endif /* CONFIG_QUEUE_RWLOCK */
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..a585635 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
 
+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm-generic/qrwlock.h>
+#else
 #include <asm/rwlock.h>
+#endif
 
 #endif /* _ASM_X86_SPINLOCK_TYPES_H */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-22 19:04 [PATCH v7 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
  2013-11-22 19:04 ` [PATCH v7 1/4] qrwlock: A " Waiman Long
  2013-11-22 19:04 ` [PATCH v7 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
@ 2013-11-22 19:04 ` Waiman Long
  2013-12-17 19:23   ` Paul E. McKenney
  2013-11-22 19:04 ` [PATCH v7 4/4] qrwlock: Use smp_store_release() in write_unlock() Waiman Long
  3 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2013-11-22 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton,
	Waiman Long

There is a pending MCS lock patch series that adds generic MCS lock
helper functions to do MCS-style locking. This patch enables the
queue rwlock to use those generic MCS lock/unlock primitives for
internal queuing. This patch should only be merged after that
generic MCS locking patch series is merged.
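
For reference, the helper interface assumed from that series is
roughly the following (a sketch, not necessarily the final merged
form):

struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;		/* 1 if lock acquired */
};

/* Spin until this node reaches the head of the queue rooted at *lock */
void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node);

/* Pass queue-head status on to the next waiter, if there is one */
void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node);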

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |    7 +--
 kernel/locking/qrwlock.c      |   83 +++-------------------------------------
 2 files changed, 9 insertions(+), 81 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 9d085cb..335473a 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -37,10 +37,7 @@
  * the writer field. The least significant 8 bits is the writer field
  * whereas the remaining 24 bits is the reader count.
  */
-struct qrwnode {
-	struct qrwnode *next;
-	bool		wait;	/* Waiting flag */
-};
+struct mcs_spinlock;
 
 typedef struct qrwlock {
 	union qrwcnts {
@@ -55,7 +52,7 @@ typedef struct qrwlock {
 		};
 		u32	rw;		/* Reader/writer number pair */
 	} cnts;
-	struct qrwnode *waitq;		/* Tail of waiting queue */
+	struct mcs_spinlock *waitq;	/* Tail of waiting queue */
 } arch_rwlock_t;
 
 /*
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index ea5553d..b3e5c2b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -20,6 +20,7 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/mcs_spinlock.h>
 #include <asm-generic/qrwlock.h>
 
 /*
@@ -55,76 +56,6 @@
 # endif
 #endif
 
-#ifndef smp_mb__store_release
-# ifdef CONFIG_X86
-#   define smp_mb__store_release()	barrier()
-# else
-#   define smp_mb__store_release()	smp_mb()
-# endif
-#endif
-
-/**
- * wait_in_queue - Add to queue and wait until it is at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to be added to the queue
- */
-static __always_inline void
-wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
-{
-	struct qrwnode *prev;
-
-	node->next = NULL;
-	node->wait = true;
-	prev = xchg(&lock->waitq, node);
-	if (prev) {
-		prev->next = node;
-		/*
-		 * Wait until the waiting flag is off
-		 */
-		while (ACCESS_ONCE(node->wait))
-			arch_mutex_cpu_relax();
-		smp_mb__load_acquire();
-	}
-}
-
-/**
- * signal_next - Signal the next one in queue to be at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to the current head of queue
- */
-static __always_inline void
-signal_next(struct qrwlock *lock, struct qrwnode *node)
-{
-	struct qrwnode *next;
-
-	/*
-	 * Try to notify the next node first without disturbing the cacheline
-	 * of the lock. If that fails, check to see if it is the last node
-	 * and so should clear the wait queue.
-	 */
-	next = ACCESS_ONCE(node->next);
-	if (likely(next))
-		goto notify_next;
-
-	/*
-	 * Clear the wait queue if it is the last node
-	 */
-	if ((ACCESS_ONCE(lock->waitq) == node) &&
-	    (cmpxchg(&lock->waitq, node, NULL) == node))
-			return;
-	/*
-	 * Wait until the next one in queue sets up the next field
-	 */
-	while (likely(!(next = ACCESS_ONCE(node->next))))
-		arch_mutex_cpu_relax();
-	/*
-	 * The next one in queue is now at the head
-	 */
-notify_next:
-	smp_mb__store_release();
-	ACCESS_ONCE(next->wait) = false;
-}
-
 /**
  * rspin_until_writer_unlock - inc reader count & spin until writer is gone
  * @lock: Pointer to queue rwlock structure
@@ -148,7 +79,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
  */
 void queue_read_lock_slowpath(struct qrwlock *lock)
 {
-	struct qrwnode node;
+	struct mcs_spinlock node;
 	union qrwcnts cnts;
 
 	/*
@@ -168,7 +99,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
 	/*
 	 * Put the reader into the wait queue
 	 */
-	wait_in_queue(lock, &node);
+	mcs_spin_lock(&lock->waitq, &node);
 
 	/*
 	 * At the head of the wait queue now, wait until the writer state
@@ -183,7 +114,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
 	 * Need to have a barrier with read-acquire semantics
 	 */
 	smp_mb__load_acquire();
-	signal_next(lock, &node);
+	mcs_spin_unlock(&lock->waitq, &node);
 }
 EXPORT_SYMBOL(queue_read_lock_slowpath);
 
@@ -247,12 +178,12 @@ static noinline int queue_write_3step_lock(struct qrwlock *lock)
  */
 void queue_write_lock_slowpath(struct qrwlock *lock)
 {
-	struct qrwnode node;
+	struct mcs_spinlock node;
 
 	/*
 	 * Put the writer into the wait queue
 	 */
-	wait_in_queue(lock, &node);
+	mcs_spin_lock(&lock->waitq, &node);
 
 	/*
 	 * At the head of the wait queue now, call queue_write_3step_lock()
@@ -260,6 +191,6 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
 	 */
 	while (!queue_write_3step_lock(lock))
 		arch_mutex_cpu_relax();
-	signal_next(lock, &node);
+	mcs_spin_unlock(&lock->waitq, &node);
 }
 EXPORT_SYMBOL(queue_write_lock_slowpath);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 4/4] qrwlock: Use smp_store_release() in write_unlock()
  2013-11-22 19:04 [PATCH v7 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
                   ` (2 preceding siblings ...)
  2013-11-22 19:04 ` [PATCH v7 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
@ 2013-11-22 19:04 ` Waiman Long
  2013-12-17 19:24   ` Paul E. McKenney
  3 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2013-11-22 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
  Cc: linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton,
	Waiman Long

This patch modifies the queue_write_unlock() function to use the
new smp_store_release() function from another pending patch. This
patch should only be merged after that other patch has been merged.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 335473a..59c5972 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -176,9 +176,7 @@ static inline void queue_write_unlock(struct qrwlock *lock)
 	/*
 	 * Make sure that none of the critical section will be leaked out.
 	 */
-	smp_mb__before_clear_bit();
-	ACCESS_ONCE(lock->cnts.writer) = 0;
-	smp_mb__after_clear_bit();
+	smp_store_release(&lock->cnts.writer, 0);
 }
 
 /*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-11-22 19:04 ` [PATCH v7 1/4] qrwlock: A " Waiman Long
@ 2013-11-22 19:14   ` Linus Torvalds
  2013-11-22 20:35     ` Waiman Long
  2013-12-17 19:21   ` Paul E. McKenney
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-11-22 19:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch@vger.kernel.org, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran, Scott J Norton

On Fri, Nov 22, 2013 at 11:04 AM, Waiman Long <Waiman.Long@hp.com> wrote:
>
> In term of single-thread performance (no contention), a 256K
> lock/unlock loop was run on a 2.4GHz and 2.93Ghz Westmere x86-64
> CPUs. The following table shows the average time (in ns) for a single
> lock/unlock sequence (including the looping and timing overhead):
>
> Lock Type                   2.4GHz      2.93GHz
> ---------                   ------      -------
> Ticket spinlock              14.9        12.3
> Read lock                    17.0        13.5
> Write lock                   17.0        13.5
> Queue read lock              16.0        13.4
> Queue write lock              9.2         7.8

Can you verify for me that you re-did those numbers? Because it used
to be that the fair queue write lock was slower than the numbers you
now quote..

Was the cost of the fair queue write lock purely in the extra
conditional testing for whether the lock was supposed to be fair or
not, and now that you dropped that, it's fast? If so, then that's an
extra argument for the old conditional fair/unfair being complete
garbage.

Alternatively, maybe you just took the old timings, and the above
numbers are for the old unfair code, and *not* for the actual patch
you sent out?

So please double-check and verify.

              Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-11-22 19:14   ` Linus Torvalds
@ 2013-11-22 20:35     ` Waiman Long
  2013-11-22 21:14       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2013-11-22 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch@vger.kernel.org, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran, Scott J Norton

On 11/22/2013 02:14 PM, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 11:04 AM, Waiman Long<Waiman.Long@hp.com>  wrote:
>> In term of single-thread performance (no contention), a 256K
>> lock/unlock loop was run on a 2.4GHz and 2.93Ghz Westmere x86-64
>> CPUs. The following table shows the average time (in ns) for a single
>> lock/unlock sequence (including the looping and timing overhead):
>>
>> Lock Type                   2.4GHz      2.93GHz
>> ---------                   ------      -------
>> Ticket spinlock              14.9        12.3
>> Read lock                    17.0        13.5
>> Write lock                   17.0        13.5
>> Queue read lock              16.0        13.4
>> Queue write lock              9.2         7.8
> Can you verify for me that you re-did those numbers? Because it used
> to be that the fair queue write lock was slower than the numbers you
> now quote..
>
> Was the cost of the fair queue write lock purely in the extra
> conditional testing for whether the lock was supposed to be fair or
> not, and now that you dropped that, it's fast? If so, then that's an
> extra argument for the old conditional fair/unfair being complete
> garbage.

Yes, the extra latency of the fair lock in the earlier patch was due
to the need to do a second cmpxchg(). That could be avoided by doing a
read first, but that is not good for the cache.  So I had optimized it
for the default unfair lock. By supporting only one version, there is
no need to do a second cmpxchg anymore.
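
For reference, the v7 write-lock fastpath from patch 1 is now just a
single cmpxchg() when the lock is free:

static inline void queue_write_lock(struct qrwlock *lock)
{
	/* Acquire the lock directly if no reader or writer is present */
	if (cmpxchg(&lock->cnts.rw, 0, _QW_LOCKED) == 0)
		return;
	queue_write_lock_slowpath(lock);
}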

> Alternatively, maybe you just took the old timings, and the above
> numbers are for the old unfair code, and *not* for the actual patch
> you sent out?
>
> So please double-check and verify.
>
>                Linus

I reran the timing test on the 2.93GHz processor. The timing is
practically the same. I reused the old numbers for the 2.4GHz processor.

Regards,
Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-11-22 20:35     ` Waiman Long
@ 2013-11-22 21:14       ` Linus Torvalds
  0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-11-22 21:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch@vger.kernel.org, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran, Scott J Norton

On Fri, Nov 22, 2013 at 12:35 PM, Waiman Long <waiman.long@hp.com> wrote:
>
> Yes, the extra latency of the fair lock in earlier patch is due to the need
> to do a second cmpxchg(). That can be avoided by doing a read first, but
> that is not good for good cache.  So I optimized it for the default unfair
> lock. By supporting only one version, there is no need to do a second
> cmpxchg anymore.

Ok, thanks.

> I reran the timing test on the 2.93GHz processor. The timing is the
> practically the same. I reused the old one for the 2.4GHz processor.

Sounds good. I just wanted to make sure the numbers were sane.

            Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-11-22 19:04 ` [PATCH v7 1/4] qrwlock: A " Waiman Long
  2013-11-22 19:14   ` Linus Torvalds
@ 2013-12-17 19:21   ` Paul E. McKenney
  2013-12-17 19:27     ` Linus Torvalds
  2013-12-18 18:45     ` Waiman Long
  1 sibling, 2 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-12-17 19:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran, Scott J Norton

On Fri, Nov 22, 2013 at 02:04:44PM -0500, Waiman Long wrote:
> This patch introduces a new read/write lock implementation that put
> waiting readers and writers into a queue instead of actively contending
> the lock like the current read/write lock implementation. This will
> improve performance in highly contended situation by reducing the
> cache line bouncing effect.

Much improved, but still some issues called out inline.

							Thanx, Paul

> The queue read/write lock (qrwlock) is a fair lock even though there
> is still a slight chance of lock stealing if a reader or writer comes
> at the right moment.  Other than that, lock granting is done in a
> FIFO manner.  As a result, it is possible to determine a maximum time
> period after which the waiting is over and the lock can be acquired.
> 
> Internally, however, there is a second type of readers which try to
> steal lock aggressively. They simply increments the reader count and
> wait until the writer releases the lock. The transition to aggressive
> reader happens in the read lock slowpath when
> 
>  1. In an interrupt context.
>  2. When a reader comes to the head of the wait queue and sees
>     the release of a write lock.
> 
> The queue read lock is safe to use in an interrupt context (softirq
> or hardirq) as it will switch to become an aggressive reader in such
> environment allowing recursive read lock.
> 
> The only downside of queue rwlock is the size increase in the lock
> structure by 4 bytes for 32-bit systems and by 12 bytes for 64-bit
> systems.
> 
> In term of single-thread performance (no contention), a 256K
> lock/unlock loop was run on a 2.4GHz and 2.93Ghz Westmere x86-64
> CPUs. The following table shows the average time (in ns) for a single
> lock/unlock sequence (including the looping and timing overhead):
> 
> Lock Type		    2.4GHz	2.93GHz
> ---------		    ------	-------
> Ticket spinlock		     14.9	 12.3
> Read lock		     17.0	 13.5
> Write lock		     17.0	 13.5
> Queue read lock	     	     16.0	 13.4
> Queue write lock	      9.2	  7.8
> 
> The queue read lock is slightly slower than the spinlock, but is
> slightly faster than the read lock. The queue write lock, however,
> is the fastest of all. It is almost twice as fast as the write lock
> and about 1.5X of the spinlock.
> 
> With lock contention, the speed of each individual lock/unlock function
> is less important than the amount of contention-induced delays.
> 
> To investigate the performance characteristics of the queue rwlock
> compared with the regular rwlock, Ingo's anon_vmas patch that converts
> rwsem to rwlock was applied to a 3.12 kernel. This kernel was then
> tested under the following 3 conditions:
> 
>  1) Plain 3.12
>  2) Ingo's patch
>  3) Ingo's patch + qrwlock
> 
> Each of the 3 kernels were booted up twice with and without the
> "idle=poll" kernel parameter which keeps the CPUs in C0 state while
> idling instead of a more energy-saving sleep state.  The jobs per
> minutes (JPM) results of the AIM7's high_systime workload at 1500
> users on a 8-socket 80-core DL980 (HT off) were:
> 
>  Kernel	    JPMs	%Change from (1)
>  ------	    ----	----------------
>    1	145704/227295		-
>    2	229750/236066	    +58%/+3.8%
>    4	240062/248606	    +65%/+9.4%
> 
> The first JPM number is without the "idle=poll" kernel parameter,
> the second number is with that parameter. It can be seen that most
> of the performance benefit of converting rwsem to rwlock actually
> come from the latency improvement of not needing to wake up a CPU
> from deep sleep state when work is available.
> 
> The use of non-sleeping locks did improve performance by eliminating
> the context switching cost. Using queue rwlock gave almost tripling
> of performance gain. The performance gain was reduced somewhat with
> a fair lock which was to be expected.
> 
> Looking at the perf profiles (with idle=poll) below, we can clearly see
> that other bottlenecks were constraining the performance improvement.
> 
> Perf profile of kernel (2):
> 
>  18.65%    reaim  [kernel.kallsyms]  [k] __write_lock_failed
>   9.00%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   5.21%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
>   3.08%    reaim  [kernel.kallsyms]  [k] mspin_lock
>   2.50%    reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>   2.00%       ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   1.29%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>   1.21%    reaim  [kernel.kallsyms]  [k] __read_lock_failed
>   1.12%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
>   1.10%    reaim  [kernel.kallsyms]  [k] perf_event_aux
>   1.09%     true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> 
> Perf profile of kernel (3):
> 
>  20.14%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
>   7.94%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   5.41%    reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
>   5.01%    reaim  [kernel.kallsyms]  [k] mspin_lock
>   2.12%    reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>   2.07%       ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   1.58%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>   1.25%    reaim  [kernel.kallsyms]  [k] queue_write_3step_lock
>   1.18%    reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
>   1.14%     true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>   0.95%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
> 
> The spinlock bottlenecks were shown below.
> 
>   7.94%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>               |--59.72%-- release_pages
>               |--37.41%-- pagevec_lru_move_fn
>               |--0.82%-- get_page_from_freelist
>               |--0.73%-- __page_cache_release
>                --1.32%-- [...]
> 
> For both release_pages() & pagevec_lru_move_fn() function, the
> spinlock contention was on zone->lru_lock. With the queue spinlock
> patch, however, the contention went away with a lot more idle time
> available and the JPM number went up to 265532 which was an additional
> performance improvement.
> 
>  28.40%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
>   6.89%    reaim  [kernel.kallsyms]  [k] mspin_lock
>   4.17%    reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
>   2.10%    reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>   1.82%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
>   1.34%    reaim  [kernel.kallsyms]  [k] entity_tick
>   1.17%    reaim  [kernel.kallsyms]  [k] queue_write_3step_lock
>   1.06%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
>   0.86%    reaim  [kernel.kallsyms]  [k] perf_event_aux
>   0.83%       ls  [kernel.kallsyms]  [k] mspin_lock
>    :
>   0.53%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
>   0.14%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> 
> Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
> machine.  It was found the performance improvement of 11% was the
> same with regular rwlock or queue rwlock.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  include/asm-generic/qrwlock.h |  205 +++++++++++++++++++++++++++++++
>  kernel/Kconfig.locks          |    7 +
>  kernel/locking/Makefile       |    1 +
>  kernel/locking/qrwlock.c      |  265 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 478 insertions(+), 0 deletions(-)
>  create mode 100644 include/asm-generic/qrwlock.h
>  create mode 100644 kernel/locking/qrwlock.c
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> new file mode 100644
> index 0000000..9d085cb
> --- /dev/null
> +++ b/include/asm-generic/qrwlock.h
> @@ -0,0 +1,205 @@
> +/*
> + * Queue read/write lock
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
> + *
> + * Authors: Waiman Long <waiman.long@hp.com>
> + */
> +#ifndef __ASM_GENERIC_QRWLOCK_H
> +#define __ASM_GENERIC_QRWLOCK_H
> +
> +#include <linux/types.h>
> +#include <asm/bitops.h>
> +#include <asm/cmpxchg.h>
> +#include <asm/barrier.h>
> +#include <asm/processor.h>
> +#include <asm/byteorder.h>
> +
> +#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
> +#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
> +#endif
> +
> +/*
> + * The queue read/write lock data structure
> + *
> + * The layout of the structure is endian-sensitive to make sure that adding
> + * _QR_BIAS to the rw field to increment the reader count won't disturb
> + * the writer field. The least significant 8 bits is the writer field
> + * whereas the remaining 24 bits is the reader count.
> + */
> +struct qrwnode {
> +	struct qrwnode *next;
> +	bool		wait;	/* Waiting flag */
> +};
> +
> +typedef struct qrwlock {
> +	union qrwcnts {
> +		struct {
> +#ifdef __LITTLE_ENDIAN
> +			u8  writer;	/* Writer state		*/
> +#else
> +			u16 r16;	/* Reader count - msb	*/
> +			u8  r8;		/* Reader count - lsb	*/
> +			u8  writer;	/* Writer state		*/
> +#endif
> +		};
> +		u32	rw;		/* Reader/writer number pair */
> +	} cnts;
> +	struct qrwnode *waitq;		/* Tail of waiting queue */
> +} arch_rwlock_t;
> +
> +/*
> + * Writer states & reader shift and bias
> + */
> +#define	_QW_WAITING	1		/* A writer is waiting	   */
> +#define	_QW_LOCKED	0xff		/* A writer holds the lock */
> +#define	_QR_SHIFT	8		/* Reader count shift	   */
> +#define _QR_BIAS	(1U << _QR_SHIFT)
> +
> +/*
> + * External function declarations
> + */
> +extern void queue_read_lock_slowpath(struct qrwlock *lock);
> +extern void queue_write_lock_slowpath(struct qrwlock *lock);
> +
> +/**
> + * queue_read_can_lock- would read_trylock() succeed?
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline int queue_read_can_lock(struct qrwlock *lock)
> +{
> +	return !ACCESS_ONCE(lock->cnts.writer);
> +}
> +
> +/**
> + * queue_write_can_lock- would write_trylock() succeed?
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline int queue_write_can_lock(struct qrwlock *lock)
> +{
> +	return !ACCESS_ONCE(lock->cnts.rw);
> +}
> +
> +/**
> + * queue_read_trylock - try to acquire read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_read_trylock(struct qrwlock *lock)
> +{
> +	union qrwcnts cnts;
> +
> +	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	if (likely(!cnts.writer)) {
> +		cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);

Looks like xadd() is x86-specific, but this is common code.  One
approach would be to do xadd() for other arches, another approach
would be to make .rw be an atomic_t rather than a u32.  Making it
be atomic_t is probably easiest.  (The cmpxchg()s would then need
to be atomic_cmpxchg().)

Ditto for add_smp() further down.
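
For example, here is a completely untested sketch of what the trylock
could look like with an atomic_t, ignoring the endian union and simply
masking out the writer byte instead (_QW_WMASK is a made-up 0xff mask,
and I am assuming ->cnts itself becomes the atomic_t):

	static inline int queue_read_trylock(struct qrwlock *lock)
	{
		u32 cnts;

		cnts = atomic_read(&lock->cnts);
		if (likely(!(cnts & _QW_WMASK))) {
			/* atomic_add_return() - bias == old value, i.e. xadd() semantics */
			cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
			if (likely(!(cnts & _QW_WMASK)))
				return 1;
			atomic_sub(_QR_BIAS, &lock->cnts);
		}
		return 0;
	}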

> +		if (likely(!cnts.writer))
> +			return 1;
> +		add_smp(&lock->cnts.rw, -_QR_BIAS);
> +	}
> +	return 0;
> +}
> +
> +/**
> + * queue_write_trylock - try to acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_write_trylock(struct qrwlock *lock)
> +{
> +	union qrwcnts old, new;
> +
> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +	if (likely(!old.rw)) {
> +		new.rw = old.rw;
> +		new.writer = _QW_LOCKED;
> +		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> +			return 1;
> +	}
> +	return 0;
> +}
> +/**
> + * queue_read_lock - acquire read lock of a queue rwlock
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline void queue_read_lock(struct qrwlock *lock)
> +{
> +	union qrwcnts cnts;
> +
> +	cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
> +	if (likely(!cnts.writer))
> +		return;
> +	/*
> +	 * Slowpath will decrement the reader count, if necessary
> +	 */
> +	queue_read_lock_slowpath(lock);
> +}
> +
> +/**
> + * queue_write_lock - acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_lock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Optimize for the unfair lock case where the fair flag is 0.
> +	 */
> +	if (cmpxchg(&lock->cnts.rw, 0, _QW_LOCKED) == 0)
> +		return;
> +	queue_write_lock_slowpath(lock);
> +}
> +
> +/**
> + * queue_read_unlock - release read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_read_unlock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Atomically decrement the reader count
> +	 */
> +	add_smp(&lock->cnts.rw, -_QR_BIAS);
> +}
> +
> +/**
> + * queue_write_unlock - release write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Make sure that none of the critical section will be leaked out.
> +	 */
> +	smp_mb__before_clear_bit();
> +	ACCESS_ONCE(lock->cnts.writer) = 0;
> +	smp_mb__after_clear_bit();

Interesting combination...  It does seem to work out, though.
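
With the new smp_store_release() this would presumably collapse into a
single release store -- which, I see, is exactly what patch 4/4 does:

	smp_store_release(&lock->cnts.writer, 0);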

> +}
> +
> +/*
> + * Initializer
> + */
> +#define	__ARCH_RW_LOCK_UNLOCKED	{ .cnts = { .rw = 0 }, .waitq = NULL }
> +
> +/*
> + * Remapping rwlock architecture specific functions to the corresponding
> + * queue rwlock functions.
> + */
> +#define arch_read_can_lock(l)	queue_read_can_lock(l)
> +#define arch_write_can_lock(l)	queue_write_can_lock(l)
> +#define arch_read_lock(l)	queue_read_lock(l)
> +#define arch_write_lock(l)	queue_write_lock(l)
> +#define arch_read_trylock(l)	queue_read_trylock(l)
> +#define arch_write_trylock(l)	queue_write_trylock(l)
> +#define arch_read_unlock(l)	queue_read_unlock(l)
> +#define arch_write_unlock(l)	queue_write_unlock(l)
> +
> +#endif /* __ASM_GENERIC_QRWLOCK_H */
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index d2b32ac..b665478 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -223,3 +223,10 @@ endif
>  config MUTEX_SPIN_ON_OWNER
>  	def_bool y
>  	depends on SMP && !DEBUG_MUTEXES
> +
> +config ARCH_QUEUE_RWLOCK
> +	bool
> +
> +config QUEUE_RWLOCK
> +	def_bool y if ARCH_QUEUE_RWLOCK
> +	depends on SMP
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index baab8e5..3e7bab1 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
>  obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
>  obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
>  obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
> +obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> new file mode 100644
> index 0000000..ea5553d
> --- /dev/null
> +++ b/kernel/locking/qrwlock.c
> @@ -0,0 +1,265 @@
> +/*
> + * Queue read/write lock
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
> + *
> + * Authors: Waiman Long <waiman.long@hp.com>
> + */
> +#include <linux/smp.h>
> +#include <linux/bug.h>
> +#include <linux/cpumask.h>
> +#include <linux/percpu.h>
> +#include <linux/hardirq.h>
> +#include <asm-generic/qrwlock.h>
> +
> +/*
> + * Compared with regular rwlock, the queue rwlock has the following
> + * advantages:
> + * 1. Even though there is a slight chance of stealing the lock if come at
> + *    the right moment, the granting of the lock is mostly in FIFO order.
> + * 2. It is usually faster in high contention situation.
> + *
> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
> + * and 12 bytes larger in 64-bit systems.
> + *
> + * There are two queues for writers. The writer field of the lock is a
> + * one-slot wait queue. The writers that follow will have to wait in the
> + * combined reader/writer queue (waitq).
> + *
> + * Compared with x86 ticket spinlock, the queue rwlock is faster in high
> + * contention situation. The writer lock is also faster in single thread
> + * operations. Therefore, queue rwlock can be considered as a replacement
> + * for those spinlocks that are highly contended as long as an increase
> + * in lock size is not an issue.

Judging from the #defines at the end of qrwlock.h, this replacement is
done on a file-by-file basis?  Looks to me more like the replacement
must be all-or-nothing across the entire kernel, otherwise trouble would
ensue for locks used across multiple files.  What am I missing here?

> + */
> +
> +#ifndef arch_mutex_cpu_relax
> +# define arch_mutex_cpu_relax() cpu_relax()
> +#endif
> +
> +#ifndef smp_mb__load_acquire
> +# ifdef CONFIG_X86
> +#   define smp_mb__load_acquire()	barrier()
> +# else
> +#   define smp_mb__load_acquire()	smp_mb()
> +# endif
> +#endif
> +
> +#ifndef smp_mb__store_release
> +# ifdef CONFIG_X86
> +#   define smp_mb__store_release()	barrier()
> +# else
> +#   define smp_mb__store_release()	smp_mb()
> +# endif
> +#endif

These are now smp_load_acquire() and smp_store_release().

> +
> +/**
> + * wait_in_queue - Add to queue and wait until it is at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to be added to the queue
> + */
> +static __always_inline void
> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *prev;
> +
> +	node->next = NULL;
> +	node->wait = true;
> +	prev = xchg(&lock->waitq, node);
> +	if (prev) {
> +		prev->next = node;
> +		/*
> +		 * Wait until the waiting flag is off
> +		 */
> +		while (ACCESS_ONCE(node->wait))
> +			arch_mutex_cpu_relax();
> +		smp_mb__load_acquire();

		while (smp_load_acquire(&node->wait))
			arch_mutex_cpu_relax();

On TSO systems like x86, this should generate the same code.

> +	}
> +}
> +
> +/**
> + * signal_next - Signal the next one in queue to be at the head
> + * @lock: Pointer to queue rwlock structure
> + * @node: Node pointer to the current head of queue
> + */
> +static __always_inline void
> +signal_next(struct qrwlock *lock, struct qrwnode *node)
> +{
> +	struct qrwnode *next;
> +
> +	/*
> +	 * Try to notify the next node first without disturbing the cacheline
> +	 * of the lock. If that fails, check to see if it is the last node
> +	 * and so should clear the wait queue.
> +	 */
> +	next = ACCESS_ONCE(node->next);
> +	if (likely(next))
> +		goto notify_next;
> +
> +	/*
> +	 * Clear the wait queue if it is the last node
> +	 */
> +	if ((ACCESS_ONCE(lock->waitq) == node) &&
> +	    (cmpxchg(&lock->waitq, node, NULL) == node))
> +			return;
> +	/*
> +	 * Wait until the next one in queue set up the next field
> +	 */
> +	while (likely(!(next = ACCESS_ONCE(node->next))))
> +		arch_mutex_cpu_relax();
> +	/*
> +	 * The next one in queue is now at the head
> +	 */
> +notify_next:
> +	smp_mb__store_release();
> +	ACCESS_ONCE(next->wait) = false;

The above pair of lines can be simply:

	smp_store_release(&next->wait, false);

This pairs nicely with the smp_load_acquire() in wait_in_queue().

> +}
> +
> +/**
> + * rspin_until_writer_unlock - inc reader count & spin until writer is gone
> + * @lock: Pointer to queue rwlock structure
> + * @cnts: Current queue rwlock counts structure
> + *
> + * In interrupt context or at the head of the queue, the reader will just
> + * increment the reader count & wait until the writer releases the lock.
> + */
> +static __always_inline void
> +rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
> +{
> +	while (cnts.writer == _QW_LOCKED) {
> +		arch_mutex_cpu_relax();
> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	}
> +}
> +
> +/**
> + * queue_read_lock_slowpath - acquire read lock of a queue rwlock
> + * @lock: Pointer to queue rwlock structure
> + */
> +void queue_read_lock_slowpath(struct qrwlock *lock)
> +{
> +	struct qrwnode node;
> +	union qrwcnts cnts;
> +
> +	/*
> +	 * Readers come here when it cannot get the lock without waiting

Grammar nit: s/it/they/
Or s/Readers come/Each reader comes/

> +	 */
> +	if (unlikely(irq_count())) {
> +		/*
> +		 * Readers in interrupt context will spin until the lock is
> +		 * available without waiting in the queue.
> +		 */
> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);

This needs to be:

		cnts.rw = smp_load_acquire(&lock->cnts.rw);

I was going to argue that the above assignment should be pushed into
rspin_until_writer_unlock(), but I see that this won't work for the
call later in this function.  ;-)

> +		rspin_until_writer_unlock(lock, cnts);

We do need a memory barrier in this path, otherwise we are not guaranteed
to see the writer's critical section.  One approach would be to make
rspin_until_writer_unlock()s "while" loop body do:

		arch_mutex_cpu_relax();
		cnts.rw = smp_load_acquire(&lock->cnts.rw);
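
In other words, the whole helper would then look something like this
(untested):

	static __always_inline void
	rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
	{
		while (cnts.writer == _QW_LOCKED) {
			arch_mutex_cpu_relax();
			/* acquire: cannot miss the writer's critical section */
			cnts.rw = smp_load_acquire(&lock->cnts.rw);
		}
	}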

> +		return;
> +	}
> +	add_smp(&lock->cnts.rw, -_QR_BIAS);
> +
> +	/*
> +	 * Put the reader into the wait queue
> +	 */
> +	wait_in_queue(lock, &node);
> +
> +	/*
> +	 * At the head of the wait queue now, wait until the writer state
> +	 * goes to 0 and then try to increment the reader count and get
> +	 * the lock.
> +	 */
> +	while (ACCESS_ONCE(lock->cnts.writer))
> +		arch_mutex_cpu_relax();
> +	cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
> +	rspin_until_writer_unlock(lock, cnts);
> +	/*
> +	 * Need to have a barrier with read-acquire semantics
> +	 */
> +	smp_mb__load_acquire();

Making rspin_until_writer_unlock() do an smp_load_acquire() makes this
unnecessary.

> +	signal_next(lock, &node);

Good, this allows multiple readers to acquire the lock concurrently,
give or take memory latency compared to critical-section duration.
When the first writer shows up, it presumably spins on the lock word.

> +}
> +EXPORT_SYMBOL(queue_read_lock_slowpath);
> +
> +/**
> + * qwrite_trylock - Try to acquire the write lock
> + * @lock : Pointer to queue rwlock structure
> + * @old  : The current queue rwlock count structure
> + * Return: 1 if lock acquired, 0 otherwise
> + */
> +static __always_inline int
> +qwrite_trylock(struct qrwlock *lock, union qrwcnts old)
> +{
> +	register union qrwcnts new;
> +
> +	new.rw     = old.rw;
> +	new.writer = _QW_LOCKED;
> +	if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> +		return 1;
> +	return 0;
> +}
> +
> +/**
> + * queue_write_3step_lock - acquire write lock in 3 steps
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 otherwise
> + *
> + * Step 1 - Try to acquire the lock directly if no reader is present
> + * Step 2 - Set the waiting flag to notify readers that a writer is waiting
> + * Step 3 - When the readers field goes to 0, set the locked flag
> + *
> + * In x86, the use of noinline generates slightly better optimized code
> + * with less memory access.
> + */
> +static noinline int queue_write_3step_lock(struct qrwlock *lock)
> +{
> +	register union qrwcnts old;
> +
> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +
> +	/* Step 1 */
> +	if (!old.rw && qwrite_trylock(lock, old))
> +		return 1;
> +
> +	/* Step 2 */
> +	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, _QW_WAITING) != 0))
> +		return 0;
> +
> +	/* Step 3 */
> +	arch_mutex_cpu_relax();
> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
> +	while ((old.rw > _QW_WAITING) || !qwrite_trylock(lock, old)) {
> +		arch_mutex_cpu_relax();
> +		old.rw = ACCESS_ONCE(lock->cnts.rw);
> +	}
> +	return 1;
> +}
> +
> +/**
> + * queue_write_lock_slowpath - acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +void queue_write_lock_slowpath(struct qrwlock *lock)
> +{
> +	struct qrwnode node;
> +
> +	/*
> +	 * Put the writer into the wait queue
> +	 */
> +	wait_in_queue(lock, &node);
> +
> +	/*
> +	 * At the head of the wait queue now, call queue_write_3step_lock()
> +	 * to acquire the lock until it is done.
> +	 */
> +	while (!queue_write_3step_lock(lock))
> +		arch_mutex_cpu_relax();
> +	signal_next(lock, &node);
> +}
> +EXPORT_SYMBOL(queue_write_lock_slowpath);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/4] qrwlock x86: Enable x86 to use queue read/write lock
  2013-11-22 19:04 ` [PATCH v7 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
  2013-11-22 19:04   ` Waiman Long
@ 2013-12-17 19:22   ` Paul E. McKenney
  1 sibling, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-12-17 19:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran", Scott J Norton

On Fri, Nov 22, 2013 at 02:04:45PM -0500, Waiman Long wrote:
> This patch makes the necessary changes at the x86 architecture specific
> layer to enable the presence of the CONFIG_QUEUE_RWLOCK kernel option
> to replace the read/write lock by the queue read/write lock.
> 
> It also enables the CONFIG_QUEUE_RWLOCK option by default for x86 which
> will force the use of queue read/write lock. That will greatly improve
> the fairness of read/write lock and eliminate live-lock situation
> where one task may not get the lock for an indefinite period of time.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  arch/x86/Kconfig                      |    1 +
>  arch/x86/include/asm/spinlock.h       |    2 ++
>  arch/x86/include/asm/spinlock_types.h |    4 ++++
>  3 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index e903c71..c4a9c54 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -124,6 +124,7 @@ config X86
>  	select RTC_LIB
>  	select HAVE_DEBUG_STACKOVERFLOW
>  	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
> +	select QUEUE_RWLOCK
> 
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index bf156de..8fb88c5 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
>  		cpu_relax();
>  }
> 
> +#ifndef CONFIG_QUEUE_RWLOCK
>  /*
>   * Read-write spinlocks, allowing multiple readers
>   * but only one writer.
> @@ -270,6 +271,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
>  	asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
>  		     : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
>  }
> +#endif /* CONFIG_QUEUE_RWLOCK */
> 
>  #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
>  #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
> diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
> index 4f1bea1..a585635 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -34,6 +34,10 @@ typedef struct arch_spinlock {
> 
>  #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
> 
> +#ifdef CONFIG_QUEUE_RWLOCK
> +#include <asm-generic/qrwlock.h>
> +#else
>  #include <asm/rwlock.h>
> +#endif
> 
>  #endif /* _ASM_X86_SPINLOCK_TYPES_H */
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing
  2013-11-22 19:04 ` [PATCH v7 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
@ 2013-12-17 19:23   ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-12-17 19:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran", Scott J Norton

On Fri, Nov 22, 2013 at 02:04:46PM -0500, Waiman Long wrote:
> There is a pending MCS lock patch series that adds generic MCS lock
> helper functions to do MCS-style locking. This patch will enable
> the queue rwlock to use that generic MCS lock/unlock primitives for
> internal queuing. This patch should only be merged after the merging
> of that generic MCS locking patch.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  include/asm-generic/qrwlock.h |    7 +--
>  kernel/locking/qrwlock.c      |   83 +++-------------------------------------
>  2 files changed, 9 insertions(+), 81 deletions(-)
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 9d085cb..335473a 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -37,10 +37,7 @@
>   * the writer field. The least significant 8 bits is the writer field
>   * whereas the remaining 24 bits is the reader count.
>   */
> -struct qrwnode {
> -	struct qrwnode *next;
> -	bool		wait;	/* Waiting flag */
> -};
> +struct mcs_spinlock;
> 
>  typedef struct qrwlock {
>  	union qrwcnts {
> @@ -55,7 +52,7 @@ typedef struct qrwlock {
>  		};
>  		u32	rw;		/* Reader/writer number pair */
>  	} cnts;
> -	struct qrwnode *waitq;		/* Tail of waiting queue */
> +	struct mcs_spinlock *waitq;	/* Tail of waiting queue */
>  } arch_rwlock_t;
> 
>  /*
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index ea5553d..b3e5c2b 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -20,6 +20,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/percpu.h>
>  #include <linux/hardirq.h>
> +#include <linux/mcs_spinlock.h>
>  #include <asm-generic/qrwlock.h>
> 
>  /*
> @@ -55,76 +56,6 @@
>  # endif
>  #endif
> 
> -#ifndef smp_mb__store_release
> -# ifdef CONFIG_X86
> -#   define smp_mb__store_release()	barrier()
> -# else
> -#   define smp_mb__store_release()	smp_mb()
> -# endif
> -#endif
> -
> -/**
> - * wait_in_queue - Add to queue and wait until it is at the head
> - * @lock: Pointer to queue rwlock structure
> - * @node: Node pointer to be added to the queue
> - */
> -static __always_inline void
> -wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> -{
> -	struct qrwnode *prev;
> -
> -	node->next = NULL;
> -	node->wait = true;
> -	prev = xchg(&lock->waitq, node);
> -	if (prev) {
> -		prev->next = node;
> -		/*
> -		 * Wait until the waiting flag is off
> -		 */
> -		while (ACCESS_ONCE(node->wait))
> -			arch_mutex_cpu_relax();
> -		smp_mb__load_acquire();
> -	}
> -}
> -
> -/**
> - * signal_next - Signal the next one in queue to be at the head
> - * @lock: Pointer to queue rwlock structure
> - * @node: Node pointer to the current head of queue
> - */
> -static __always_inline void
> -signal_next(struct qrwlock *lock, struct qrwnode *node)
> -{
> -	struct qrwnode *next;
> -
> -	/*
> -	 * Try to notify the next node first without disturbing the cacheline
> -	 * of the lock. If that fails, check to see if it is the last node
> -	 * and so should clear the wait queue.
> -	 */
> -	next = ACCESS_ONCE(node->next);
> -	if (likely(next))
> -		goto notify_next;
> -
> -	/*
> -	 * Clear the wait queue if it is the last node
> -	 */
> -	if ((ACCESS_ONCE(lock->waitq) == node) &&
> -	    (cmpxchg(&lock->waitq, node, NULL) == node))
> -			return;
> -	/*
> -	 * Wait until the next one in queue set up the next field
> -	 */
> -	while (likely(!(next = ACCESS_ONCE(node->next))))
> -		arch_mutex_cpu_relax();
> -	/*
> -	 * The next one in queue is now at the head
> -	 */
> -notify_next:
> -	smp_mb__store_release();
> -	ACCESS_ONCE(next->wait) = false;
> -}
> -
>  /**
>   * rspin_until_writer_unlock - inc reader count & spin until writer is gone
>   * @lock: Pointer to queue rwlock structure
> @@ -148,7 +79,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
>   */
>  void queue_read_lock_slowpath(struct qrwlock *lock)
>  {
> -	struct qrwnode node;
> +	struct mcs_spinlock node;
>  	union qrwcnts cnts;
> 
>  	/*
> @@ -168,7 +99,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
>  	/*
>  	 * Put the reader into the wait queue
>  	 */
> -	wait_in_queue(lock, &node);
> +	mcs_spin_lock(&lock->waitq, &node);
> 
>  	/*
>  	 * At the head of the wait queue now, wait until the writer state
> @@ -183,7 +114,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
>  	 * Need to have a barrier with read-acquire semantics
>  	 */
>  	smp_mb__load_acquire();
> -	signal_next(lock, &node);
> +	mcs_spin_unlock(&lock->waitq, &node);
>  }
>  EXPORT_SYMBOL(queue_read_lock_slowpath);
> 
> @@ -247,12 +178,12 @@ static noinline int queue_write_3step_lock(struct qrwlock *lock)
>   */
>  void queue_write_lock_slowpath(struct qrwlock *lock)
>  {
> -	struct qrwnode node;
> +	struct mcs_spinlock node;
> 
>  	/*
>  	 * Put the writer into the wait queue
>  	 */
> -	wait_in_queue(lock, &node);
> +	mcs_spin_lock(&lock->waitq, &node);
> 
>  	/*
>  	 * At the head of the wait queue now, call queue_write_3step_lock()
> @@ -260,6 +191,6 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
>  	 */
>  	while (!queue_write_3step_lock(lock))
>  		arch_mutex_cpu_relax();
> -	signal_next(lock, &node);
> +	mcs_spin_unlock(&lock->waitq, &node);
>  }
>  EXPORT_SYMBOL(queue_write_lock_slowpath);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/4] qrwlock: Use smp_store_release() in write_unlock()
  2013-11-22 19:04 ` [PATCH v7 4/4] qrwlock: Use smp_store_release() in write_unlock() Waiman Long
@ 2013-12-17 19:24   ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-12-17 19:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran", Scott J Norton

On Fri, Nov 22, 2013 at 02:04:47PM -0500, Waiman Long wrote:
> This patch modifies the queue_write_unlock() function to use the
> new smp_store_release() function in another pending patch. This patch
> should only be merged if the other patch was merged.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  include/asm-generic/qrwlock.h |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 335473a..59c5972 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -176,9 +176,7 @@ static inline void queue_write_unlock(struct qrwlock *lock)
>  	/*
>  	 * Make sure that none of the critical section will be leaked out.
>  	 */
> -	smp_mb__before_clear_bit();
> -	ACCESS_ONCE(lock->cnts.writer) = 0;
> -	smp_mb__after_clear_bit();
> +	smp_store_release(&lock->cnts.writer, 0);
>  }
> 
>  /*
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-17 19:21   ` Paul E. McKenney
@ 2013-12-17 19:27     ` Linus Torvalds
  2013-12-17 19:49       ` Paul E. McKenney
                         ` (2 more replies)
  2013-12-18 18:45     ` Waiman Long
  1 sibling, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-12-17 19:27 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Steven Rostedt, Andrew Morton, Michel Lespinasse,
	Andi Kleen, Rik van Riel, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton

On Tue, Dec 17, 2013 at 11:21 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Looks like xadd() is x86-specific, but this is common code.  One
> approach would be to do xadd() for other arches, another approach
> would be to make .rw be an atomic_t rather than a u32.  Making it
> be atomic_t is probably easiest.  (The cmpxchg()s would then need
> to be atomic_cmpxchg().)

Note that "xadd()" has different semantics from "atomic_add_return()".

xadd() returns the original value, while atomic_add_return() returns
the result of the addition.

In this case, we seem to want the xadd() semantics. I guess we can use
"atomic_add_return(val,&atomic)-val" and just assume that the compiler
gets it right (with the addition and the subtraction cancelling out).
Or maybe we should have a "atomic_add_return_original()" with xadd
semantics?
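
Something like this as the generic version, perhaps (untested sketch,
name up for debate):

	/* xadd()-style: return the value *before* the addition */
	static inline int atomic_add_return_original(int i, atomic_t *v)
	{
		return atomic_add_return(i, v) - i;
	}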

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-17 19:27     ` Linus Torvalds
@ 2013-12-17 19:49       ` Paul E. McKenney
  2013-12-18 18:51       ` Waiman Long
  2013-12-18 19:38       ` Andi Kleen
  2 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-12-17 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Steven Rostedt, Andrew Morton, Michel Lespinasse,
	Andi Kleen, Rik van Riel, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton

On Tue, Dec 17, 2013 at 11:27:29AM -0800, Linus Torvalds wrote:
> On Tue, Dec 17, 2013 at 11:21 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Looks like xadd() is x86-specific, but this is common code.  One
> > approach would be to do xadd() for other arches, another approach
> > would be to make .rw be an atomic_t rather than a u32.  Making it
> > be atomic_t is probably easiest.  (The cmpxchg()s would then need
> > to be atomic_cmpxchg().)
> 
> Note that "xadd()" has different semantics from "atomic_add_return()".

Gah, that one always trips me up.  :-/

> xadd() returns the original value, while atomic_add_return() returns
> the result of the addition.
> 
> In this case, we seem to want the xadd() semantics. I guess we can use
> "atomic_add_return(val,&atomic)-val" and just assume that the compiler
> gets it right (with the addition and the subtraction cancelling out).

That seems like it would work well.

> Or maybe we should have a "atomic_add_return_original()" with xadd
> semantics?

My lazy side prefers the autocancellation.  ;-)  But yes, there are a
number of architectures (including ARM and Power) where the compiler
would have to be very tricky to reach into an asm to do the cancellation.

So perhaps a generic atomic_add_return_original() that is defined in
terms of atomic_add_return() as you say above, allowing architectures
to override with more-efficient implementations?  The same could be done
for add_smp() and xadd(), for that matter.
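
That is, something along these lines (untested sketch; the x86 version
would just reuse its existing xadd() on atomic_t's ->counter):

	/* asm-generic default, in terms of atomic_add_return(): */
	#ifndef atomic_add_return_original
	# define atomic_add_return_original(i, v) \
		(atomic_add_return((i), (v)) - (i))
	#endif

	/* possible x86 override: */
	#define atomic_add_return_original(i, v)	xadd(&(v)->counter, (i))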

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-17 19:21   ` Paul E. McKenney
  2013-12-17 19:27     ` Linus Torvalds
@ 2013-12-18 18:45     ` Waiman Long
  2013-12-18 18:59       ` Paul E. McKenney
  1 sibling, 2 replies; 26+ messages in thread
From: Waiman Long @ 2013-12-18 18:45 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On 12/17/2013 02:21 PM, Paul E. McKenney wrote:
> +/**
> + * queue_read_trylock - try to acquire read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_read_trylock(struct qrwlock *lock)
> +{
> +	union qrwcnts cnts;
> +
> +	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> +	if (likely(!cnts.writer)) {
> +		cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
> Looks like xadd() is x86-specific, but this is common code.  One
> approach would be to do xadd() for other arches, another approach
> would be to make .rw be an atomic_t rather than a u32.  Making it
> be atomic_t is probably easiest.  (The cmpxchg()s would then need
> to be atomic_cmpxchg().)
>
> Ditto for add_smp() further down.

You are right that xadd() and add_smp() are x86 specific. I will change 
the code to use atomic_t for rw as suggested.
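
Roughly something like the following (sketch only -- whether the endian
union stays wrapped around the atomic_t is a detail I still need to
work out):

	typedef struct qrwlock {
		atomic_t	cnts;	/* writer state + reader count	*/
		struct qrwnode	*waitq;	/* tail of the waiting queue	*/
	} arch_rwlock_t;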

> +
> +/**
> + * queue_write_unlock - release write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> +	/*
> +	 * Make sure that none of the critical section will be leaked out.
> +	 */
> +	smp_mb__before_clear_bit();
> +	ACCESS_ONCE(lock->cnts.writer) = 0;
> +	smp_mb__after_clear_bit();
> Interesting combination...  It does seem to work out, though.

They are used for the time being. They will be changed to better 
alternatives like smp_store_release() once they are available.

>> +++ b/kernel/locking/qrwlock.c
>> @@ -0,0 +1,265 @@
>> +/*
>> + * Queue read/write lock
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
>> + *
>> + * Authors: Waiman Long<waiman.long@hp.com>
>> + */
>> +#include<linux/smp.h>
>> +#include<linux/bug.h>
>> +#include<linux/cpumask.h>
>> +#include<linux/percpu.h>
>> +#include<linux/hardirq.h>
>> +#include<asm-generic/qrwlock.h>
>> +
>> +/*
>> + * Compared with regular rwlock, the queue rwlock has the following
>> + * advantages:
>> + * 1. Even though there is a slight chance of stealing the lock if come at
>> + *    the right moment, the granting of the lock is mostly in FIFO order.
>> + * 2. It is usually faster in high contention situation.
>> + *
>> + * The only downside is that the lock is 4 bytes larger in 32-bit systems
>> + * and 12 bytes larger in 64-bit systems.
>> + *
>> + * There are two queues for writers. The writer field of the lock is a
>> + * one-slot wait queue. The writers that follow will have to wait in the
>> + * combined reader/writer queue (waitq).
>> + *
>> + * Compared with x86 ticket spinlock, the queue rwlock is faster in high
>> + * contention situation. The writer lock is also faster in single thread
>> + * operations. Therefore, queue rwlock can be considered as a replacement
>> + * for those spinlocks that are highly contended as long as an increase
>> + * in lock size is not an issue.
> Judging from the #defines at the end of qrwlock.h, this replacement is
> done on a file-by-file basis?  Looks to me more like the replacement
> must be all-or-nothing across the entire kernel, otherwise trouble would
> ensue for locks used across multiple files.  What am I missing here?

Yes, it is either all or nothing. Activating queue rwlock will replace 
the existing implementation.

>> + */
>> +
>> +#ifndef arch_mutex_cpu_relax
>> +# define arch_mutex_cpu_relax() cpu_relax()
>> +#endif
>> +
>> +#ifndef smp_mb__load_acquire
>> +# ifdef CONFIG_X86
>> +#   define smp_mb__load_acquire()	barrier()
>> +# else
>> +#   define smp_mb__load_acquire()	smp_mb()
>> +# endif
>> +#endif
>> +
>> +#ifndef smp_mb__store_release
>> +# ifdef CONFIG_X86
>> +#   define smp_mb__store_release()	barrier()
>> +# else
>> +#   define smp_mb__store_release()	smp_mb()
>> +# endif
>> +#endif
> These are now smp_load_acquire() and smp_store_release().

Will change that in the next version.

>> +
>> +/**
>> + * wait_in_queue - Add to queue and wait until it is at the head
>> + * @lock: Pointer to queue rwlock structure
>> + * @node: Node pointer to be added to the queue
>> + */
>> +static __always_inline void
>> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *prev;
>> +
>> +	node->next = NULL;
>> +	node->wait = true;
>> +	prev = xchg(&lock->waitq, node);
>> +	if (prev) {
>> +		prev->next = node;
>> +		/*
>> +		 * Wait until the waiting flag is off
>> +		 */
>> +		while (ACCESS_ONCE(node->wait))
>> +			arch_mutex_cpu_relax();
>> +		smp_mb__load_acquire();
> 		while (smp_load_acquire(&node->wait))
> 			arch_mutex_cpu_relax();
>
> On TSO systems like x86, this should generate the same code.

OK, I can change that too.

>> +	}
>> +}
>> +
>> +/**
>> + * signal_next - Signal the next one in queue to be at the head
>> + * @lock: Pointer to queue rwlock structure
>> + * @node: Node pointer to the current head of queue
>> + */
>> +static __always_inline void
>> +signal_next(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *next;
>> +
>> +	/*
>> +	 * Try to notify the next node first without disturbing the cacheline
>> +	 * of the lock. If that fails, check to see if it is the last node
>> +	 * and so should clear the wait queue.
>> +	 */
>> +	next = ACCESS_ONCE(node->next);
>> +	if (likely(next))
>> +		goto notify_next;
>> +
>> +	/*
>> +	 * Clear the wait queue if it is the last node
>> +	 */
>> +	if ((ACCESS_ONCE(lock->waitq) == node)&&
>> +	    (cmpxchg(&lock->waitq, node, NULL) == node))
>> +			return;
>> +	/*
>> +	 * Wait until the next one in queue set up the next field
>> +	 */
>> +	while (likely(!(next = ACCESS_ONCE(node->next))))
>> +		arch_mutex_cpu_relax();
>> +	/*
>> +	 * The next one in queue is now at the head
>> +	 */
>> +notify_next:
>> +	smp_mb__store_release();
>> +	ACCESS_ONCE(next->wait) = false;
> The above pair of lines can be simply:
>
> 	smp_store_release(&next->wait, false);
>
> This pairs nicely with the smp_load_acquire() in wait_in_queue().

Will do.

>> +}
>> +
>> +/**
>> + * rspin_until_writer_unlock - inc reader count&  spin until writer is gone
>> + * @lock: Pointer to queue rwlock structure
>> + * @cnts: Current queue rwlock counts structure
>> + *
>> + * In interrupt context or at the head of the queue, the reader will just
>> + * increment the reader count&  wait until the writer releases the lock.
>> + */
>> +static __always_inline void
>> +rspin_until_writer_unlock(struct qrwlock *lock, union qrwcnts cnts)
>> +{
>> +	while (cnts.writer == _QW_LOCKED) {
>> +		arch_mutex_cpu_relax();
>> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
>> +	}
>> +}
>> +
>> +/**
>> + * queue_read_lock_slowpath - acquire read lock of a queue rwlock
>> + * @lock: Pointer to queue rwlock structure
>> + */
>> +void queue_read_lock_slowpath(struct qrwlock *lock)
>> +{
>> +	struct qrwnode node;
>> +	union qrwcnts cnts;
>> +
>> +	/*
>> +	 * Readers come here when it cannot get the lock without waiting
> Grammar nit: s/it/they/
> Or s/Readers come/Each reader comes/

Will fix that.

>> +	 */
>> +	if (unlikely(irq_count())) {
>> +		/*
>> +		 * Readers in interrupt context will spin until the lock is
>> +		 * available without waiting in the queue.
>> +		 */
>> +		cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> This needs to be:
>
> 		cnts.rw = smp_load_acquire(&lock->cnts.rw);
>
> I was going to argue that the above assignment should be pushed into
> rspin_until_writer_unlock(), but I see that this won't work for the
> call later in this function.  ;-)
>

Will make the change.

>> +		rspin_until_writer_unlock(lock, cnts);
> We do need a memory barrier in this path, otherwise we are not guaranteed
> to see the writer's critical section.  One approach would be to make
> rspin_until_writer_unlock()s "while" loop body do:
>
> 		arch_mutex_cpu_relax();
> 		cnts.rw = smp_load_acquire(&lock->cnts.rw);

Yes, Will do.

>> +		return;
>> +	}
>> +	add_smp(&lock->cnts.rw, -_QR_BIAS);
>> +
>> +	/*
>> +	 * Put the reader into the wait queue
>> +	 */
>> +	wait_in_queue(lock,&node);
>> +
>> +	/*
>> +	 * At the head of the wait queue now, wait until the writer state
>> +	 * goes to 0 and then try to increment the reader count and get
>> +	 * the lock.
>> +	 */
>> +	while (ACCESS_ONCE(lock->cnts.writer))
>> +		arch_mutex_cpu_relax();
>> +	cnts.rw = xadd(&lock->cnts.rw, _QR_BIAS);
>> +	rspin_until_writer_unlock(lock, cnts);
>> +	/*
>> +	 * Need to have a barrier with read-acquire semantics
>> +	 */
>> +	smp_mb__load_acquire();
> Making rspin_until_writer_unlock() do an smp_load_acquire() makes this
> unnecessary.

Yes.

>> +	signal_next(lock,&node);
> Good, this allows multiple readers to acquire the lock concurrently,
> give or take memory latency compared to critical-section duration.
> When the first writer shows up, it presumably spins on the lock word.
>

Yes, that was the intention. The first writer that shows up will block 
succeeding readers from getting the lock.

BTW, what was the status of the TSO memory barrier patch? This patch has 
some partial dependency on it.

-Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-17 19:27     ` Linus Torvalds
  2013-12-17 19:49       ` Paul E. McKenney
@ 2013-12-18 18:51       ` Waiman Long
  2013-12-18 19:38       ` Andi Kleen
  2 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2013-12-18 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Steven Rostedt, Andrew Morton, Michel Lespinasse,
	Andi Kleen, Rik van Riel, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton

On 12/17/2013 02:27 PM, Linus Torvalds wrote:
> On Tue, Dec 17, 2013 at 11:21 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com>  wrote:
>> Looks like xadd() is x86-specific, but this is common code.  One
>> approach would be to do xadd() for other arches, another approach
>> would be to make .rw be an atomic_t rather than a u32.  Making it
>> be atomic_t is probably easiest.  (The cmpxchg()s would then need
>> to be atomic_cmpxchg().)
> Note that "xadd()" has different semantics from "atomic_add_return()".
>
> xadd() returns the original value, while atomic_add_return() returns
> the result of the addition.
>
> In this case, we seem to want the xadd() semantics. I guess we can use
> "atomic_add_return(val,&atomic)-val" and just assume that the compiler
> gets it right (with the addition and the subtraction cancelling out).
> Or maybe we should have a "atomic_add_return_original()" with xadd
> semantics?
>
>                 Linus

I will use the atomic_add_return() for the combined count. Actually what 
the code is looking for in the returned value is the state of the writer 
byte. So it doesn't really matter if _QR_BIAS has or hasn't been added 
to the count.
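
I.e. something along these lines (sketch; _QW_WMASK standing in for a
0xff mask over the writer byte):

	cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
	/*
	 * _QR_BIAS never touches the low byte, so the writer state is the
	 * same whether we look at the pre-add or the post-add value.
	 */
	if (likely(!(cnts & _QW_WMASK)))
		return 1;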

-Longman

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-18 18:45     ` Waiman Long
@ 2013-12-18 18:59       ` Paul E. McKenney
  2013-12-18 19:16         ` Peter Zijlstra
  1 sibling, 2 replies; 26+ messages in thread
From: Paul E. McKenney @ 2013-12-18 18:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On Wed, Dec 18, 2013 at 01:45:01PM -0500, Waiman Long wrote:
> On 12/17/2013 02:21 PM, Paul E. McKenney wrote:

[ . . . ]

> >>+	signal_next(lock,&node);
> >Good, this allows multiple readers to acquire the lock concurrently,
> >give or take memory latency compared to critical-section duration.
> >When the first writer shows up, it presumably spins on the lock word.
> >
> 
> Yes, that was the intention. The first writer that shows up will
> block succeeding readers from getting the lock.
> 
> BTW, what was the status of the TSO memory barrier patch? This patch
> has some partial dependency on it.

I am hoping that it makes it into -tip soon, as I also have a patch that
depends on it.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-18 18:59       ` Paul E. McKenney
@ 2013-12-18 19:16         ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2013-12-18 19:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt,
	Andrew Morton, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Linus Torvalds, Raghavendra K T, George Spelvin, Tim Chen, aswin,
	Scott J Norton

On Wed, Dec 18, 2013 at 10:59:10AM -0800, Paul E. McKenney wrote:
> I am hoping that it makes it into -tip soon, as I also have a patch that
> depends on it.

Just sent out the hopefully final version.

lkml.kernel.org/r/20131218190806.370008594@infradead.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-17 19:27     ` Linus Torvalds
  2013-12-17 19:49       ` Paul E. McKenney
  2013-12-18 18:51       ` Waiman Long
@ 2013-12-18 19:38       ` Andi Kleen
  2013-12-18 19:42         ` Andi Kleen
  2013-12-18 19:46         ` Peter Zijlstra
  2 siblings, 2 replies; 26+ messages in thread
From: Andi Kleen @ 2013-12-18 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Waiman Long, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Steven Rostedt, Andrew Morton, Michel Lespinasse,
	Rik van Riel, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran, Scott J Norton

Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> In this case, we seem to want the xadd() semantics. I guess we can use
> "atomic_add_return(val,&atomic)-val" and just assume that the compiler
> gets it right (with the addition and the subtraction cancelling out).

It can't, as we use inline assembler. 

In theory it could with the __sync_* or __atomic_* intrinsics, but we
traditionally don't use them as they don't support the patchable LOCK
prefix.

(maybe it would be time to get rid of the patchable LOCK though?)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-18 19:38       ` Andi Kleen
@ 2013-12-18 19:42         ` Andi Kleen
  2013-12-18 19:46         ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2013-12-18 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Waiman Long, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Arnd Bergmann, linux-arch@vger.kernel.org,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Steven Rostedt, Andrew Morton, Michel Lespinasse,
	Rik van Riel, Raghavendra K T, George Spelvin, Tim Chen,
	Aswin Chandramouleeswaran, Scott J Norton

Andi Kleen <andi@firstfloor.org> writes:
>
> It can't, as we use inline assembler. 

OK, never mind. Actually it works, of course, as the add is still in C.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 26+ messages in thread
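
As a small stand-alone illustration of the point Linus and Andi are making
(using C11 atomics rather than the kernel's inline-asm xadd; the helper names
below are made up for the example): an add-and-return primitive hands back the
new value, and subtracting the addend again in plain C recovers the old value,
i.e. the wanted xadd()/fetch-add semantics, with the compiler free to fold the
subtraction even though the underlying add is not visible to it.

#include <assert.h>
#include <stdatomic.h>

/* add_return(): mimics an atomic_add_return()-style primitive,
 * i.e. it returns the *new* value after the addition. */
static inline int add_return(atomic_int *v, int i)
{
	return atomic_fetch_add(v, i) + i;
}

/* Fetch-and-add (xadd) semantics built on top of add_return():
 * returns the *old* value by cancelling the addend back out. */
static inline int xadd_like(atomic_int *v, int i)
{
	return add_return(v, i) - i;
}

int main(void)
{
	atomic_int counter = 5;

	assert(xadd_like(&counter, 3) == 5);	/* old value comes back */
	assert(atomic_load(&counter) == 8);	/* new value is stored  */
	return 0;
}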

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-18 19:38       ` Andi Kleen
  2013-12-18 19:42         ` Andi Kleen
@ 2013-12-18 19:46         ` Peter Zijlstra
  2013-12-18 20:16           ` Andi Kleen
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2013-12-18 19:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Paul McKenney, Waiman Long, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch@vger.kernel.org, the arch/x86 maintainers,
	Linux Kernel Mailing List, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Rik van Riel, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton

On Wed, Dec 18, 2013 at 11:38:29AM -0800, Andi Kleen wrote:
> (maybe it would be time to get rid of the patchable LOCK though?)

With the argument that Intel simply doesn't ship UP chips anymore, with
the exception of quark which should probably run custom UP kernels due
to size constraints anyway?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/4] qrwlock: A queue read/write lock implementation
  2013-12-18 19:46         ` Peter Zijlstra
@ 2013-12-18 20:16           ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2013-12-18 20:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Paul McKenney, Waiman Long, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	linux-arch@vger.kernel.org, the arch/x86 maintainers,
	Linux Kernel Mailing List, Steven Rostedt, Andrew Morton,
	Michel Lespinasse, Rik van Riel, Raghavendra K T, George Spelvin,
	Tim Chen, Aswin Chandramouleeswaran, Scott J Norton

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Dec 18, 2013 at 11:38:29AM -0800, Andi Kleen wrote:
>> (maybe it would be time to get rid of the patchable LOCK though?)
>
> With the argument that Intel simply doesn't ship UP chips anymore, with
> the exception of quark which should probably run custom UP kernels due
> to size constraints anyway?

That, and:

- Anything with a single core only has very fast LOCK
- LOCK generally became much faster everywhere.
- The original reason was for single cpu VM guests, but even those
  should increasingly have at least two VCPUs.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 26+ messages in thread
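
For context, a deliberately simplified sketch of what "patchable LOCK" refers
to (the macro and function names below are invented for illustration): SMP
builds emit the LOCK prefix and UP builds can drop it; the real arch/x86 code
goes further and records each prefix location so an SMP kernel that boots on a
single CPU can patch the prefix to a NOP at runtime via the alternatives
mechanism, which is the machinery being questioned above.

/* x86-only sketch: compile-time choice of the LOCK prefix.  The real
 * kernel additionally tracks every such prefix so it can be patched
 * out at boot on a uniprocessor machine. */
#ifdef CONFIG_SMP
# define MY_LOCK_PREFIX "lock; "
#else
# define MY_LOCK_PREFIX ""
#endif

static inline void my_locked_inc(volatile unsigned int *p)
{
	/* Becomes "lock; incl" on SMP builds, plain "incl" otherwise. */
	asm volatile(MY_LOCK_PREFIX "incl %0" : "+m" (*p) : : "memory");
}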

end of thread, other threads:[~2013-12-18 20:16 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-22 19:04 [PATCH v7 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
2013-11-22 19:04 ` [PATCH v7 1/4] qrwlock: A " Waiman Long
2013-11-22 19:14   ` Linus Torvalds
2013-11-22 20:35     ` Waiman Long
2013-11-22 21:14       ` Linus Torvalds
2013-12-17 19:21   ` Paul E. McKenney
2013-12-17 19:27     ` Linus Torvalds
2013-12-17 19:49       ` Paul E. McKenney
2013-12-18 18:51       ` Waiman Long
2013-12-18 19:38       ` Andi Kleen
2013-12-18 19:42         ` Andi Kleen
2013-12-18 19:46         ` Peter Zijlstra
2013-12-18 20:16           ` Andi Kleen
2013-12-18 18:45     ` Waiman Long
2013-12-18 18:45       ` Waiman Long
2013-12-18 18:59       ` Paul E. McKenney
2013-12-18 19:16         ` Peter Zijlstra
2013-11-22 19:04 ` [PATCH v7 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
2013-11-22 19:04   ` Waiman Long
2013-12-17 19:22   ` Paul E. McKenney
2013-11-22 19:04 ` [PATCH v7 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
2013-12-17 19:23   ` Paul E. McKenney
2013-11-22 19:04 ` [PATCH v7 4/4] qrwlock: Use smp_store_release() in write_unlock() Waiman Long
2013-12-17 19:24   ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).