public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/4] futex: Wakeup optimizations
@ 2013-12-03  9:45 Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 1/4] futex: Misc cleanups Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-03  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
 - Removed patch "futex: Check for pi futex_q only once".

 - Cleaned up ifdefs for larger hash table.

 - Added a doc patch from tglx that describes the futex 
   ordering guarantees.

 - Improved the lockless plist check for the wake calls.
   Based on community feedback, the necessary abstractions
   and barriers were added to maintain the ordering guarantees.
   Code documentation is also updated.

 - Removed patch "sched,futex: Provide delayed wakeup list".
   Based on feedback from PeterZ, I will look into this as
   a separate issue once the other patches are settled.
 

We have been dealing with a customer database workload on a large
12TB, 240-core, 16-socket NUMA system that exhibits high amounts
of contention on some of the locks that serialize the internal futex
data structures. This workload especially suffers in the wakeup
paths, where waiting on the corresponding hb->lock can account for
up to ~60% of the time. Such calls mostly fall into two cases:
(i) there is nothing to wake up, and (ii) a large number of tasks
must be woken.

Before these patches are applied, we can see this pathological behavior:

 37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--97.14%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.70%-- 0x7f383fbdea1f
             |          |           yyy

 43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--53.74%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.40%-- 0x7fe7d44a4c05
             |          |           zzz
             |--45.90%-- futex_wait_setup
             |          futex_wait
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          0x7fe7ba315789
             |          syscall


With these patches, contention is practically non-existent:

 0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
               --- _raw_spin_lock
                |
                |--76.06%-- futex_wait_setup
                |          futex_wait
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--99.90%-- 0x7f3165e63789
                |          |          syscall
                           ...
                |--6.27%-- futex_wake
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--54.56%-- 0x7f317fff2c05
                ...

Patch 1 is a cleanup.

Patch 2 addresses the well-known issue of the global hash table.
By creating a larger and NUMA-aware table, we reduce false sharing
and collisions, and thus the chance of different futexes contending
on the same hb->lock.

Patch 3 documents the futex ordering guarantees.

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly addresses point (i) above, where it is not uncommon for
up to 90% of wakeup calls to return 0, indicating that no tasks were woken.

This patchset has also been tested on smaller systems against a variety of
benchmarks, including Java workloads, kernel builds and custom
bang-the-hell-out-of-hb-locks programs. So far, no functional or performance
regressions have been seen. Furthermore, no issues were found when running
the different tests in the futextest suite:
http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc2 (2e7babfa).

Special thanks to Scott Norton, Tom Vaden, Mark Ray and Aswin Chandramouleeswaran
for help presenting, debugging and analyzing the data.

  futex: Misc cleanups
  futex: Larger hash table
  futex: Document ordering guarantees
  futex: Avoid taking hb lock if nothing to wakeup

 kernel/futex.c | 230 ++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 194 insertions(+), 36 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/4] futex: Misc cleanups
  2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
@ 2013-12-03  9:45 ` Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 2/4] futex: Larger hash table Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-03  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

From: Jason Low <jason.low2@hp.com>

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 39 ++++++++++++---------------------------
 1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086..e6ffe73 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -597,13 +597,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 {
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	struct task_struct *p;
 	pid_t pid = uval & FUTEX_TID_MASK;
 
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex(&this->key, key)) {
 			/*
 			 * Another waiter already exists - bump up
@@ -985,7 +982,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
 
@@ -998,9 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 
 	hb = hash_futex(&key);
 	spin_lock(&hb->lock);
-	head = &hb->chain;
 
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex (&this->key, &key)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1033,7 +1028,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret;
 
@@ -1081,9 +1075,7 @@ retry_private:
 		goto retry;
 	}
 
-	head = &hb1->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (match_futex (&this->key, &key1)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1096,10 +1088,8 @@ retry_private:
 	}
 
 	if (op_ret > 0) {
-		head = &hb2->chain;
-
 		op_ret = 0;
-		plist_for_each_entry_safe(this, next, head, list) {
+		plist_for_each_entry_safe(this, next, &hb2->chain, list) {
 			if (match_futex (&this->key, &key2)) {
 				if (this->pi_state || this->rt_waiter) {
 					ret = -EINVAL;
@@ -1269,7 +1259,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	int drop_count = 0, task_count = 0, ret;
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head1;
 	struct futex_q *this, *next;
 	u32 curval2;
 
@@ -1392,8 +1381,7 @@ retry_private:
 		}
 	}
 
-	head1 = &hb1->chain;
-	plist_for_each_entry_safe(this, next, head1, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (task_count - nr_wake >= nr_requeue)
 			break;
 
@@ -1493,7 +1481,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 }
 
 static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
@@ -1866,7 +1854,7 @@ retry_private:
 	ret = get_futex_value_locked(&uval, uaddr);
 
 	if (ret) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 
 		ret = get_user(uval, uaddr);
 		if (ret)
@@ -1880,7 +1868,7 @@ retry_private:
 	}
 
 	if (uval != val) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 		ret = -EWOULDBLOCK;
 	}
 
@@ -2028,7 +2016,7 @@ retry_private:
 			 * Task is exiting and we just wait for the
 			 * exit to complete.
 			 */
-			queue_unlock(&q, hb);
+			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
 			goto retry;
@@ -2080,7 +2068,7 @@ retry_private:
 	goto out_put_key;
 
 out_unlock_put_key:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 out_put_key:
 	put_futex_key(&q.key);
@@ -2090,7 +2078,7 @@ out:
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
 uaddr_faulted:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 	ret = fault_in_user_writeable(uaddr);
 	if (ret)
@@ -2112,7 +2100,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	u32 uval, vpid = task_pid_vnr(current);
 	int ret;
@@ -2152,9 +2139,7 @@ retry:
 	 * Ok, other tasks may need to be woken up - check waiters
 	 * and do the wakeup if necessary:
 	 */
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (!match_futex (&this->key, &key))
 			continue;
 		ret = wake_futex_pi(uaddr, uval, this);
-- 
1.8.1.4



* [PATCH v2 2/4] futex: Larger hash table
  2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 1/4] futex: Misc cleanups Davidlohr Bueso
@ 2013-12-03  9:45 ` Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 3/4] futex: Document ordering guarantees Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-03  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

Currently, the futex global hash table suffers from its fixed, smallish
(by today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems using many futexes can be prone to high amounts
of collisions, where these futexes hash to the same bucket and lead to
extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
reality when multiple hb->locks reside on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small systems;
otherwise it uses 256 * ncpus (or larger, as the number is rounded up to a
power of 2). Note that this number of CPUs accounts for all CPUs that can
ever be available in the system, taking into consideration things like
hotplugging. While we do impose extra overhead at bootup by making the hash
table larger, this is a one-time cost and does not overshadow the benefits
of this patch.

Furthermore, as suggested by tglx, by cache aligning the hash buckets we can
avoid access across cacheline boundaries and also avoid massive cache line
bouncing if multiple cpus are hammering away at different hash buckets which
happen to reside in the same cache line.

Also, similar to other core kernel components (pid, dcache, tcp), by using
alloc_large_system_hash() we benefit from its NUMA awareness and thus the
table is distributed among the nodes instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
path fail at futex_wait_setup(), returning -EWOULDBLOCK for large numbers of
futexes -- we see the following benefits on an 80-core, 8-socket 1TB server:

+---------+--------------------+------------------------+-----------------------+-------------------------------+
| threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
|     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
|     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
|     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
|      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
|      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
|      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
+---------+--------------------+------------------------+-----------------------+-------------------------------+

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e6ffe73..e603520 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
+#include <linux/bootmem.h>
 
 #include <asm/futex.h>
 
@@ -70,8 +71,6 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
-
 /*
  * Futex flags used to encode options to functions and preserve them across
  * restarts.
@@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
 struct futex_hash_bucket {
 	spinlock_t lock;
 	struct plist_head chain;
-};
+} ____cacheline_aligned_in_smp;
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static unsigned long __read_mostly futex_hashsize;
+
+static struct futex_hash_bucket *futex_queues;
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
 /*
@@ -2718,7 +2719,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 static int __init futex_init(void)
 {
 	u32 curval;
-	int i;
+	unsigned long i;
+
+#if CONFIG_BASE_SMALL
+	futex_hashsize = 16;
+#else
+	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+#endif
+
+	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+					       futex_hashsize, 0,
+					       futex_hashsize < 256 ? HASH_SMALL : 0,
+					       NULL, NULL, futex_hashsize, futex_hashsize);
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
@@ -2733,7 +2745,7 @@ static int __init futex_init(void)
 	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
-- 
1.8.1.4



* [PATCH v2 3/4] futex: Document ordering guarantees
  2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 1/4] futex: Misc cleanups Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 2/4] futex: Larger hash table Davidlohr Bueso
@ 2013-12-03  9:45 ` Davidlohr Bueso
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-03  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

From: Thomas Gleixner <tglx@linutronix.de>

That's essential if you want to hack on futexes.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index e603520..75719bd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,63 @@
 
 #include "locking/rtmutex_common.h"
 
+/*
+ * Basic futex operation and ordering guarantees:
+ *
+ * The waiter reads the futex value in user space and calls
+ * futex_wait(). It computes the hash bucket and acquires the hash
+ * bucket lock. After that it reads the futex user space value again
+ * and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash
+ * bucket lock and schedules.
+ *
+ * The waker side modifies the user space value of the futex and calls
+ * futex_wake(). It computes the hash bucket and acquires the hash
+ * bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
+ *
+ * Note that the spin_lock serializes waiters and wakers, so that the
+ * following scenario is avoided:
+ *
+ * CPU 0                               CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ *   futex_wait(futex, val);
+ *   uval = *futex;
+ *                                     *futex = newval;
+ *                                     sys_futex(WAKE, futex);
+ *                                       futex_wake(futex);
+ *                                       if (queue_empty())
+ *                                         return;
+ *   if (uval == val)
+ *      lock(hash_bucket(futex));
+ *      queue();
+ *     unlock(hash_bucket(futex));
+ *     schedule();
+ *
+ * This would cause the waiter on CPU 0 to wait forever because it
+ * missed the transition of the user space value from val to newval
+ * and the waker did not find the waiter in the hash bucket queue.
+ * The spinlock serializes that:
+ *
+ * CPU 0                               CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ *   futex_wait(futex, val);
+ *   lock(hash_bucket(futex));
+ *   uval = *futex;
+ *                                     *futex = newval;
+ *                                     sys_futex(WAKE, futex);
+ *                                       futex_wake(futex);
+ *                                       lock(hash_bucket(futex));
+ *   if (uval == val)
+ *      queue();
+ *     unlock(hash_bucket(futex));
+ *     schedule();                       if (!queue_empty())
+ *                                         wake_waiters(futex);
+ *                                       unlock(hash_bucket(futex));
+ */
+
 int __read_mostly futex_cmpxchg_enabled;
 
 /*
-- 
1.8.1.4



* [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2013-12-03  9:45 ` [PATCH v2 3/4] futex: Document ordering guarantees Davidlohr Bueso
@ 2013-12-03  9:45 ` Davidlohr Bueso
  2013-12-06 18:26   ` Thomas Gleixner
                     ` (5 more replies)
  2013-12-12 17:30 ` [PATCH v2 0/4] futex: Wakeup optimizations Peter Zijlstra
  2013-12-18 15:32 ` Davidlohr Bueso
  5 siblings, 6 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-03  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2,
	davidlohr

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the small cost of doing some atomic operations to keep track of
the list's size. Specifically, we increment the counter when an element is
added to the list, and decrement it when one is removed. Of course, if the
counter is 0, then there are no tasks blocked on the futex. Some special
considerations:

- Increment the counter at queue_lock(), as we always end up calling
  queue_me(), which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, where we decrement the
  counter to match the increment done in queue_lock().

- Decrement the counter at __unqueue_futex() to reflect when an element is
  removed from the queue for wakeup-related purposes.

With the new atomic counter we only enlarge the futex_hash_bucket struct
by 4 bytes on 32-bit systems; on 64-bit ones it stays the same size,
at 24 bytes.

More importantly, we make sure that the futex ordering guarantees are
preserved, ensuring that a waiter either observes the changed user space
value before blocking or is woken by a concurrent waker. This is done by
relying on the barriers in atomic_inc() -- for archs that do not have an
implicit mb in atomic_inc(), we explicitly add one through a set of new
functions that are introduced: futex_get_mm(), hb_waiters_inc() and
hb_waiters_dec(). For more details please refer to the updated comments
in the code and the related discussion: https://lkml.org/lkml/2013/11/26/556

Furthermore, the overhead of the new barriers is trivial. Below are some
results from a quad-core x86_64 laptop, measuring the latency of nthread
wakeups (1 at a time), over 1000 runs:

+----------+-----------------------------+------------------------------+
| nthreads | baseline time (ms) [stddev] | atomicops time (ms) [stddev] |
+----------+-----------------------------+------------------------------+
|      512 | 2.8360 [0.5168]             | 3.8150 [1.3293]              |
|      256 | 2.5080 [0.6375]             | 2.5980 [0.9079]              |
|      128 | 1.0200 [0.4264]             | 1.5180 [0.4902]              |
|       64 | 0.7890 [0.2667]             | 0.4020 [0.2447]              |
|       32 | 0.1150 [0.0184]             | 0.1490 [0.1156]              |
+----------+-----------------------------+------------------------------+

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 122 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 75719bd..53ce28f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -82,10 +82,12 @@
  * The waker side modifies the user space value of the futex and calls
  * futex_wake(). It computes the hash bucket and acquires the hash
  * bucket lock. Then it looks for waiters on that futex in the hash
- * bucket and wakes them.
- *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * bucket and wakes them. In scenarios where wakeups are called and no
+ * tasks are blocked on a futex, taking the hb spinlock can be avoided
+ * and simply return. In order for this optimization to work, ordering
+ * guarantees must exist so that the waiter being added to the list is
+ * acknowledged when the list is concurrently being checked by the waker,
+ * avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +108,40 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   mb(); <-- paired with ------
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              -------->   mb();
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (!queue_empty())
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * The length of the list is tracked with atomic ops (hb->waiters),
+ * providing the necessary memory barriers for the waiters. For the
+ * waker side, however, we rely on get_futex_key_refs(), using either
+ * ihold() or the atomic_inc(), for shared futexes. The former provides
+ * a full mb on all architectures. For architectures that do not have an
+ * implicit barrier in atomic_inc/dec, we explicitly add it - please
+ * refer to futex_get_mm() and hb_waiters_inc/dec().
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -203,6 +221,9 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+#ifdef CONFIG_SMP
+	atomic_t waiters;
+#endif
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -211,6 +232,65 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Reduced to a simple barrier() where the atomic_inc
+	 * has an implicit mb().
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
+	/*
+	 * Reduced to a simple barrier() where the atomic_inc
+	 * has an implicit mb().
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+	/*
+	 * Reduced to a simple barrier() where the atomic_inc
+	 * has an implicit mb().
+	 *
+	 * For non-x86 archs it's debatable whether this has
+	 * a hard requirement to be guarded. The optimized
+	 * hb_waiters_pending() check for pending wakers might
+	 * fail in rare cases, but just for the cost of a
+	 * spinlock/unlock. The consistency of hb->waiters itself
+	 * is always guaranteed, i.e. it can't go below 0.
+	 */
+	smp_mb__after_atomic_dec();
+#endif
+}
+
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
+#else
+	return 1;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -248,7 +328,7 @@ static void get_futex_key_refs(union futex_key *key)
 		ihold(key->shared.inode);
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key);
 		break;
 	}
 }
@@ -892,6 +972,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }
 
 /*
@@ -1051,6 +1132,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1071,6 +1157,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1189,7 +1276,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1532,17 +1621,28 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that is
+	 * waiting for the spinlock. This is safe as all queue_lock()
+	 * users end up calling queue_me(). Similarly, for housekeeping,
+	 * decrement the counter at queue_unlock() when some error has
+	 * occurred and we don't end up adding the task to the list.
+	 */
+	hb_waiters_inc(hb);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
 	return hb;
 }
 
-static inline void
-queue_unlock(struct futex_hash_bucket *hb)
+static inline void queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	hb_waiters_dec(hb);
 }
 
 /**
@@ -2274,6 +2374,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 		 * Unqueue the futex_q and determine which it was.
 		 */
 		plist_del(&q->list, &hb->chain);
+		hb_waiters_dec(hb);
 
 		/* Handle spurious wakeups gracefully */
 		ret = -EWOULDBLOCK;
@@ -2805,6 +2906,9 @@ static int __init futex_init(void)
 	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
+#ifdef CONFIG_SMP
+		atomic_set(&futex_queues[i].waiters, 0);
+#endif
 	}
 
 	return 0;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2013-12-06 18:26   ` Thomas Gleixner
  2013-12-06 19:04     ` Davidlohr Bueso
  2013-12-10 16:08   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2013-12-06 18:26 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, 3 Dec 2013, Davidlohr Bueso wrote:
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Who is the author of this? According to the SOB chains it's Waiman, who
handed it down to Jason and it finally got submitted by you. Is that
correct?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-06 18:26   ` Thomas Gleixner
@ 2013-12-06 19:04     ` Davidlohr Bueso
  2013-12-10 15:36       ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-06 19:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, mingo, dvhart, peterz, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Fri, 2013-12-06 at 19:26 +0100, Thomas Gleixner wrote:
> On Tue, 3 Dec 2013, Davidlohr Bueso wrote:
> > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > Signed-off-by: Jason Low <jason.low2@hp.com>
> > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> 
> Who is the author of this? According to the SOB chains it's Waiman, who
> handed it down to Jason and it finally got submitted by you. Is that
> correct?

This was a group effort for the original patchset -- things did change
quite a bit for this particular patch in v2. Waiman and Jason helped
review and test the original atomic ops code.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-06 19:04     ` Davidlohr Bueso
@ 2013-12-10 15:36       ` Ingo Molnar
  2013-12-10 16:17         ` Davidlohr Bueso
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2013-12-10 15:36 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, linux-kernel, dvhart, peterz, paulmck, efault,
	jeffm, torvalds, scott.norton, tom.vaden, aswin, Waiman.Long,
	jason.low2


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> On Fri, 2013-12-06 at 19:26 +0100, Thomas Gleixner wrote:
> > On Tue, 3 Dec 2013, Davidlohr Bueso wrote:
> > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > 
> > Who is the author of this? According to the SOB chains it's Waiman, who
> > handed it down to Jason and it finally got submitted by you. Is that
> > correct?
> 
> This was a group effort for the original patchset -- things did change
> quite a bit for this particular patch in v2. Waiman and Jason helped
> review and test the original atomic ops code.

Then please put that into the credits (in the changelog and/or in the 
source code), and/or into copyright notices, etc. - but keep the 
SOB-chain the true route of the patch, where the primary author is the 
first in the SOB and where the person sending the patch is the last 
SOB. (the two are identical normally.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2013-12-06 18:26   ` Thomas Gleixner
@ 2013-12-10 16:08   ` Peter Zijlstra
  2013-12-10 16:24   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-10 16:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, Dec 03, 2013 at 01:45:27AM -0800, Davidlohr Bueso wrote:
> With the new atomic counter we only enlarge the futex_hash_bucket struct
> by 4 bytes on 32bit systems, and it is left the same size for 64bit
> ones, at 24 bytes.

patch 2 grew the structure to an entire cacheline, so this now seems a
rather pointless comment ;-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-10 15:36       ` Ingo Molnar
@ 2013-12-10 16:17         ` Davidlohr Bueso
  2013-12-10 18:28           ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-10 16:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, linux-kernel, dvhart, peterz, paulmck, efault,
	jeffm, torvalds, scott.norton, tom.vaden, aswin, Waiman.Long,
	jason.low2

On Tue, 2013-12-10 at 16:36 +0100, Ingo Molnar wrote:
> * Davidlohr Bueso <davidlohr@hp.com> wrote:
> 
> > On Fri, 2013-12-06 at 19:26 +0100, Thomas Gleixner wrote:
> > > On Tue, 3 Dec 2013, Davidlohr Bueso wrote:
> > > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > 
> > > Who is the author of this? According to the SOB chains it's Waiman, who
> > > handed it down to Jason and it finally got submitted by you. Is that
> > > correct?
> > 
> > This was a group effort for the original patchset -- things did change
> > quite a bit for this particular patch in v2. Waiman and Jason helped
> > review and test the original atomic ops code.
> 
> Then please put that into the credits (in the changelog and/or in the 
> source code), and/or into copyright notices, etc. - but keep the 
> SOB-chain the true route of the patch, where the primary author is the 
> first in the SOB and where the person sending the patch is the last 
> SOB. (the two are identical normally.)

Ok, I wasn't aware of that primary author rule, sorry. The exact same
thing would go for patch 2 then. Is this something you want me to
correct for this particular series, or to keep in mind for the future?

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2013-12-06 18:26   ` Thomas Gleixner
  2013-12-10 16:08   ` Peter Zijlstra
@ 2013-12-10 16:24   ` Peter Zijlstra
  2013-12-10 16:57   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-10 16:24 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, Dec 03, 2013 at 01:45:27AM -0800, Davidlohr Bueso wrote:
> @@ -203,6 +221,9 @@ static const struct futex_q futex_q_init = {
>   * waiting on a futex.
>   */
>  struct futex_hash_bucket {
> +#ifdef CONFIG_SMP
> +	atomic_t waiters;
> +#endif
>  	spinlock_t lock;
>  	struct plist_head chain;
>  } ____cacheline_aligned_in_smp;

> @@ -2805,6 +2906,9 @@ static int __init futex_init(void)
>  	for (i = 0; i < futex_hashsize; i++) {
>  		plist_head_init(&futex_queues[i].chain);
>  		spin_lock_init(&futex_queues[i].lock);
> +#ifdef CONFIG_SMP
> +		atomic_set(&futex_queues[i].waiters, 0);
> +#endif
>  	}
>  
>  	return 0;

Just remove those two #ifdefs, it's not like the data structure size
matters at this point.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2013-12-10 16:24   ` Peter Zijlstra
@ 2013-12-10 16:57   ` Peter Zijlstra
  2013-12-10 17:22     ` Davidlohr Bueso
  2013-12-10 17:15   ` Peter Zijlstra
  2013-12-10 18:10   ` [PATCH v3 " Davidlohr Bueso
  5 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-10 16:57 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, Dec 03, 2013 at 01:45:27AM -0800, Davidlohr Bueso wrote:
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -82,10 +82,12 @@
>   * The waker side modifies the user space value of the futex and calls
>   * futex_wake(). It computes the hash bucket and acquires the hash
>   * bucket lock. Then it looks for waiters on that futex in the hash
> - * bucket and wakes them.
> - *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * bucket and wakes them.

Why not let this be the start of a new paragraph?

> In scenarios where wakeups are called and no
> + * tasks are blocked on a futex, taking the hb spinlock can be avoided
> + * and simply return. In order for this optimization to work, ordering
> + * guarantees must exist so that the waiter being added to the list is
> + * acknowledged when the list is concurrently being checked by the waker,
> + * avoiding scenarios like the following:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
> @@ -106,24 +108,40 @@
>   * This would cause the waiter on CPU 0 to wait forever because it
>   * missed the transition of the user space value from val to newval
>   * and the waker did not find the waiter in the hash bucket queue.
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
>   * sys_futex(WAIT, futex, val);
>   *   futex_wait(futex, val);
> + *
> + *   mb(); <-- paired with ------
> + *                              |
> + *   lock(hash_bucket(futex));  |
> + *                              |
> + *   uval = *futex;             |
> + *                              |        *futex = newval;
> + *                              |        sys_futex(WAKE, futex);
> + *                              |          futex_wake(futex);
> + *                              |
> + *                              -------->   mb();
>   *   if (uval == val)
> + *     queue();
>   *     unlock(hash_bucket(futex));
> + *     schedule();                         if (!queue_empty())
> + *                                           lock(hash_bucket(futex));
> + *                                           wake_waiters(futex);
> + *                                           unlock(hash_bucket(futex));
> + *
> + * The length of the list is tracked with atomic ops (hb->waiters),
> + * providing the necessary memory barriers for the waiters. For the
> + * waker side, however, we rely on get_futex_key_refs(), using either
> + * ihold() or the atomic_inc(), for shared futexes. The former provides
> + * a full mb on all architectures. For architectures that do not have an
> + * implicit barrier in atomic_inc/dec, we explicitly add it - please
> + * refer to futex_get_mm() and hb_waiters_inc/dec().
>   */

This comment actually confuses me :/

It isn't at all explained what purpose the memory barriers serve.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
                     ` (3 preceding siblings ...)
  2013-12-10 16:57   ` Peter Zijlstra
@ 2013-12-10 17:15   ` Peter Zijlstra
  2013-12-10 17:36     ` Davidlohr Bueso
  2013-12-10 18:10   ` [PATCH v3 " Davidlohr Bueso
  5 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-10 17:15 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2


---
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -82,12 +82,13 @@
  * The waker side modifies the user space value of the futex and calls
  * futex_wake(). It computes the hash bucket and acquires the hash
  * bucket lock. Then it looks for waiters on that futex in the hash
- * bucket and wakes them. In scenarios where wakeups are called and no
- * tasks are blocked on a futex, taking the hb spinlock can be avoided
- * and simply return. In order for this optimization to work, ordering
- * guarantees must exist so that the waiter being added to the list is
- * acknowledged when the list is concurrently being checked by the waker,
- * avoiding scenarios like the following:
+ * bucket and wakes them.
+ *
+ * In scenarios where wakeups are called and no tasks are blocked on a futex,
+ * taking the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -108,6 +109,7 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
+ *
  * The correct serialization ensures that a waiter either observes
  * the changed user space value before blocking or is woken by a
  * concurrent waker:
@@ -117,7 +119,8 @@
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
  *
- *   mb(); <-- paired with ------
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
  *                              |
  *   lock(hash_bucket(futex));  |
  *                              |
@@ -126,22 +129,29 @@
  *                              |        sys_futex(WAKE, futex);
  *                              |          futex_wake(futex);
  *                              |
- *                              -------->   mb();
+ *                              `------->   mb(); (B)
  *   if (uval == val)
  *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                         if (!queue_empty())
+ *     schedule();                         if (waiters)
  *                                           lock(hash_bucket(futex));
  *                                           wake_waiters(futex);
  *                                           unlock(hash_bucket(futex));
  *
- * The length of the list is tracked with atomic ops (hb->waiters),
- * providing the necessary memory barriers for the waiters. For the
- * waker side, however, we rely on get_futex_key_refs(), using either
- * ihold() or the atomic_inc(), for shared futexes. The former provides
- * a full mb on all architectures. For architectures that do not have an
- * implicit barrier in atomic_inc/dec, we explicitly add it - please
- * refer to futex_get_mm() and hb_waiters_inc/dec().
+ * Where (A) orders the waiters increment and the futex value read; and
+ * where (B) orders the write to futex and the waiters read.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ * 	X = Y = 0
+ *
+ *  	w[X]=1		w[Y]=1
+ * 	MB		MB
+ * 	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -221,9 +231,7 @@ static const struct futex_q futex_q_init
  * waiting on a futex.
  */
 struct futex_hash_bucket {
-#ifdef CONFIG_SMP
 	atomic_t waiters;
-#endif
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -237,8 +245,9 @@ static inline void futex_get_mm(union fu
 	atomic_inc(&key->private.mm->mm_count);
 #ifdef CONFIG_SMP
 	/*
-	 * Reduced to a simple barrier() where the atomic_inc
-	 * has an implicit mb().
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon as full
+	 * barrier (B), see the ordering comment above.
 	 */
 	smp_mb__after_atomic_inc();
 #endif
@@ -252,8 +261,7 @@ static inline void hb_waiters_inc(struct
 #ifdef CONFIG_SMP
 	atomic_inc(&hb->waiters);
 	/*
-	 * Reduced to a simple barrier() where the atomic_inc
-	 * has an implicit mb().
+	 * Full barrier (A), see the ordering comment above.
 	 */
 	smp_mb__after_atomic_inc();
 #endif
@@ -267,18 +275,6 @@ static inline void hb_waiters_dec(struct
 {
 #ifdef CONFIG_SMP
 	atomic_dec(&hb->waiters);
-	/*
-	 * Reduced to a simple barrier() where the atomic_inc
-	 * has an implicit mb().
-	 *
-	 * For non-x86 archs it's debatable whether this has
-	 * a hard requirement to be guarded. The optimized
-	 * hb_waiters_pending() check for pending wakers might
-	 * fail in rare cases, but just for the cost of a
-	 * spinlock/unlock. The consistency of hb->waiters itself
-	 * is always guaranteed, i.e. it can't go below 0.
-	 */
-	smp_mb__after_atomic_dec();
 #endif
 }
 
@@ -317,6 +313,8 @@ static inline int match_futex(union fute
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
+ * Implies a full memory barrier; relied upon as (B), see the comment above
+ * about ordering.
  */
 static void get_futex_key_refs(union futex_key *key)
 {
@@ -325,10 +323,10 @@ static void get_futex_key_refs(union fut
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB */
 		break;
 	case FUT_OFF_MMSHARED:
-		futex_get_mm(key);
+		futex_get_mm(key); /* implies MB */
 		break;
 	}
 }
@@ -372,6 +370,8 @@ static void drop_futex_key_refs(union fu
  * We can usually work out the index without swapping in the page.
  *
  * lock_page() might sleep, the caller should not hold a spinlock.
+ *
+ * Implies a full memory barrier (B).
  */
 static int
 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
@@ -401,7 +401,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 			return -EFAULT;
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key); /* implies MB (B) */
 		return 0;
 	}
 
@@ -508,7 +508,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
 	unlock_page(page_head);
@@ -2904,11 +2904,9 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < futex_hashsize; i++) {
-		plist_head_init(&futex_queues[i].chain);
-		spin_lock_init(&futex_queues[i].lock);
-#ifdef CONFIG_SMP
 		atomic_set(&futex_queues[i].waiters, 0);
-#endif
+		spin_lock_init(&futex_queues[i].lock);
+		plist_head_init(&futex_queues[i].chain);
 	}
 
 	return 0;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-10 16:57   ` Peter Zijlstra
@ 2013-12-10 17:22     ` Davidlohr Bueso
  2013-12-10 17:34       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-10 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, 2013-12-10 at 17:57 +0100, Peter Zijlstra wrote:
> On Tue, Dec 03, 2013 at 01:45:27AM -0800, Davidlohr Bueso wrote:
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -82,10 +82,12 @@
> >   * The waker side modifies the user space value of the futex and calls
> >   * futex_wake(). It computes the hash bucket and acquires the hash
> >   * bucket lock. Then it looks for waiters on that futex in the hash
> > - * bucket and wakes them.
> > - *
> > - * Note that the spin_lock serializes waiters and wakers, so that the
> > - * following scenario is avoided:
> > + * bucket and wakes them.
> 
> Why not let this be the start of a new paragraph?
> 
> > In scenarios where wakeups are called and no
> > + * tasks are blocked on a futex, taking the hb spinlock can be avoided
> > + * and simply return. In order for this optimization to work, ordering
> > + * guarantees must exist so that the waiter being added to the list is
> > + * acknowledged when the list is concurrently being checked by the waker,
> > + * avoiding scenarios like the following:
> >   *
> >   * CPU 0                               CPU 1
> >   * val = *futex;
> > @@ -106,24 +108,40 @@
> >   * This would cause the waiter on CPU 0 to wait forever because it
> >   * missed the transition of the user space value from val to newval
> >   * and the waker did not find the waiter in the hash bucket queue.
> > + * The correct serialization ensures that a waiter either observes
> > + * the changed user space value before blocking or is woken by a
> > + * concurrent waker:
> >   *
> >   * CPU 0                               CPU 1
> >   * val = *futex;
> >   * sys_futex(WAIT, futex, val);
> >   *   futex_wait(futex, val);
> > + *
> > + *   mb(); <-- paired with ------
> > + *                              |
> > + *   lock(hash_bucket(futex));  |
> > + *                              |
> > + *   uval = *futex;             |
> > + *                              |        *futex = newval;
> > + *                              |        sys_futex(WAKE, futex);
> > + *                              |          futex_wake(futex);
> > + *                              |
> > + *                              -------->   mb();
> >   *   if (uval == val)
> > + *     queue();
> >   *     unlock(hash_bucket(futex));
> > + *     schedule();                         if (!queue_empty())
> > + *                                           lock(hash_bucket(futex));
> > + *                                           wake_waiters(futex);
> > + *                                           unlock(hash_bucket(futex));
> > + *
> > + * The length of the list is tracked with atomic ops (hb->waiters),
> > + * providing the necessary memory barriers for the waiters. For the
> > + * waker side, however, we rely on get_futex_key_refs(), using either
> > + * ihold() or the atomic_inc(), for shared futexes. The former provides
> > + * a full mb on all architectures. For architectures that do not have an
> > + * implicit barrier in atomic_inc/dec, we explicitly add it - please
> > + * refer to futex_get_mm() and hb_waiters_inc/dec().
> >   */
> 
> This comment actually confuses me :/

Well that explains where the required barriers are coming from.

> 
> It isn't at all explained what purpose the memory barriers serve.

Why doesn't this explain it?

"The correct serialization ensures that a waiter either observes
the changed user space value before blocking or is woken by a
concurrent waker."

Perhaps adding an example?
 plist_add()    |    uaddr = newval
 smp_mb()       |    smp_mb()
 verify uaddr   |    plist_head_empty()

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-10 17:22     ` Davidlohr Bueso
@ 2013-12-10 17:34       ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-10 17:34 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, Dec 10, 2013 at 09:22:08AM -0800, Davidlohr Bueso wrote:
> On Tue, 2013-12-10 at 17:57 +0100, Peter Zijlstra wrote:
> > > @@ -106,24 +108,40 @@
> > >   * This would cause the waiter on CPU 0 to wait forever because it
> > >   * missed the transition of the user space value from val to newval
> > >   * and the waker did not find the waiter in the hash bucket queue.
> > > + * The correct serialization ensures that a waiter either observes
> > > + * the changed user space value before blocking or is woken by a
> > > + * concurrent waker:
> > >   *
> > >   * CPU 0                               CPU 1
> > >   * val = *futex;
> > >   * sys_futex(WAIT, futex, val);
> > >   *   futex_wait(futex, val);
> > > + *
> > > + *   mb(); <-- paired with ------
> > > + *                              |
> > > + *   lock(hash_bucket(futex));  |
> > > + *                              |
> > > + *   uval = *futex;             |
> > > + *                              |        *futex = newval;
> > > + *                              |        sys_futex(WAKE, futex);
> > > + *                              |          futex_wake(futex);
> > > + *                              |
> > > + *                              -------->   mb();
> > >   *   if (uval == val)
> > > + *     queue();
> > >   *     unlock(hash_bucket(futex));
> > > + *     schedule();                         if (!queue_empty())
> > > + *                                           lock(hash_bucket(futex));
> > > + *                                           wake_waiters(futex);
> > > + *                                           unlock(hash_bucket(futex));
> > > + *
> > > + * The length of the list is tracked with atomic ops (hb->waiters),
> > > + * providing the necessary memory barriers for the waiters. For the
> > > + * waker side, however, we rely on get_futex_key_refs(), using either
> > > + * ihold() or the atomic_inc(), for shared futexes. The former provides
> > > + * a full mb on all architectures. For architectures that do not have an
> > > + * implicit barrier in atomic_inc/dec, we explicitly add it - please
> > > + * refer to futex_get_mm() and hb_waiters_inc/dec().
> > >   */

> > It isn't at all explained what purpose the memory barriers serve.
> 
> Why doesn't this explain it?

Because you failed to explain what is ordered against what and what
the resulting order guarantees.

> "The correct serialization ensures that a waiter either observes
> the changed user space value before blocking or is woken by a
> concurrent waker."

Or both. For the given case:

X = Y = 0

w[X]=1  w[Y]=1
MB	MB
r[Y]=y	r[X]=x

x==1 && y==1 is a valid result. The only invalid result is both 0.

But then we're still short of how we end up at that guarantee.

> Perhaps adding an example?
>  plist_add()    |    uaddr = newval
>  smp_mb()       |    smp_mb()
>  verify uaddr   |    plist_head_empty()

Except of course you don't actually use plist_add() and
plist_head_empty() for anything much at all, only creating more
confusion.

Just add the waiters variable and explicitly mention what you're doing.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-10 17:15   ` Peter Zijlstra
@ 2013-12-10 17:36     ` Davidlohr Bueso
  2013-12-10 18:00       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-10 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, 2013-12-10 at 18:15 +0100, Peter Zijlstra wrote:
> ---
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -82,12 +82,13 @@
>   * The waker side modifies the user space value of the futex and calls
>   * futex_wake(). It computes the hash bucket and acquires the hash
>   * bucket lock. Then it looks for waiters on that futex in the hash
> - * bucket and wakes them. In scenarios where wakeups are called and no
> - * tasks are blocked on a futex, taking the hb spinlock can be avoided
> - * and simply return. In order for this optimization to work, ordering
> - * guarantees must exist so that the waiter being added to the list is
> - * acknowledged when the list is concurrently being checked by the waker,
> - * avoiding scenarios like the following:
> + * bucket and wakes them.
> + *
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> + * taking the hb spinlock can be avoided and simply return. In order for this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
>   *
>   * CPU 0                               CPU 1
>   * val = *futex;
> @@ -108,6 +109,7 @@
>   * This would cause the waiter on CPU 0 to wait forever because it
>   * missed the transition of the user space value from val to newval
>   * and the waker did not find the waiter in the hash bucket queue.
> + *
>   * The correct serialization ensures that a waiter either observes
>   * the changed user space value before blocking or is woken by a
>   * concurrent waker:
> @@ -117,7 +119,8 @@
>   * sys_futex(WAIT, futex, val);
>   *   futex_wait(futex, val);
>   *
> - *   mb(); <-- paired with ------
> + *   waiters++;
> + *   mb(); (A) <-- paired with -.
>   *                              |
>   *   lock(hash_bucket(futex));  |
>   *                              |
> @@ -126,22 +129,29 @@
>   *                              |        sys_futex(WAKE, futex);
>   *                              |          futex_wake(futex);
>   *                              |
> - *                              -------->   mb();
> + *                              `------->   mb(); (B)
>   *   if (uval == val)
>   *     queue();
>   *     unlock(hash_bucket(futex));
> - *     schedule();                         if (!queue_empty())
> + *     schedule();                         if (waiters)
>   *                                           lock(hash_bucket(futex));
>   *                                           wake_waiters(futex);
>   *                                           unlock(hash_bucket(futex));
>   *
> - * The length of the list is tracked with atomic ops (hb->waiters),
> - * providing the necessary memory barriers for the waiters. For the
> - * waker side, however, we rely on get_futex_key_refs(), using either
> - * ihold() or the atomic_inc(), for shared futexes. The former provides
> - * a full mb on all architectures. For architectures that do not have an
> - * implicit barrier in atomic_inc/dec, we explicitly add it - please
> - * refer to futex_get_mm() and hb_waiters_inc/dec().

IMHO this text gives a nice summary instead of documenting each function
with things like '... implies MB (B)'. Anyway, I'll resend this
patch with your corrections.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-10 17:36     ` Davidlohr Bueso
@ 2013-12-10 18:00       ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-10 18:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

On Tue, Dec 10, 2013 at 09:36:36AM -0800, Davidlohr Bueso wrote:
> On Tue, 2013-12-10 at 18:15 +0100, Peter Zijlstra wrote:
> > - * The length of the list is tracked with atomic ops (hb->waiters),
> > - * providing the necessary memory barriers for the waiters. For the
> > - * waker side, however, we rely on get_futex_key_refs(), using either
> > - * ihold() or the atomic_inc(), for shared futexes. The former provides
> > - * a full mb on all architectures. For architectures that do not have an
> > - * implicit barrier in atomic_inc/dec, we explicitly add it - please
> > - * refer to futex_get_mm() and hb_waiters_inc/dec().
> 
> IMHO this text gives a nice summary instead of documenting each function
> with things like '... implies MB (B)'. Anyway, I'll resend this
> patch with your corrections.

Right, I didn't much care for that. Once you know that you need them and
what for, the actual finding of the barriers is usually a 'trivial'
matter.




* [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
                     ` (4 preceding siblings ...)
  2013-12-10 17:15   ` Peter Zijlstra
@ 2013-12-10 18:10   ` Davidlohr Bueso
  5 siblings, 0 replies; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-10 18:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size. Specifically, increment the counter when an element is
added to the list, and decrement when it is removed. Of course, if the
counter is 0, then there are no tasks blocked on a futex. Some special
considerations:

- increment the counter at queue_lock() as we always end up calling
  queue_me() which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, for which we decrement
  to match the increment done in queue_lock().

- decrement the counter at __unqueue_futex() to reflect when an element is
  removed from the queue for wakeup related purposes.

We make sure that the futex ordering guarantees are preserved, ensuring
that a waiter either observes the changed user space value before blocking
or is woken by a concurrent waker. This is done by relying on the barriers
in atomic_inc() -- for archs that do not have an implicit mb in
atomic_inc(), we explicitly add them through a set of new functions:
futex_get_mm(), hb_waiters_inc() and hb_waiters_dec(). For more details
please refer to the updated comments in the code and related discussion:
https://lkml.org/lkml/2013/11/26/556

Furthermore, the overhead of the new barriers is trivial. The following are some
results, from a quad-core x86_64 laptop, measuring the latency of nthread
wakeups (1 at a time), for 1000 runs:

+----------+-----------------------------+------------------------------+
| nthreads | baseline time (ms) [stddev] | atomicops time (ms) [stddev] |
+----------+-----------------------------+------------------------------+
|      512 | 2.8360 [0.5168]             | 3.8150 [1.3293]              |
|      256 | 2.5080 [0.6375]             | 2.5980 [0.9079]              |
|      128 | 1.0200 [0.4264]             | 1.5180 [0.4902]              |
|       64 | 0.7890 [0.2667]             | 0.4020 [0.2447]              |
|       32 | 0.1150 [0.0184]             | 0.1490 [0.1156]              |
+----------+-----------------------------+------------------------------+

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
Changes from v2: 
 - Improved ordering guarantee comments -- peterz.
 - Reordered SOB tags to reflect me as primary author.

 kernel/futex.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 122 insertions(+), 20 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 75719bd..3a3c606 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -84,8 +84,11 @@
  * bucket lock. Then it looks for waiters on that futex in the hash
  * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In scenarios where wakeups are called and no tasks are blocked on a futex,
+ * taking the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,49 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read; and
+ * where (B) orders the write to futex and the waiters read.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *	X = Y = 0
+ *
+ *	w[X]=1		w[Y]=1
+ *	MB		MB
+ *	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -203,6 +231,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -211,6 +240,53 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
+	/*
+	 * Full barrier (A), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+#endif
+}
+
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
+#else
+	return 1;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -237,6 +313,8 @@ static inline int match_futex(union futex_key *key1, union futex_key *key2)
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
+ * Implies a full memory barrier; relied upon as (B), see the comment above
+ * about ordering.
  */
 static void get_futex_key_refs(union futex_key *key)
 {
@@ -245,10 +323,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB */
 		break;
 	}
 }
@@ -292,6 +370,8 @@ static void drop_futex_key_refs(union futex_key *key)
  * We can usually work out the index without swapping in the page.
  *
  * lock_page() might sleep, the caller should not hold a spinlock.
+ *
+ * Implies a full memory barrier (B).
  */
 static int
 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
@@ -321,7 +401,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 			return -EFAULT;
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key); /* implies MB (B) */
 		return 0;
 	}
 
@@ -428,7 +508,7 @@ again:
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
 	unlock_page(page_head);
@@ -892,6 +972,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }
 
 /*
@@ -1051,6 +1132,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1071,6 +1157,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1189,7 +1276,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1532,17 +1621,28 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that is
+	 * waiting for the spinlock. This is safe as all queue_lock()
+	 * users end up calling queue_me(). Similarly, for housekeeping,
+	 * decrement the counter at queue_unlock() when some error has
+	 * occurred and we don't end up adding the task to the list.
+	 */
+	hb_waiters_inc(hb);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
 	return hb;
 }
 
-static inline void
-queue_unlock(struct futex_hash_bucket *hb)
+static inline void queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	hb_waiters_dec(hb);
 }
 
 /**
@@ -2274,6 +2374,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 		 * Unqueue the futex_q and determine which it was.
 		 */
 		plist_del(&q->list, &hb->chain);
+		hb_waiters_dec(hb);
 
 		/* Handle spurious wakeups gracefully */
 		ret = -EWOULDBLOCK;
@@ -2803,8 +2904,9 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < futex_hashsize; i++) {
-		plist_head_init(&futex_queues[i].chain);
+		atomic_set(&futex_queues[i].waiters, 0);
 		spin_lock_init(&futex_queues[i].lock);
+		plist_head_init(&futex_queues[i].chain);
 	}
 
 	return 0;
-- 
1.8.1.4





* Re: [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-10 16:17         ` Davidlohr Bueso
@ 2013-12-10 18:28           ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2013-12-10 18:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, linux-kernel, dvhart, peterz, paulmck, efault,
	jeffm, torvalds, scott.norton, tom.vaden, aswin, Waiman.Long,
	jason.low2


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> On Tue, 2013-12-10 at 16:36 +0100, Ingo Molnar wrote:
> > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > 
> > > On Fri, 2013-12-06 at 19:26 +0100, Thomas Gleixner wrote:
> > > > On Tue, 3 Dec 2013, Davidlohr Bueso wrote:
> > > > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > > 
> > > > Who is the author of this? According to the SOB chains its Waiman, who
> > > > handed it down to Jason and it finally got sumbitted by you. Is that
> > > > correct?
> > > 
> > > This was a group effort for the original patchset -- things did change
> > > quite a bit for this particular patch in v2. Waiman and Jason helped
> > > review and test the original atomic ops code.
> > 
> > Then please put that into the credits (in the changelog and/or in the 
> > source code), and/or into copyright notices, etc. - but keep the 
> > SOB-chain the true route of the patch, where the primary author is the 
> > first in the SOB and where the person sending the patch is the last 
> > SOB. (the two are identical normally.)
> 
> Ok, I wasn't aware of that primary author rule, sorry. The exact 
> same thing would go for patch 2 then. Is this something you want me 
> to correct for this particular series, or to keep in mind for the 
> future?

Would be nice to fix it, the SOB chain is important for fine print 
reasons.

Thanks,

	Ingo


* Re: [PATCH v2 0/4] futex: Wakeup optimizations
  2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2013-12-12 17:30 ` Peter Zijlstra
  2013-12-18 15:32 ` Davidlohr Bueso
  5 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-12-12 17:30 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2



Acked-by: Peter Zijlstra <peterz@infradead.org>


* Re: [PATCH v2 0/4] futex: Wakeup optimizations
  2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2013-12-12 17:30 ` [PATCH v2 0/4] futex: Wakeup optimizations Peter Zijlstra
@ 2013-12-18 15:32 ` Davidlohr Bueso
  2013-12-19 12:54   ` Ingo Molnar
  5 siblings, 1 reply; 22+ messages in thread
From: Davidlohr Bueso @ 2013-12-18 15:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

ping? 

If no one has any objections, could this patchset be picked up?

Thanks,
Davidlohr

On Tue, 2013-12-03 at 01:45 -0800, Davidlohr Bueso wrote:
> Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
>  - Removed patch "futex: Check for pi futex_q only once".
> 
>  - Cleaned up ifdefs for larger hash table.
> 
>  - Added a doc patch from tglx that describes the futex 
>    ordering guarantees.
> 
>  - Improved the lockless plist check for the wake calls.
>    Based on the community feedback, the necessary abstractions
>    and barriers are added to maintain ordering guarantees.
>    Code documentation is also updated.
> 
>  - Removed patch "sched,futex: Provide delayed wakeup list".
>    Based on feedback from PeterZ, I will look into this as
>    a separate issue once the other patches are settled.
>  
> 
> We have been dealing with a customer database workload on a large
> 12TB, 240-core, 16-socket NUMA system that exhibits high amounts
> of contention on some of the locks that serialize internal futex
> data structures. This workload especially suffers in the wakeup
> paths, where waiting on the corresponding hb->lock can account for
> up to ~60% of the time. The result of such calls can mostly be
> classified as (i) nothing to wake up and (ii) waking up a large
> number of tasks.
> 
> Before these patches are applied, we can see this pathological behavior:
> 
>  37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
>             --- _raw_spin_lock
>              |
>              |--97.14%-- futex_wake
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          |
>              |          |--99.70%-- 0x7f383fbdea1f
>              |          |           yyy
> 
>  43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
>             --- _raw_spin_lock
>              |
>              |--53.74%-- futex_wake
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          |
>              |          |--99.40%-- 0x7fe7d44a4c05
>              |          |           zzz
>              |--45.90%-- futex_wait_setup
>              |          futex_wait
>              |          do_futex
>              |          sys_futex
>              |          system_call_fastpath
>              |          0x7fe7ba315789
>              |          syscall
> 
> 
> With these patches, contention is practically non existent:
> 
>  0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
>                --- _raw_spin_lock
>                 |
>                 |--76.06%-- futex_wait_setup
>                 |          futex_wait
>                 |          do_futex
>                 |          sys_futex
>                 |          system_call_fastpath
>                 |          |
>                 |          |--99.90%-- 0x7f3165e63789
>                 |          |          syscall|
>                            ...
>                 |--6.27%-- futex_wake
>                 |          do_futex
>                 |          sys_futex
>                 |          system_call_fastpath
>                 |          |
>                 |          |--54.56%-- 0x7f317fff2c05
>                 ...
> 
> Patch 1 is a cleanup.
> 
> Patch 2 addresses the well-known issue of the global hash table.
> By creating a larger and NUMA-aware table, we can reduce the false
> sharing and collisions, thus reducing the chance of different futexes
> colliding on the same hb->lock.
> 
> Patch 3 documents the futex ordering guarantees.
> 
> Patch 4 reduces contention on the corresponding hb->lock by not trying to
> acquire it if there are no blocked tasks in the waitqueue.
> This particularly deals with point (i) above, where we see that it is not
> uncommon for up to 90% of wakeup calls to end up returning 0, indicating
> that no tasks were woken.
> 
> This patchset has also been tested on smaller systems for a variety of
> benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> hb locks programs. So far, no functional or performance regressions have been seen.
> Furthermore, no issues were found when running the different tests in the futextest 
> suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
> 
> This patchset applies on top of Linus' tree as of v3.13-rc2 (2e7babfa).
> 
> Special thanks to Scott Norton, Tom Vaden, Mark Ray and Aswin Chandramouleeswaran
> for help presenting, debugging and analyzing the data.
> 
>   futex: Misc cleanups
>   futex: Larger hash table
>   futex: Document ordering guarantees
>   futex: Avoid taking hb lock if nothing to wakeup
> 
>  kernel/futex.c | 230 ++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 194 insertions(+), 36 deletions(-)
> 




* Re: [PATCH v2 0/4] futex: Wakeup optimizations
  2013-12-18 15:32 ` Davidlohr Bueso
@ 2013-12-19 12:54   ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2013-12-19 12:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, dvhart, peterz, tglx, paulmck, efault, jeffm,
	torvalds, scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> ping? 
> 
> If no one has any objections, could this patchset be picked up?

There were a couple of review comments that I don't see fully 
addressed, one of them involving the messy looking SOB chain, so a new 
version needs to be sent in any case.

Thanks,

	Ingo


end of thread, other threads:[~2013-12-19 12:55 UTC | newest]

Thread overview: 22+ messages
2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-03  9:45 ` [PATCH v2 1/4] futex: Misc cleanups Davidlohr Bueso
2013-12-03  9:45 ` [PATCH v2 2/4] futex: Larger hash table Davidlohr Bueso
2013-12-03  9:45 ` [PATCH v2 3/4] futex: Document ordering guarantees Davidlohr Bueso
2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-06 18:26   ` Thomas Gleixner
2013-12-06 19:04     ` Davidlohr Bueso
2013-12-10 15:36       ` Ingo Molnar
2013-12-10 16:17         ` Davidlohr Bueso
2013-12-10 18:28           ` Ingo Molnar
2013-12-10 16:08   ` Peter Zijlstra
2013-12-10 16:24   ` Peter Zijlstra
2013-12-10 16:57   ` Peter Zijlstra
2013-12-10 17:22     ` Davidlohr Bueso
2013-12-10 17:34       ` Peter Zijlstra
2013-12-10 17:15   ` Peter Zijlstra
2013-12-10 17:36     ` Davidlohr Bueso
2013-12-10 18:00       ` Peter Zijlstra
2013-12-10 18:10   ` [PATCH v3 " Davidlohr Bueso
2013-12-12 17:30 ` [PATCH v2 0/4] futex: Wakeup optimizations Peter Zijlstra
2013-12-18 15:32 ` Davidlohr Bueso
2013-12-19 12:54   ` Ingo Molnar
