public inbox for linux-kernel@vger.kernel.org
* [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2013-12-10 18:10   ` Davidlohr Bueso
  0 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-10 18:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	scott.norton, tom.vaden, aswin, Waiman.Long, jason.low2

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size. Specifically, increment the counter when an element is
added to the list, and decrement when it is removed. Of course, if the
counter is 0, then there are no tasks blocked on a futex. Some special
considerations:

- increment the counter at queue_lock() as we always end up calling
  queue_me() which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, for which we decrement
  to match the increment done in queue_lock().

- decrement the counter at __unqueue_futex() to reflect when an element
  is removed from the queue for wakeup related purposes.

We make sure that the futex ordering guarantees are preserved, ensuring
that a waiter either observes the changed user space value before blocking
or is woken by a concurrent waker. This is done by relying on the barriers
in atomic_inc() -- for archs that do not have an implicit mb in atomic_inc()
we explicitly add it through a set of new functions that are introduced:
futex_get_mm(), hb_waiters_inc() and hb_waiters_dec(). For more details
please refer to the updated comments in the code and related discussion:
https://lkml.org/lkml/2013/11/26/556

Furthermore, the overhead of new barriers is trivial. The following are some
results, from a quad-core x86_64 laptop, measuring the latency of nthread
wakeups (1 at a time), for 1000 runs:

+----------+-----------------------------+------------------------------+
| nthreads | baseline time (ms) [stddev] | atomicops time (ms) [stddev] |
+----------+-----------------------------+------------------------------+
|      512 | 2.8360 [0.5168]             | 3.8150 [1.3293]              |
|      256 | 2.5080 [0.6375]             | 2.5980 [0.9079]              |
|      128 | 1.0200 [0.4264]             | 1.5180 [0.4902]              |
|       64 | 0.7890 [0.2667]             | 0.4020 [0.2447]              |
|       32 | 0.1150 [0.0184]             | 0.1490 [0.1156]              |
+----------+-----------------------------+------------------------------+

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
Changes from v2: 
 - Improved ordering guarantee comments -- peterz.
 - Reordered SOB tags to reflect me as primary author.

 kernel/futex.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 122 insertions(+), 20 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 75719bd..3a3c606 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -84,8 +84,11 @@
  * bucket lock. Then it looks for waiters on that futex in the hash
  * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In scenarios where wakeups are called and no tasks are blocked on a futex,
+ * taking the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,49 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read; and
+ * where (B) orders the write to futex and the waiters read.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *	X = Y = 0
+ *
+ *	w[X]=1		w[Y]=1
+ *	MB		MB
+ *	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -203,6 +231,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -211,6 +240,53 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
+	/*
+	 * Full barrier (A), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+#endif
+}
+
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
+#else
+	return 1;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -237,6 +313,8 @@ static inline int match_futex(union futex_key *key1, union futex_key *key2)
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
+ * Implies a full memory barrier; relied upon as (B), see the comment above
+ * about ordering.
  */
 static void get_futex_key_refs(union futex_key *key)
 {
@@ -245,10 +323,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB */
 		break;
 	}
 }
@@ -292,6 +370,8 @@ static void drop_futex_key_refs(union futex_key *key)
  * We can usually work out the index without swapping in the page.
  *
  * lock_page() might sleep, the caller should not hold a spinlock.
+ *
+ * Implies a full memory barrier (B).
  */
 static int
 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
@@ -321,7 +401,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 			return -EFAULT;
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key); /* implies MB (B) */
 		return 0;
 	}
 
@@ -428,7 +508,7 @@ again:
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
 	unlock_page(page_head);
@@ -892,6 +972,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }
 
 /*
@@ -1051,6 +1132,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1071,6 +1157,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1189,7 +1276,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1532,17 +1621,28 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that is
+	 * waiting for the spinlock. This is safe as all queue_lock()
+	 * users end up calling queue_me(). Similarly, for housekeeping,
+	 * decrement the counter at queue_unlock() when some error has
+	 * occurred and we don't end up adding the task to the list.
+	 */
+	hb_waiters_inc(hb);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
 	return hb;
 }
 
-static inline void
-queue_unlock(struct futex_hash_bucket *hb)
+static inline void queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	hb_waiters_dec(hb);
 }
 
 /**
@@ -2274,6 +2374,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 		 * Unqueue the futex_q and determine which it was.
 		 */
 		plist_del(&q->list, &hb->chain);
+		hb_waiters_dec(hb);
 
 		/* Handle spurious wakeups gracefully */
 		ret = -EWOULDBLOCK;
@@ -2803,8 +2904,9 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < futex_hashsize; i++) {
-		plist_head_init(&futex_queues[i].chain);
+		atomic_set(&futex_queues[i].waiters, 0);
 		spin_lock_init(&futex_queues[i].lock);
+		plist_head_init(&futex_queues[i].chain);
 	}
 
 	return 0;
-- 
1.8.1.4




^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 0/4] futex: Wakeup optimizations
@ 2013-12-19 18:45 Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 1/4] futex: Misc cleanups Davidlohr Bueso
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr

Changes from v2 [http://lwn.net/Articles/575449/]:
 - Reordered SOB tags to reflect me as primary author.

 - Improved ordering guarantee comments for patch 4.

 - Rebased patch 4 against Linus' tree (this patch didn't
   apply after the recent futex changes/fixes).

Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
 - Removed patch "futex: Check for pi futex_q only once".

 - Cleaned up ifdefs for larger hash table.

 - Added a doc patch from tglx that describes the futex 
   ordering guarantees.

 - Improved the lockless plist check for the wake calls.
   Based on the community feedback, the necessary abstractions
   and barriers are added to maintain ordering guarantees.
   Code documentation is also updated.

 - Removed patch "sched,futex: Provide delayed wakeup list".
   Based on feedback from PeterZ, I will look into this as
   a separate issue once the other patches are settled.
 

We have been dealing with a customer database workload on a large
12TB, 240-core, 16-socket NUMA system that exhibits high amounts
of contention on some of the locks that serialize internal futex
data structures. This workload especially suffers in the wakeup
paths, where waiting on the corresponding hb->lock can account for
up to ~60% of the time. The result of such calls can mostly be
classified as (i) nothing to wake up and (ii) waking up a large
number of tasks.

Before these patches are applied, we can see this pathological behavior:

 37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--97.14%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.70%-- 0x7f383fbdea1f
             |          |           yyy

 43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--53.74%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.40%-- 0x7fe7d44a4c05
             |          |           zzz
             |--45.90%-- futex_wait_setup
             |          futex_wait
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          0x7fe7ba315789
             |          syscall


With these patches, contention is practically non existent:

 0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
               --- _raw_spin_lock
                |
                |--76.06%-- futex_wait_setup
                |          futex_wait
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--99.90%-- 0x7f3165e63789
                |          |          syscall
                           ...
                |--6.27%-- futex_wake
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--54.56%-- 0x7f317fff2c05
                ...

Patch 1 is a cleanup.

Patch 2 addresses the well known issue of the global hash table.
By creating a larger and NUMA aware table, we can reduce the false
sharing and collisions, thus reducing the chance of different futexes 
using hb->lock.

Patch 3 documents the futex ordering guarantees.

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly deals with point (i) above, where we see that it is not
uncommon for up to 90% of wakeup calls to end up returning 0, indicating
that no tasks were woken.

This patchset has also been tested on smaller systems for a variety of
benchmarks, including java workloads, kernel builds and custom
bang-the-hell-out-of-hb-locks programs. So far, no functional or performance
regressions have been seen.
Furthermore, no issues were found when running the different tests in the futextest 
suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc4 (f7556698).

Special thanks to Scott Norton, Tom Vaden, Mark Ray and Aswin Chandramouleeswaran
for help presenting, debugging and analyzing the data.

  futex: Misc cleanups
  futex: Larger hash table
  futex: Document ordering guarantees
  futex: Avoid taking hb lock if nothing to wakeup

 kernel/futex.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 196 insertions(+), 40 deletions(-)

-- 
1.8.1.4



* [PATCH v3 1/4] futex: Misc cleanups
  2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
@ 2013-12-19 18:45 ` Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 2/4] futex: Larger hash table Davidlohr Bueso
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr, Ingo Molnar

From: Jason Low <jason.low2@hp.com>

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 39 ++++++++++++---------------------------
 1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f6ff019..085f5fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -598,13 +598,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 {
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	struct task_struct *p;
 	pid_t pid = uval & FUTEX_TID_MASK;
 
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex(&this->key, key)) {
 			/*
 			 * Another waiter already exists - bump up
@@ -986,7 +983,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
 
@@ -999,9 +995,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 
 	hb = hash_futex(&key);
 	spin_lock(&hb->lock);
-	head = &hb->chain;
 
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (match_futex (&this->key, &key)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1034,7 +1029,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret;
 
@@ -1082,9 +1076,7 @@ retry_private:
 		goto retry;
 	}
 
-	head = &hb1->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (match_futex (&this->key, &key1)) {
 			if (this->pi_state || this->rt_waiter) {
 				ret = -EINVAL;
@@ -1097,10 +1089,8 @@ retry_private:
 	}
 
 	if (op_ret > 0) {
-		head = &hb2->chain;
-
 		op_ret = 0;
-		plist_for_each_entry_safe(this, next, head, list) {
+		plist_for_each_entry_safe(this, next, &hb2->chain, list) {
 			if (match_futex (&this->key, &key2)) {
 				if (this->pi_state || this->rt_waiter) {
 					ret = -EINVAL;
@@ -1270,7 +1260,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	int drop_count = 0, task_count = 0, ret;
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
-	struct plist_head *head1;
 	struct futex_q *this, *next;
 	u32 curval2;
 
@@ -1393,8 +1382,7 @@ retry_private:
 		}
 	}
 
-	head1 = &hb1->chain;
-	plist_for_each_entry_safe(this, next, head1, list) {
+	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
 		if (task_count - nr_wake >= nr_requeue)
 			break;
 
@@ -1494,7 +1482,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 }
 
 static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
@@ -1867,7 +1855,7 @@ retry_private:
 	ret = get_futex_value_locked(&uval, uaddr);
 
 	if (ret) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 
 		ret = get_user(uval, uaddr);
 		if (ret)
@@ -1881,7 +1869,7 @@ retry_private:
 	}
 
 	if (uval != val) {
-		queue_unlock(q, *hb);
+		queue_unlock(*hb);
 		ret = -EWOULDBLOCK;
 	}
 
@@ -2029,7 +2017,7 @@ retry_private:
 			 * Task is exiting and we just wait for the
 			 * exit to complete.
 			 */
-			queue_unlock(&q, hb);
+			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
 			goto retry;
@@ -2081,7 +2069,7 @@ retry_private:
 	goto out_put_key;
 
 out_unlock_put_key:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 out_put_key:
 	put_futex_key(&q.key);
@@ -2091,7 +2079,7 @@ out:
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
 uaddr_faulted:
-	queue_unlock(&q, hb);
+	queue_unlock(hb);
 
 	ret = fault_in_user_writeable(uaddr);
 	if (ret)
@@ -2113,7 +2101,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
-	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
 	u32 uval, vpid = task_pid_vnr(current);
 	int ret;
@@ -2153,9 +2140,7 @@ retry:
 	 * Ok, other tasks may need to be woken up - check waiters
 	 * and do the wakeup if necessary:
 	 */
-	head = &hb->chain;
-
-	plist_for_each_entry_safe(this, next, head, list) {
+	plist_for_each_entry_safe(this, next, &hb->chain, list) {
 		if (!match_futex (&this->key, &key))
 			continue;
 		ret = wake_futex_pi(uaddr, uval, this);
-- 
1.8.1.4



* [PATCH v3 2/4] futex: Larger hash table
  2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 1/4] futex: Misc cleanups Davidlohr Bueso
@ 2013-12-19 18:45 ` Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 3/4] futex: Document ordering guarantees Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  3 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr, Ingo Molnar

Currently, the futex global hash table suffers from its fixed, smallish
(for today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems, using many futexes, can be prone to high amounts
of collisions, where these futexes hash to the same bucket and lead to
extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
reality when we have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small systems,
or otherwise, 256 * ncpus (or larger, as we need to round the number to a
power of 2). Note that this number of CPUs accounts for all CPUs that can
ever be available in the system, taking into consideration things like
hotplugging. While we do impose extra overhead at bootup by making the hash
table larger, this is a one time thing, and does not overshadow the benefits
of this patch.

Furthermore, as suggested by tglx, by cache aligning the hash buckets we can
avoid access across cacheline boundaries and also avoid massive cache line
bouncing if multiple cpus are hammering away at different hash buckets which
happen to reside in the same cache line.

Also, similar to other core kernel components (pid, dcache, tcp), by using
alloc_large_system_hash() we benefit from its NUMA awareness and thus the
table is distributed among the nodes instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
path fail at futex_wait_setup() by returning -EWOULDBLOCK for large amounts of
futexes -- we can see the following benefits on an 80-core, 8-socket 1TB server:

+---------+--------------------+------------------------+-----------------------+-------------------------------+
| threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
|     512 |		 32426 | 50531  (+55.8%)	| 255274  (+687.2%)	| 292553  (+802.2%)		|
|     256 |		 65360 | 99588  (+52.3%)	| 443563  (+578.6%)	| 508088  (+677.3%)		|
|     128 |		125635 | 200075 (+59.2%)	| 742613  (+491.1%)	| 835452  (+564.9%)		|
|      80 |		193559 | 323425 (+67.1%)	| 1028147 (+431.1%)	| 1130304 (+483.9%)		|
|      64 |		247667 | 443740 (+79.1%)	| 997300  (+302.6%)	| 1145494 (+362.5%)		|
|      32 |		628412 | 721401 (+14.7%)	| 965996  (+53.7%)	| 1122115 (+78.5%)		|
+---------+--------------------+------------------------+-----------------------+-------------------------------+

Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/futex.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 085f5fa..577481d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
+#include <linux/bootmem.h>
 
 #include <asm/futex.h>
 
@@ -70,8 +71,6 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
-
 /*
  * Futex flags used to encode options to functions and preserve them across
  * restarts.
@@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
 struct futex_hash_bucket {
 	spinlock_t lock;
 	struct plist_head chain;
-};
+} ____cacheline_aligned_in_smp;
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static unsigned long __read_mostly futex_hashsize;
+
+static struct futex_hash_bucket *futex_queues;
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
 /*
@@ -2719,7 +2720,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 static int __init futex_init(void)
 {
 	u32 curval;
-	int i;
+	unsigned long i;
+
+#if CONFIG_BASE_SMALL
+	futex_hashsize = 16;
+#else
+	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+#endif
+
+	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+					       futex_hashsize, 0,
+					       futex_hashsize < 256 ? HASH_SMALL : 0,
+					       NULL, NULL, futex_hashsize, futex_hashsize);
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
@@ -2734,7 +2746,7 @@ static int __init futex_init(void)
 	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
-- 
1.8.1.4



* [PATCH v3 3/4] futex: Document ordering guarantees
  2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 1/4] futex: Misc cleanups Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 2/4] futex: Larger hash table Davidlohr Bueso
@ 2013-12-19 18:45 ` Davidlohr Bueso
  2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  3 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr, Ingo Molnar

From: Thomas Gleixner <tglx@linutronix.de>

That's essential, if you want to hack on futexes.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Jason Low <jason.low2@hp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 577481d..af1fc31 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,63 @@
 
 #include "locking/rtmutex_common.h"
 
+/*
+ * Basic futex operation and ordering guarantees:
+ *
+ * The waiter reads the futex value in user space and calls
+ * futex_wait(). It computes the hash bucket and acquires the hash
+ * bucket lock. After that it reads the futex user space value again
+ * and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash
+ * bucket lock and schedules.
+ *
+ * The waker side modifies the user space value of the futex and calls
+ * futex_wake(). It computes the hash bucket and acquires the hash
+ * bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
+ *
+ * Note that the spin_lock serializes waiters and wakers, so that the
+ * following scenario is avoided:
+ *
+ * CPU 0                               CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ *   futex_wait(futex, val);
+ *   uval = *futex;
+ *                                     *futex = newval;
+ *                                     sys_futex(WAKE, futex);
+ *                                       futex_wake(futex);
+ *                                       if (queue_empty())
+ *                                         return;
+ *   if (uval == val)
+ *      lock(hash_bucket(futex));
+ *      queue();
+ *     unlock(hash_bucket(futex));
+ *     schedule();
+ *
+ * This would cause the waiter on CPU 0 to wait forever because it
+ * missed the transition of the user space value from val to newval
+ * and the waker did not find the waiter in the hash bucket queue.
+ * The spinlock serializes that:
+ *
+ * CPU 0                               CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ *   futex_wait(futex, val);
+ *   lock(hash_bucket(futex));
+ *   uval = *futex;
+ *                                     *futex = newval;
+ *                                     sys_futex(WAKE, futex);
+ *                                       futex_wake(futex);
+ *                                       lock(hash_bucket(futex));
+ *   if (uval == val)
+ *      queue();
+ *     unlock(hash_bucket(futex));
+ *     schedule();                       if (!queue_empty())
+ *                                         wake_waiters(futex);
+ *                                       unlock(hash_bucket(futex));
+ */
+
 int __read_mostly futex_cmpxchg_enabled;
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2013-12-19 18:45 ` [PATCH v3 3/4] futex: Document ordering guarantees Davidlohr Bueso
@ 2013-12-19 18:45 ` Davidlohr Bueso
  2013-12-19 19:25   ` Ingo Molnar
  2013-12-19 23:14   ` Linus Torvalds
  3 siblings, 2 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, dvhart, peterz, tglx, paulmck, efault, jeffm, torvalds,
	jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin,
	davidlohr, Ingo Molnar

In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the smaller cost of doing some atomic operations to keep track of
the list's size. Specifically, increment the counter when an element is
added to the list, and decrement when it is removed. Of course, if the
counter is 0, then there are no tasks blocked on a futex. Some special
considerations:

- increment the counter at queue_lock() as we always end up calling
  queue_me() which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, for which we decrement
  to match the increment done in queue_lock().

- decrement the counter at __unqueue_futex() to reflect when an element is
  removed from the queue for wakeup related purposes.

We make sure that the futex ordering guarantees are preserved, ensuring
that a waiter either observes the changed user space value before blocking
or is woken by a concurrent waker. This is done by relying on the barriers
in atomic_inc() -- for archs that do not have an implicit mb in atomic_inc(),
we explicitly add it through a set of new functions that are introduced:
futex_get_mm(), hb_waiters_inc() and hb_waiters_dec(). For more details
please refer to the updated comments in the code and related discussion:
https://lkml.org/lkml/2013/11/26/556

Furthermore, the overhead of the new barriers is trivial. The following are
some results, from a quad-core x86_64 laptop, measuring the latency of
nthread wakeups (1 at a time), for 1000 runs:

+----------+-----------------------------+------------------------------+
| nthreads | baseline time (ms) [stddev] | atomicops time (ms) [stddev] |
+----------+-----------------------------+------------------------------+
|      512 | 2.8360 [0.5168]             | 3.8150 [1.3293]              |
|      256 | 2.5080 [0.6375]             | 2.5980 [0.9079]              |
|      128 | 1.0200 [0.4264]             | 1.5180 [0.4902]              |
|       64 | 0.7890 [0.2667]             | 0.4020 [0.2447]              |
|       32 | 0.1150 [0.0184]             | 0.1490 [0.1156]              |
+----------+-----------------------------+------------------------------+

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/futex.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 122 insertions(+), 20 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index af1fc31..201af6b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -84,8 +84,11 @@
  * bucket lock. Then it looks for waiters on that futex in the hash
  * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In scenarios where wakeups are called and no tasks are blocked on a futex,
+ * taking the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,49 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read; and
+ * where (B) orders the write to futex and the waiters read.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *	X = Y = 0
+ *
+ *	w[X]=1		w[Y]=1
+ *	MB		MB
+ *	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -203,6 +231,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -211,6 +240,53 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
+	/*
+	 * Full barrier (A), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+#endif
+}
+
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
+#else
+	return 1;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -237,6 +313,8 @@ static inline int match_futex(union futex_key *key1, union futex_key *key2)
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
+ * Implies a full memory barrier; relied upon as (B), see the comment above
+ * about ordering.
  */
 static void get_futex_key_refs(union futex_key *key)
 {
@@ -245,10 +323,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB */
 		break;
 	}
 }
@@ -292,6 +370,8 @@ static void drop_futex_key_refs(union futex_key *key)
  * We can usually work out the index without swapping in the page.
  *
  * lock_page() might sleep, the caller should not hold a spinlock.
+ *
+ * Implies a full memory barrier (B).
  */
 static int
 get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
@@ -322,7 +402,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key); /* implies MB (B) */
 		return 0;
 	}
 
@@ -429,7 +509,7 @@ again:
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
 	unlock_page(page_head);
@@ -893,6 +973,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }
 
 /*
@@ -1052,6 +1133,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1158,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1190,7 +1277,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1533,17 +1622,28 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that is
+	 * waiting for the spinlock. This is safe as all queue_lock()
+	 * users end up calling queue_me(). Similarly, for housekeeping,
+	 * decrement the counter at queue_unlock() when some error has
+	 * occurred and we don't end up adding the task to the list.
+	 */
+	hb_waiters_inc(hb);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
 	return hb;
 }
 
-static inline void
-queue_unlock(struct futex_hash_bucket *hb)
+static inline void queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	hb_waiters_dec(hb);
 }
 
 /**
@@ -2275,6 +2375,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 		 * Unqueue the futex_q and determine which it was.
 		 */
 		plist_del(&q->list, &hb->chain);
+		hb_waiters_dec(hb);
 
 		/* Handle spurious wakeups gracefully */
 		ret = -EWOULDBLOCK;
@@ -2290,7 +2391,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
  * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
  * @uaddr:	the futex we initially wait on (non-pi)
  * @flags:	futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.), they must be
- * 		the same type, no requeueing from private to shared, etc.
+ *		the same type, no requeueing from private to shared, etc.
  * @val:	the expected value of uaddr
  * @abs_time:	absolute timeout
  * @bitset:	32 bit wakeup bitset set by userspace, defaults to all
@@ -2804,6 +2905,7 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < futex_hashsize; i++) {
+		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
-- 
1.8.1.4



* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
@ 2013-12-19 19:25   ` Ingo Molnar
  2013-12-19 19:29     ` Davidlohr Bueso
  2013-12-19 23:14   ` Linus Torvalds
  1 sibling, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2013-12-19 19:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>

So, that's not a valid SOB sequence either, the person sending me a 
patch should be the last person in the SOB chain - if you want to 
credit them please add a Reviewed-by or add special credits to the 
changelog.

Thanks,

	Ingo


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 19:25   ` Ingo Molnar
@ 2013-12-19 19:29     ` Davidlohr Bueso
  2013-12-19 19:42       ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 19:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin

On Thu, 2013-12-19 at 20:25 +0100, Ingo Molnar wrote:
> * Davidlohr Bueso <davidlohr@hp.com> wrote:
> 
> > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > Signed-off-by: Jason Low <jason.low2@hp.com>
> 
> So, that's not a valid SOB sequence either, the person sending me a 
> patch should be the last person in the SOB chain 

Which is why I had it like that in the original version.

> - if you want to 
> credit them please add a Reviewed-by or add special credits to the 
> changelog.

I will change the tags and resend in a v4.

Thanks,
Davidlohr



* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 19:29     ` Davidlohr Bueso
@ 2013-12-19 19:42       ` Ingo Molnar
  2013-12-19 22:35         ` Joe Perches
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2013-12-19 19:42 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-kernel, mingo, dvhart, peterz, tglx, paulmck, efault, jeffm,
	torvalds, jason.low2, Waiman.Long, tom.vaden, scott.norton, aswin


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> On Thu, 2013-12-19 at 20:25 +0100, Ingo Molnar wrote:
> > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > 
> > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > 
> > So, that's not a valid SOB sequence either, the person sending me a 
> > patch should be the last person in the SOB chain 
> 
> Which is why I had it like that in the original version.

The problem with that order was that the first person should be the 
primary author and in the 'From' tag.

A SOB chain is intended to depict the true propagation/route of a 
patch, from author, through maintainers who handle and forward it, to 
the maintainer who applies it to a Git tree. The patch starts up with 
a single SOB (the primary author's) and every 'hop' after that adds a 
SOB to the tail of the existing SOB chain.

> > - if you want to 
> > credit them please add a Reviewed-by or add special credits to the 
> > changelog.
> 
> I will change the tags and resend in a v4.

Thanks,

	Ingo


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 19:42       ` Ingo Molnar
@ 2013-12-19 22:35         ` Joe Perches
  2013-12-20  9:11           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Joe Perches @ 2013-12-19 22:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davidlohr Bueso, linux-kernel, mingo, dvhart, peterz, tglx,
	paulmck, efault, jeffm, torvalds, jason.low2, Waiman.Long,
	tom.vaden, scott.norton, aswin

On Thu, 2013-12-19 at 20:42 +0100, Ingo Molnar wrote:
> * Davidlohr Bueso <davidlohr@hp.com> wrote:
> 
> > On Thu, 2013-12-19 at 20:25 +0100, Ingo Molnar wrote:
> > > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > 
> > > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > 
> > > So, that's not a valid SOB sequence either, the person sending me a 
> > > patch should be the last person in the SOB chain 
> > 
> > Which is why I had it like that in the original version.
> 
> The problem with that order was that the first person should be the 
> primary author and in the 'From' tag.
> 
> A SOB chain is intended to depict the true propagation/route of a 
> patch, from author, through maintainers who handle and forward it, to 
> the maintainer who applies it to a Git tree. The patch starts up with 
> a single SOB (the primary author's) and every 'hop' after that adds a 
> SOB to the tail of the existing SOB chain.

Multiple "Signed-off-by:"s are also used when there are
multiple authors of a single patch.




* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
  2013-12-19 19:25   ` Ingo Molnar
@ 2013-12-19 23:14   ` Linus Torvalds
  2013-12-19 23:42     ` Davidlohr Bueso
                       ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Linus Torvalds @ 2013-12-19 23:14 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> - increment the counter at queue_lock() as we always end up calling
>   queue_me() which adds the element to the list. Upon any error,
>   queue_unlock() is called for housekeeping, for which we decrement
>   to match the increment done in queue_lock().
>
> - decrement the counter at __unqueue_futex() to reflect when an element is
>   removed from the queue for wakeup related purposes.

I still hate this whole separate counter thing. It seems really annoying.

If re-ordering things didn't work out, then why can't just the counter
we *already* have in the spinlock itself work as the counter? Your
counter update logic seems to basically match when you take the
spinlock anyway.

The *testing* side doesn't actually care about how many waiters there
are, it only cares about whether there are waiters. And it can look at
the wait-list for that - but you want to close the race between the
entry actually getting added to the list using this counter. But the
place you increment the new counter is the same place as you take the
spinlock, which does that ticket increment. No?

So I still think this can be done without that new counter field, or
any new atomics.

hb_waiters_pending() could be something like

    spin_contended(hb->lock) || !list_empty(&hb->chain)

which considering where you increment the new count should be
equivalent to your "!!hb->waiters". The "smp_mb_after_atomic_inc()" on
the spinlock side would become "smp_mb_after_spin_lock()" instead.

Yes? No? Why?

            Linus


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:14   ` Linus Torvalds
@ 2013-12-19 23:42     ` Davidlohr Bueso
  2013-12-19 23:53       ` Linus Torvalds
  2013-12-20 19:30     ` Davidlohr Bueso
  2013-12-21  1:36     ` Davidlohr Bueso
  2 siblings, 1 reply; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-19 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > - increment the counter at queue_lock() as we always end up calling
> >   queue_me() which adds the element to the list. Upon any error,
> >   queue_unlock() is called for housekeeping, for which we decrement
> >   to match the increment done in queue_lock().
> >
> > - decrement the counter at __unqueue_futex() to reflect when an element is
> >   removed from the queue for wakeup related purposes.
> 
> I still hate this whole separate counter thing. It seems really annoying.
> 
> If re-ordering things didn't work out, then why can't just the counter
> we *already* have in the spinlock itself work as the counter? Your
> counter update logic seems to basically match when you take the
> spinlock anyway.
> 
> The *testing* side doesn't actually care about how many waiters there
> are, it only cares about whether there are waiters. 

True.

> And it can look at
> the wait-list for that - but you want to close the race between the
> entry actually getting added to the list using this counter. But the
> place you increment the new counter is the same place as you take the
> spinlock, which does that ticket increment. No?

I don't think so. If we rely on this, then we could end up missing
to-be-queued tasks that are in the process of acquiring the lock, so
waiters could sleep forever. So we need a way of acknowledging that a
task is in the process of waiting when concurrently doing wakeups.

Thanks,
Davidlohr



* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:42     ` Davidlohr Bueso
@ 2013-12-19 23:53       ` Linus Torvalds
  2013-12-20  0:04         ` Linus Torvalds
                           ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Linus Torvalds @ 2013-12-19 23:53 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, Dec 19, 2013 at 3:42 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
>> And it can look at
>> the wait-list for that - but you want to close the race between the
>> entry actually getting added to the list using this counter. But the
>> place you increment the new counter is the same place as you take the
>> spinlock, which does that ticket increment. No?
>
> I don't think so. If we rely on this, then we could end up missing
> to-be-queued tasks that are in the process of acquiring the lock, so
> waiters could sleep forever. So we need a way of acknowledging that a
> task is in the process of waiting when concurrently doing wakeups.

I don't understand. Why is that any better with the separate atomic count?

The only place you increment the count - hb_waiters_inc() - is called
in exactly two places:

 - in requeue_futex(), which already holds the spinlock on both queues

 - in queue_lock(), immediately before getting the spinlock (which
will do the SAME ATOMIC INCREMENT, except it's just doing it on a
different member of the structure, namely the spinlock head)

so if you drop the waiters increment, and instead use the spinlock
head increment as the "waiters", you get exactly the same semantics.
If anything, for the requeue_futex() case, the spinlock increment was
done *earlier*, but that's really not relevant.

So how could we miss this? Explain to me what the separate counter
does that isn't done by the spinlock head counter.

              Linus


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:53       ` Linus Torvalds
@ 2013-12-20  0:04         ` Linus Torvalds
  2013-12-20  0:26           ` Darren Hart
  2013-12-20  0:12         ` Linus Torvalds
  2013-12-20  2:22         ` Davidlohr Bueso
  2 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2013-12-20  0:04 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, Dec 19, 2013 at 3:53 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  - in queue_lock(), immediately before getting the spinlock (which
> will do the SAME ATOMIC INCREMENT, except it's just doing it on a
> different member of the structure, namely the spinlock head)

Ok, so there's the "q->lock_ptr = &hb->lock" assignment in between,
but if the ordering of that is critical, it should be documented in
the memory ordering rules, because it sure is subtle and not obviously
visible anywhere..

            Linus


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:53       ` Linus Torvalds
  2013-12-20  0:04         ` Linus Torvalds
@ 2013-12-20  0:12         ` Linus Torvalds
  2013-12-20  2:22         ` Davidlohr Bueso
  2 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2013-12-20  0:12 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, Dec 19, 2013 at 3:53 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So how could we miss this? Explain to me what the separate counter
> does that isn't done by the spinlock head counter.

Hmm. Trying to answer this myself by looking at the code. And I just
realized that I described things wrong ("spin_contended()" rather than
"spin_is_locked()"). So that was a bug in my description.

One difference is that by avoiding the counter, the "do we have
waiters" is two values rather than one ("is spin locked" + "node list
head is empty"). So there are possible memory ordering issues wrt
reading those two fields, but I don't see it mattering: since you'd
need to read both "not locked" _and_ "list empty", any re-ordering
that shows both of those cases should be able to show the "waiters ==
0" case in the explicit separate counter model.

I don't know why I care, though. For some reason that extra counter
just makes me go "I'm convinced it's not needed", even though I can't
really explain why I should care about it.

        Linus


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20  0:04         ` Linus Torvalds
@ 2013-12-20  0:26           ` Darren Hart
  0 siblings, 0 replies; 30+ messages in thread
From: Darren Hart @ 2013-12-20  0:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, 2013-12-19 at 16:04 -0800, Linus Torvalds wrote:
> On Thu, Dec 19, 2013 at 3:53 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >  - in queue_lock(), immediately before getting the spinlock (which
> > will do the SAME ATOMIC INCREMENT, except it's just doing it on a
> > different member of the structure, namely the spinlock head)
> 
> Ok, so there's the "q->lock_ptr = &hb->lock" assignment in between,
> but if the ordering of that is critical, it should be documented in
> the memory ordering rules, because it sure is subtle and not obviously
> visible anywhere..

I'm trying to look into this to confirm, but it will require some time
for me to page that all in. There are some issues with respect to the
lock_ptr that are documented in the other functions (like unqueue_me),
but I *think* they are mostly on the other end of the operation and may
not be an issue here... 

q->lock_ptr == 0 also indicates "woken" state.

I will spend some time on this to try and get a definitive answer.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel




* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:53       ` Linus Torvalds
  2013-12-20  0:04         ` Linus Torvalds
  2013-12-20  0:12         ` Linus Torvalds
@ 2013-12-20  2:22         ` Davidlohr Bueso
  2013-12-20  3:13           ` Linus Torvalds
  2 siblings, 1 reply; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-20  2:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, 2013-12-19 at 15:53 -0800, Linus Torvalds wrote:
> On Thu, Dec 19, 2013 at 3:42 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> >> And it can look at
> >> the wait-list for that - but you want to close the race between the
> >> entry actually getting added to the list using this counter. But the
> >> place you increment the new counter is the same place as you take the
> >> spinlock, which does that ticket increment. No?
> >
> > I don't think so. If we rely on this, then we could end up missing
> > to-be-queued tasks that are in the process of acquiring the lock, so
> > waiters could sleep forever. So we need a way of acknowledging that a
> > task is in the process of waiting when concurrently doing wakeups.
> 
> I don't understand. Why is that any better with the separate atomic count?
> 
> The only place you increment the count - hb_waiters_inc() - is called
> in exactly two places:
> 
>  - in requeue_futex(), which already holds the spinlock on both queues
> 
>  - in queue_lock(), immediately before getting the spinlock (which
> will do the SAME ATOMIC INCREMENT, except it's just doing it on a
> different member of the structure, namely the spinlock head)
> 
> so if you drop the waiters increment, and instead use the spinlock
> head increment as the "waiters", you get exactly the same semantics.
> If anything, for the requeue_futex() case, the spinlock increment was
> done *earlier*, but that's really not relevant.

Ok so when I replied I was thinking about the plist really and not the
hb->lock ticket counter. Yeah, what you propose does make sense for
waiters. However in wake paths we have to decrement  the counter nwake
times (per each call to __unqueue_futex()), and if we rely on the
ticket, then we do it only once, in the unlock, so the counter doesn't
reflect the actual waitqueue size. Or am I missing something with ticket
spinlocks (I'm not very familiar with that code)?

Also by using the ticket counter we can run into scenarios where empty
wakers could still contend for the lock since we don't update the
counter after unlocking, instead of after plist_del. I guess this would
mostly affect requeuers since queue_unlock() would be exactly the same
as to what you propose.

Now, we've discussed this in the past and have agreed that we cannot
rely on plist_head_empty() as a waiter can queued after the user space
value changed and a concurrent wakeup will just return because at that
time plist is empty; thus the waiter blocks forever -- this program
illustrates the problem with nthreads > 1:

http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/tree/performance/futex_wait.c

Yes, the solution to that is to optimistically add the waiter to the
queue and afterwards do the "has the value changed?" check. I do not see
this being any better than using a counter. As Thomas pointed out
previously, we already dirty the cacheline for waiter paths and don't
have to do any housekeeping/dequeuing in case when the futex_wait() call
doesn't succeed.

Btw, I tried using spin_is_contended + plist_head_empty and we run into
a whole bunch of nasty issues which I have yet to sit down and look
into.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20  2:22         ` Davidlohr Bueso
@ 2013-12-20  3:13           ` Linus Torvalds
  2013-12-20  6:13             ` Davidlohr Bueso
  2013-12-20  8:54             ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Linus Torvalds @ 2013-12-20  3:13 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, Dec 19, 2013 at 6:22 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> Ok so when I replied I was thinking about the plist really and not the
> hb->lock ticket counter. Yeah, what you propose does make sense for
> waiters. However in wake paths we have to decrement  the counter nwake
> times (per each call to __unqueue_futex()), and if we rely on the
> ticket, then we do it only once, in the unlock, so the counter doesn't
> reflect the actual waitqueue size. Or am I missing something with ticket
> spinlocks (I'm not very familiar with that code)?

So that's why I suggested checking the ticket lock _and_ the state of
the node list.

The ticket lock state gives you the racy window before the entry has
actually been added to the node list, and then after the lock has been
released, the fact that the list isn't empty gives you the rest.

I think we might have to order the two reads with an smp_rmb - making
sure that we check the lock state first - but I think it should be
otherwise pretty solid.

And maybe there is some reason it doesn't work, but I'd really like to
understand it (and I'd like to avoid the extra atomics for the waiters
count if it at all can be avoided)

> Btw, I tried using spin_is_contended + plist_head_empty and we run into
> a whole bunch of nasty issues which I have yet to sit down and look
> into.

Yeah, I said "spin_contended()" myself initially, but it needs to be
"spin_is_locked()". It's hopefully unlikely to ever actually be
contended (which in ticket lock terms is "head is _more_ than one away
from the tail").

            Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20  3:13           ` Linus Torvalds
@ 2013-12-20  6:13             ` Davidlohr Bueso
  2013-12-20  8:54             ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-20  6:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, 2013-12-19 at 19:13 -0800, Linus Torvalds wrote:
> On Thu, Dec 19, 2013 at 6:22 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > Ok so when I replied I was thinking about the plist really and not the
> > hb->lock ticket counter. Yeah, what you propose does make sense for
> > waiters. However in wake paths we have to decrement  the counter nwake
> > times (per each call to __unqueue_futex()), and if we rely on the
> > ticket, then we do it only once, in the unlock, so the counter doesn't
> > reflect the actual waitqueue size. Or am I missing something with ticket
> > spinlocks (I'm not very familiar with that code)?
> 
> So that's why I suggested checking the ticket lock _and_ the state of
> the node list.
> 
> The ticket lock state gives you the racy window before the entry has
> actually been added to the node list, and then after the lock has been
> released, the fact that the list isn't empty gives you the rest.

Yep, I'm starting to get around the idea now.

> I think we might have to order the two reads with an smp_rmb - making
> sure that we check the lock state first - but I think it should be
> otherwise pretty solid.

Yes, I don't see any guarantees otherwise.

> 
> And maybe there is some reason it doesn't work, but I'd really like to
> understand it (and I'd like to avoid the extra atomics for the waiters
> count if it at all can be avoided)

Other than the lock_ptr assignment I don't see much difference with what
you suggest. Since no wake paths use lock_ptr until after plist_del, and
we've already taken the lock way before then, it doesn't play a role in
what we're discussing.

I've put you suggestion (now corrected with spin_is_locked) and works
just as fine as with atomics. It's surviving my laptop and benchmarks on
a large system. I will wait to see if there are any concerns other might
have before sending another version of this patchset.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20  3:13           ` Linus Torvalds
  2013-12-20  6:13             ` Davidlohr Bueso
@ 2013-12-20  8:54             ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2013-12-20  8:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, Dec 19, 2013 at 07:13:09PM -0800, Linus Torvalds wrote:
> I think we might have to order the two reads with an smp_rmb - making
> sure that we check the lock state first - but I think it should be
> otherwise pretty solid.

> Yeah, I said "spin_contended()" myself initially, but it needs to be
> "spin_is_locked()". It's hopefully unlikely to ever actually be
> contended (which in ticket lock terms is "head is _more_ than one away
> from the tail").


Right, which would give us something like:

  LOCK m		is_locked m
  list_{add,del}	RMB
  UNLOCK m		list_empty

Where the rmb matches the release/unlock. This guarantees that when we
see the unlock we must also see the list modification.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 22:35         ` Joe Perches
@ 2013-12-20  9:11           ` Peter Zijlstra
  2013-12-20 10:27             ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2013-12-20  9:11 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ingo Molnar, Davidlohr Bueso, linux-kernel, mingo, dvhart, tglx,
	paulmck, efault, jeffm, torvalds, jason.low2, Waiman.Long,
	tom.vaden, scott.norton, aswin

On Thu, Dec 19, 2013 at 02:35:17PM -0800, Joe Perches wrote:
> On Thu, 2013-12-19 at 20:42 +0100, Ingo Molnar wrote:
> > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > 
> > > On Thu, 2013-12-19 at 20:25 +0100, Ingo Molnar wrote:
> > > > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > > 
> > > > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > > 
> > > > So, that's not a valid SOB sequence either, the person sending me a 
> > > > patch should be the last person in the SOB chain 
> > > 
> > > Which is why I had it like that in the original version.
> > 
> > The problem with that order was that the first person should be the 
> > primary author and in the 'From' tag.
> > 
> > A SOB chain is intended to depict the true propagation/route of a 
> > patch, from author, through maintainers who handle and forward it, to 
> > the maintainer who applies it to a Git tree. The patch starts up with 
> > a single SOB (the primary author's) and every 'hop' after that adds a 
> > SOB to the tail of the existing SOB chain.
> 
> Multiple "Signed-off-by:"s are also used when there are
> multiple authors of a single patch.

Abused; which is exactly what we're saying is not correct.

While doing so we've also found people abusing SoB where Reviewed-by was
meant and other 'creative' use.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20  9:11           ` Peter Zijlstra
@ 2013-12-20 10:27             ` Ingo Molnar
  0 siblings, 0 replies; 30+ messages in thread
From: Ingo Molnar @ 2013-12-20 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joe Perches, Davidlohr Bueso, linux-kernel, mingo, dvhart, tglx,
	paulmck, efault, jeffm, torvalds, jason.low2, Waiman.Long,
	tom.vaden, scott.norton, aswin


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 02:35:17PM -0800, Joe Perches wrote:
> > On Thu, 2013-12-19 at 20:42 +0100, Ingo Molnar wrote:
> > > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > 
> > > > On Thu, 2013-12-19 at 20:25 +0100, Ingo Molnar wrote:
> > > > > * Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > > > 
> > > > > > Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> > > > > > Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> > > > > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > > > 
> > > > > So, that's not a valid SOB sequence either, the person 
> > > > > sending me a patch should be the last person in the SOB 
> > > > > chain
> > > > 
> > > > Which is why I had it like that in the original version.
> > > 
> > > The problem with that order was that the first person should be 
> > > the primary author and in the 'From' tag.
> > > 
> > > A SOB chain is intended to depict the true propagation/route of 
> > > a patch, from author, through maintainers who handle and forward 
> > > it, to the maintainer who applies it to a Git tree. The patch 
> > > starts up with a single SOB (the primary author's) and every 
> > > 'hop' after that adds a SOB to the tail of the existing SOB 
> > > chain.
> > 
> > Multiple "Signed-off-by:"s are also used when there are multiple 
> > authors of a single patch.
> 
> Abused; which is exactly what we're saying is not correct.
> 
> While doing so we've also found people abusing SoB where Reviewed-by 
> was meant and other 'creative' use.

Yes, so I used to allow and author such signoff sequences years ago, 
but eventually Linus noticed the weirdness and asked me not to do it.

There's so many ways to credit people for contributions:

  - Reviewed-by, Acked-by tags carry credits

  - The changelog can mention someone specifically

  - I sometimes add the "Originally-by:" special tag, if an original
    patch was turned into a real patch by someone else.

  - If there's truly serious contribution then the code itself can
    carry a credit as well.

  - In a clean workflow authorship is naturally expressed through
    being the author of individual patches, and we want to encourage
    people to use clean workflows.

Having clean SOB sequences is also really useful: for example LWN.net 
is trying to keep track of the route of patches, and having 'creative' 
uses for signoffs adds unnecessary noise.

So there's no strong technical justification to pollute the SOB 
sequence with contibution/courtesy Signed-off-by lines, while there's 
good technological reasons to keep it a clean route.

I don't think Linus is buerocratically checking every single SOB 
sequence but statistically if you do too many messy signoffs someone 
eventually notices.

So I'm enforcing clean signoff sequences for the subsystems I maintain 
and patches I apply to Git - I might miss some, and it's not the end 
of the world since it has no functional side effects. Other 
maintainers might be more permissive.

It might make sense to clarify Documentation/SubmittingPatches 
accordingly.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:14   ` Linus Torvalds
  2013-12-19 23:42     ` Davidlohr Bueso
@ 2013-12-20 19:30     ` Davidlohr Bueso
  2013-12-20 19:54       ` Linus Torvalds
  2013-12-21  1:36     ` Davidlohr Bueso
  2 siblings, 1 reply; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-20 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
> 
> So I still think this can be done without that new counter field, or
> any new atomics.
> 
> hb_waiters_pending() could be something like
> 
>     spin_contended(hb->lock) || !list_empty(&hb->chain)
> 
> which considering where you increment the new count should be
> equivalent to your "!!hb->waiters". The smp_mb_after_atomic_inc()" on
> the spinlock side would become smp_mb_after_spin_lock() instead.

So we'd need the barrier right after the ticket increment (ie: the xadd
TICKET_LOCK_INC in x86), and cannot rely on the barrier after the lock
is taken as we could miss waiters that are just spinning trying to take
it. Is this implicitly guaranteed across all archs?

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20 19:30     ` Davidlohr Bueso
@ 2013-12-20 19:54       ` Linus Torvalds
  2013-12-21  5:57         ` Davidlohr Bueso
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2013-12-20 19:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Fri, Dec 20, 2013 at 11:30 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> So we'd need the barrier right after the ticket increment (ie: the xadd
> TICKET_LOCK_INC in x86), and cannot rely on the barrier after the lock
> is taken as we could miss waiters that are just spinning trying to take
> it. Is this implicitly guaranteed across all archs?

Not necessarily. But I don't see why threads spinning on it would be
special? If you spin on things, you've already updated the head
counter, so even spinners *are* visible, even if they haven't actually
gotten the lock yet.

                 Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-19 23:14   ` Linus Torvalds
  2013-12-19 23:42     ` Davidlohr Bueso
  2013-12-20 19:30     ` Davidlohr Bueso
@ 2013-12-21  1:36     ` Davidlohr Bueso
  2013-12-21  2:34       ` Darren Hart
  2013-12-25  3:13       ` Waiman Long
  2 siblings, 2 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-21  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > - increment the counter at queue_lock() as we always end up calling
> >   queue_me() which adds the element to the list. Upon any error,
> >   queue_unlock() is called for housekeeping, for which we decrement
> >   to mach the increment done in queue_lock().
> >
> > - decrement the counter at __unqueue_me() to reflect when an element is
> >   removed from the queue for wakeup related purposes.
> 
> I still hate this whole separate counter thing. It seems really annoying.
> 
> If re-ordering things didn't work out, then why can't just the counter
> we *already* have in the spinlock itself work as the counter? Your
> counter update logic seems to basically match when you take the
> spinlock anyway.

So the following has passed all testing, just like the atomics variant.
Thoughts?

Thanks,
Davidlohr

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc6850..c8c7ce5 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -73,19 +73,22 @@
  * Basic futex operation and ordering guarantees:
  *
  * The waiter reads the futex value in user space and calls
- * futex_wait(). This function computes the hash bucket and acquires
- * the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
+ * futex_wait(). It computes the hash bucket and acquires the hash
+ * bucket lock. After that it reads the futex user space value again
+ * and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash
  * bucket lock and schedules.
  *
  * The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). It computes the hash bucket and acquires the hash
+ * bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
  *
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In scenarios where wakeups are called and no tasks are blocked on a futex,
+ * taking the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
  *
  * CPU 0                               CPU 1
  * val = *futex;
@@ -106,24 +109,50 @@
  * This would cause the waiter on CPU 0 to wait forever because it
  * missed the transition of the user space value from val to newval
  * and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
  *
  * CPU 0                               CPU 1
  * val = *futex;
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
- *   lock(hash_bucket(futex));
- *   uval = *futex;
- *                                     *futex = newval;
- *                                     sys_futex(WAKE, futex);
- *                                       futex_wake(futex);
- *                                       lock(hash_bucket(futex));
+ *
+ *   waiters++;
+ *   mb(); (A) <-- paired with -.
+ *                              |
+ *   lock(hash_bucket(futex));  |
+ *                              |
+ *   uval = *futex;             |
+ *                              |        *futex = newval;
+ *                              |        sys_futex(WAKE, futex);
+ *                              |          futex_wake(futex);
+ *                              |
+ *                              `------->   mb(); (B)
  *   if (uval == val)
- *      queue();
+ *     queue();
  *     unlock(hash_bucket(futex));
- *     schedule();                       if (!queue_empty())
- *                                         wake_waiters(futex);
- *                                       unlock(hash_bucket(futex));
+ *     schedule();                         if (waiters)
+ *                                           lock(hash_bucket(futex));
+ *                                           wake_waiters(futex);
+ *                                           unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read -- this
+ * is guaranteed by the head counter in the hb spinlock; and where (B)
+ * orders the write to futex and the waiters read.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ *	X = Y = 0
+ *
+ *	w[X]=1		w[Y]=1
+ *	MB		MB
+ *	r[Y]=y		r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
  */
 
 int __read_mostly futex_cmpxchg_enabled;
@@ -211,6 +240,35 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * If the hash bucket is locked then we know the ticket counter
+	 * is non-zero and thus there is at least one waiter in the queue.
+	 */
+	if (spin_is_locked(&hb->lock))
+		return true;
+	smp_rmb(); /* Make sure we check the lock state first */
+	return !plist_head_empty(&hb->chain);
+#else
+	return true;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -245,10 +303,10 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB (B) */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB (B) */
 		break;
 	}
 }
@@ -322,7 +380,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key);  /* implies MB (B) */
 		return 0;
 	}
 
@@ -1052,6 +1110,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);
 
 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1135,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}
 
 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1535,7 +1599,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock);
+	spin_lock(&hb->lock); /* implies MB (A) */
 	return hb;
 }
 



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-21  1:36     ` Davidlohr Bueso
@ 2013-12-21  2:34       ` Darren Hart
  2013-12-21  3:08         ` Davidlohr Bueso
  2013-12-25  3:13       ` Waiman Long
  1 sibling, 1 reply; 30+ messages in thread
From: Darren Hart @ 2013-12-21  2:34 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Fri, 2013-12-20 at 17:36 -0800, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
> > On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> > >
> > > - increment the counter at queue_lock() as we always end up calling
> > >   queue_me() which adds the element to the list. Upon any error,
> > >   queue_unlock() is called for housekeeping, for which we decrement
> > >   to mach the increment done in queue_lock().
> > >
> > > - decrement the counter at __unqueue_me() to reflect when an element is
> > >   removed from the queue for wakeup related purposes.
> > 
> > I still hate this whole separate counter thing. It seems really annoying.
> > 
> > If re-ordering things didn't work out, then why can't just the counter
> > we *already* have in the spinlock itself work as the counter? Your
> > counter update logic seems to basically match when you take the
> > spinlock anyway.
> 
> So the following has passed all testing, just like the atomics variant.
> Thoughts?

Do you have similar performance numbers for comparison? I presume they
were *very* similar to the atomics version - I think you hinted at that
in a previous post?


-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-21  2:34       ` Darren Hart
@ 2013-12-21  3:08         ` Davidlohr Bueso
  0 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-21  3:08 UTC (permalink / raw)
  To: Darren Hart
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Fri, 2013-12-20 at 18:34 -0800, Darren Hart wrote:
> On Fri, 2013-12-20 at 17:36 -0800, Davidlohr Bueso wrote:
> > On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
> > > On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> > > >
> > > > - increment the counter at queue_lock() as we always end up calling
> > > >   queue_me() which adds the element to the list. Upon any error,
> > > >   queue_unlock() is called for housekeeping, for which we decrement
> > > >   to mach the increment done in queue_lock().
> > > >
> > > > - decrement the counter at __unqueue_me() to reflect when an element is
> > > >   removed from the queue for wakeup related purposes.
> > > 
> > > I still hate this whole separate counter thing. It seems really annoying.
> > > 
> > > If re-ordering things didn't work out, then why can't just the counter
> > > we *already* have in the spinlock itself work as the counter? Your
> > > counter update logic seems to basically match when you take the
> > > spinlock anyway.
> > 
> > So the following has passed all testing, just like the atomics variant.
> > Thoughts?
> 
> Do you have similar performance numbers for comparison? I presume they
> were *very* similar to the atomics version - I think you hinted at that
> in a previous post?

Performance isn't a factor, both approaches are pretty much identical.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-20 19:54       ` Linus Torvalds
@ 2013-12-21  5:57         ` Davidlohr Bueso
  0 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2013-12-21  5:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Darren Hart,
	Peter Zijlstra, Thomas Gleixner, Paul McKenney, Mike Galbraith,
	Jeff Mahoney, Jason Low, Waiman Long, Tom Vaden, Norton, Scott J,
	Chandramouleeswaran, Aswin, Ingo Molnar

On Fri, 2013-12-20 at 11:54 -0800, Linus Torvalds wrote:
> On Fri, Dec 20, 2013 at 11:30 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > So we'd need the barrier right after the ticket increment (ie: the xadd
> > TICKET_LOCK_INC in x86), and cannot rely on the barrier after the lock
> > is taken as we could miss waiters that are just spinning trying to take
> > it. Is this implicitly guaranteed across all archs?
> 
> Not necessarily. But I don't see why threads spinning on it would be
> special? If you spin on things, you've already updated the head
> counter, so even spinners *are* visible, even if they haven't actually
> gotten the lock yet.

Fair enough, just making sure we're covering all cases.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-21  1:36     ` Davidlohr Bueso
  2013-12-21  2:34       ` Darren Hart
@ 2013-12-25  3:13       ` Waiman Long
  2014-01-02 11:37         ` Davidlohr Bueso
  1 sibling, 1 reply; 30+ messages in thread
From: Waiman Long @ 2013-12-25  3:13 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Thomas Gleixner, Paul McKenney,
	Mike Galbraith, Jeff Mahoney, Jason Low, Tom Vaden,
	Norton, Scott J, Chandramouleeswaran, Aswin, Ingo Molnar

On 12/20/2013 08:36 PM, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
>> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>>> - increment the counter at queue_lock() as we always end up calling
>>>    queue_me() which adds the element to the list. Upon any error,
>>>    queue_unlock() is called for housekeeping, for which we decrement
>>>    to match the increment done in queue_lock().
>>>
>>> - decrement the counter at __unqueue_me() to reflect when an element is
>>>    removed from the queue for wakeup related purposes.
>> I still hate this whole separate counter thing. It seems really annoying.
>>
>> If re-ordering things didn't work out, then why can't just the counter
>> we *already* have in the spinlock itself work as the counter? Your
>> counter update logic seems to basically match when you take the
>> spinlock anyway.
> So the following has passed all testing, just like the atomics variant.
> Thoughts?
>
> Thanks,
> Davidlohr
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fcc6850..c8c7ce5 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -73,19 +73,22 @@
>    * Basic futex operation and ordering guarantees:
>    *
>    * The waiter reads the futex value in user space and calls
> - * futex_wait(). This function computes the hash bucket and acquires
> - * the hash bucket lock. After that it reads the futex user space value
> - * again and verifies that the data has not changed. If it has not
> - * changed it enqueues itself into the hash bucket, releases the hash
> + * futex_wait(). It computes the hash bucket and acquires the hash
> + * bucket lock. After that it reads the futex user space value again
> + * and verifies that the data has not changed. If it has not changed
> + * it enqueues itself into the hash bucket, releases the hash
>    * bucket lock and schedules.
>    *
>    * The waker side modifies the user space value of the futex and calls
> - * futex_wake(). This functions computes the hash bucket and acquires
> - * the hash bucket lock. Then it looks for waiters on that futex in the
> - * hash bucket and wakes them.
> + * futex_wake(). It computes the hash bucket and acquires the hash
> + * bucket lock. Then it looks for waiters on that futex in the hash
> + * bucket and wakes them.
>    *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> + * taking the hb spinlock can be avoided and the call can simply return. For this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
>    *
>    * CPU 0                               CPU 1
>    * val = *futex;
> @@ -106,24 +109,50 @@
>    * This would cause the waiter on CPU 0 to wait forever because it
>    * missed the transition of the user space value from val to newval
>    * and the waker did not find the waiter in the hash bucket queue.
> - * The spinlock serializes that:
> + *
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
>    *
>    * CPU 0                               CPU 1
>    * val = *futex;
>    * sys_futex(WAIT, futex, val);
>    *   futex_wait(futex, val);
> - *   lock(hash_bucket(futex));
> - *   uval = *futex;
> - *                                     *futex = newval;
> - *                                     sys_futex(WAKE, futex);
> - *                                       futex_wake(futex);
> - *                                       lock(hash_bucket(futex));
> + *
> + *   waiters++;
> + *   mb(); (A)<-- paired with -.
> + *                              |
> + *   lock(hash_bucket(futex));  |
> + *                              |
> + *   uval = *futex;             |
> + *                              |        *futex = newval;
> + *                              |        sys_futex(WAKE, futex);
> + *                              |          futex_wake(futex);
> + *                              |
> + *                              `------->    mb(); (B)

Checking the state of the spinlock counter isn't the same as 
incrementing a waiter count. So your pseudo code here is misleading. See 
further explanation below.

>    *   if (uval == val)
> - *      queue();
> + *     queue();
>    *     unlock(hash_bucket(futex));
> - *     schedule();                       if (!queue_empty())
> - *                                         wake_waiters(futex);
> - *                                       unlock(hash_bucket(futex));
> + *     schedule();                         if (waiters)
> + *                                           lock(hash_bucket(futex));
> + *                                           wake_waiters(futex);
> + *                                           unlock(hash_bucket(futex));
> + *
> + * Where (A) orders the waiters increment and the futex value read -- this
> + * is guaranteed by the head counter in the hb spinlock; and where (B)
> + * orders the write to futex and the waiters read.
> + *
> + * This yields the following case (where X:=waiters, Y:=futex):
> + *
> + *	X = Y = 0
> + *
> + *	w[X]=1		w[Y]=1
> + *	MB		MB
> + *	r[Y]=y		r[X]=x
> + *
> + * Which guarantees that x==0 && y==0 is impossible; which translates back into
> + * the guarantee that we cannot both miss the futex variable change and the
> + * enqueue.
>    */
>
>   int __read_mostly futex_cmpxchg_enabled;
> @@ -211,6 +240,35 @@ static unsigned long __read_mostly futex_hashsize;
>
>   static struct futex_hash_bucket *futex_queues;
>
> +static inline void futex_get_mm(union futex_key *key)
> +{
> +	atomic_inc(&key->private.mm->mm_count);
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Ensure futex_get_mm() implies a full barrier such that
> +	 * get_futex_key() implies a full barrier. This is relied upon
> +	 * as full barrier (B), see the ordering comment above.
> +	 */
> +	smp_mb__after_atomic_inc();
> +#endif
> +}
> +
> +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If the hash bucket is locked then we know the ticket counter
> +	 * is non-zero and thus there is at least one waiter in the queue.
> +	 */
> +	if (spin_is_locked(&hb->lock))
> +		return true;
> +	smp_rmb(); /* Make sure we check the lock state first */
> +	return !plist_head_empty(&hb->chain);
> +#else
> +	return true;
> +#endif
> +}

The ticket spinlock counter is a cyclic counter that periodically wraps 
through 0, so the zero-ness of the counter has no relation to whether 
the lock is held. Your comment above is not correct. What 
spin_is_locked() can tell you is whether one or more tasks are trying to 
get into the critical section, which can be a waiter (most likely) or a 
waker. Coupled with checking whether the list is empty, that could be a 
cheaper alternative to using a separate atomic counter, but it is also 
slightly less reliable and has a higher chance of false positives.
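
[Editor's note: the w[X]/r[Y] diagram in the quoted comment is the
classic store-buffering litmus test. A user-space C11 sketch of the two
sides (hypothetical names; seq_cst fences stand in for the kernel's
mb() (A) and (B)):]

```c
#include <pthread.h>
#include <stdatomic.h>

/* X := waiters, Y := futex word, as in the quoted comment. */
static atomic_int X, Y;
static int x_seen, y_seen;

static void *waiter_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&X, 1, memory_order_relaxed);	 /* waiters++ */
	atomic_thread_fence(memory_order_seq_cst);		 /* mb() (A) */
	y_seen = atomic_load_explicit(&Y, memory_order_relaxed); /* uval = *futex */
	return NULL;
}

static void *waker_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&Y, 1, memory_order_relaxed);	 /* *futex = newval */
	atomic_thread_fence(memory_order_seq_cst);		 /* mb() (B) */
	x_seen = atomic_load_explicit(&X, memory_order_relaxed); /* if (waiters) */
	return NULL;
}

/* Returns nonzero iff at least one side saw the other's write. */
static int run_once(void)
{
	pthread_t a, b;

	atomic_store(&X, 0);
	atomic_store(&Y, 0);
	pthread_create(&a, NULL, waiter_side, NULL);
	pthread_create(&b, NULL, waker_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return x_seen || y_seen;	/* x==0 && y==0 is forbidden */
}
```

[With both fences in place, x_seen == 0 && y_seen == 0 cannot be
observed; dropping either fence permits exactly the lost-wakeup
reordering on architectures with store buffers, including x86.]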

-Longman


* Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
  2013-12-25  3:13       ` Waiman Long
@ 2014-01-02 11:37         ` Davidlohr Bueso
  0 siblings, 0 replies; 30+ messages in thread
From: Davidlohr Bueso @ 2014-01-02 11:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Linux Kernel Mailing List, Ingo Molnar,
	Darren Hart, Peter Zijlstra, Thomas Gleixner, Paul McKenney,
	Mike Galbraith, Jeff Mahoney, Jason Low, Tom Vaden,
	Norton, Scott J, Chandramouleeswaran, Aswin, Ingo Molnar

On Tue, 2013-12-24 at 22:13 -0500, Waiman Long wrote:
> On 12/20/2013 08:36 PM, Davidlohr Bueso wrote:
> > On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
> >> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >>> - increment the counter at queue_lock() as we always end up calling
> >>>    queue_me() which adds the element to the list. Upon any error,
> >>>    queue_unlock() is called for housekeeping, for which we decrement
> >>>    to match the increment done in queue_lock().
> >>>
> >>> - decrement the counter at __unqueue_me() to reflect when an element is
> >>>    removed from the queue for wakeup related purposes.
> >> I still hate this whole separate counter thing. It seems really annoying.
> >>
> >> If re-ordering things didn't work out, then why can't just the counter
> >> we *already* have in the spinlock itself work as the counter? Your
> >> counter update logic seems to basically match when you take the
> >> spinlock anyway.
> > So the following has passed all testing, just like the atomics variant.
> > Thoughts?
> >
> > Thanks,
> > Davidlohr
> >
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index fcc6850..c8c7ce5 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -73,19 +73,22 @@
> >    * Basic futex operation and ordering guarantees:
> >    *
> >    * The waiter reads the futex value in user space and calls
> > - * futex_wait(). This function computes the hash bucket and acquires
> > - * the hash bucket lock. After that it reads the futex user space value
> > - * again and verifies that the data has not changed. If it has not
> > - * changed it enqueues itself into the hash bucket, releases the hash
> > + * futex_wait(). It computes the hash bucket and acquires the hash
> > + * bucket lock. After that it reads the futex user space value again
> > + * and verifies that the data has not changed. If it has not changed
> > + * it enqueues itself into the hash bucket, releases the hash
> >    * bucket lock and schedules.
> >    *
> >    * The waker side modifies the user space value of the futex and calls
> > - * futex_wake(). This functions computes the hash bucket and acquires
> > - * the hash bucket lock. Then it looks for waiters on that futex in the
> > - * hash bucket and wakes them.
> > + * futex_wake(). It computes the hash bucket and acquires the hash
> > + * bucket lock. Then it looks for waiters on that futex in the hash
> > + * bucket and wakes them.
> >    *
> > - * Note that the spin_lock serializes waiters and wakers, so that the
> > - * following scenario is avoided:
> > + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> > + * taking the hb spinlock can be avoided and the call can simply return. For this
> > + * optimization to work, ordering guarantees must exist so that the waiter
> > + * being added to the list is acknowledged when the list is concurrently being
> > + * checked by the waker, avoiding scenarios like the following:
> >    *
> >    * CPU 0                               CPU 1
> >    * val = *futex;
> > @@ -106,24 +109,50 @@
> >    * This would cause the waiter on CPU 0 to wait forever because it
> >    * missed the transition of the user space value from val to newval
> >    * and the waker did not find the waiter in the hash bucket queue.
> > - * The spinlock serializes that:
> > + *
> > + * The correct serialization ensures that a waiter either observes
> > + * the changed user space value before blocking or is woken by a
> > + * concurrent waker:
> >    *
> >    * CPU 0                               CPU 1
> >    * val = *futex;
> >    * sys_futex(WAIT, futex, val);
> >    *   futex_wait(futex, val);
> > - *   lock(hash_bucket(futex));
> > - *   uval = *futex;
> > - *                                     *futex = newval;
> > - *                                     sys_futex(WAKE, futex);
> > - *                                       futex_wake(futex);
> > - *                                       lock(hash_bucket(futex));
> > + *
> > + *   waiters++;
> > + *   mb(); (A)<-- paired with -.
> > + *                              |
> > + *   lock(hash_bucket(futex));  |
> > + *                              |
> > + *   uval = *futex;             |
> > + *                              |        *futex = newval;
> > + *                              |        sys_futex(WAKE, futex);
> > + *                              |          futex_wake(futex);
> > + *                              |
> > + *                              `------->    mb(); (B)
> 
> Checking the state of the spinlock counter isn't the same as 
> incrementing a waiter count. So your pseudo code here is misleading. See 
> further explanation below.

Since the spinlock check is mostly done for corner cases where the plist
check alone isn't enough, I'd like to leave this comment as is and fix
the comment for hb_waiters_pending().

> 
> >    *   if (uval == val)
> > - *      queue();
> > + *     queue();
> >    *     unlock(hash_bucket(futex));
> > - *     schedule();                       if (!queue_empty())
> > - *                                         wake_waiters(futex);
> > - *                                       unlock(hash_bucket(futex));
> > + *     schedule();                         if (waiters)
> > + *                                           lock(hash_bucket(futex));
> > + *                                           wake_waiters(futex);
> > + *                                           unlock(hash_bucket(futex));
> > + *
> > + * Where (A) orders the waiters increment and the futex value read -- this
> > + * is guaranteed by the head counter in the hb spinlock; and where (B)
> > + * orders the write to futex and the waiters read.
> > + *
> > + * This yields the following case (where X:=waiters, Y:=futex):
> > + *
> > + *	X = Y = 0
> > + *
> > + *	w[X]=1		w[Y]=1
> > + *	MB		MB
> > + *	r[Y]=y		r[X]=x
> > + *
> > + * Which guarantees that x==0 && y==0 is impossible; which translates back into
> > + * the guarantee that we cannot both miss the futex variable change and the
> > + * enqueue.
> >    */
> >
> >   int __read_mostly futex_cmpxchg_enabled;
> > @@ -211,6 +240,35 @@ static unsigned long __read_mostly futex_hashsize;
> >
> >   static struct futex_hash_bucket *futex_queues;
> >
> > +static inline void futex_get_mm(union futex_key *key)
> > +{
> > +	atomic_inc(&key->private.mm->mm_count);
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * Ensure futex_get_mm() implies a full barrier such that
> > +	 * get_futex_key() implies a full barrier. This is relied upon
> > +	 * as full barrier (B), see the ordering comment above.
> > +	 */
> > +	smp_mb__after_atomic_inc();
> > +#endif
> > +}
> > +
> > +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> > +{
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If the hash bucket is locked then we know the ticket counter
> > +	 * is non-zero and thus there is at least one waiter in the queue.
> > +	 */
> > +	if (spin_is_locked(&hb->lock))
> > +		return true;
> > +	smp_rmb(); /* Make sure we check the lock state first */
> > +	return !plist_head_empty(&hb->chain);
> > +#else
> > +	return true;
> > +#endif
> > +}
> 
> The ticket spinlock counter is a cyclic counter that periodically wraps 
> through 0, so the zero-ness of the counter has no relation to whether 
> the lock is held. Your comment above is not correct. What 
> spin_is_locked() can tell you is whether one or more tasks are trying to 
> get into the critical section, which can be a waiter (most likely) or a 
> waker. Coupled with checking whether the list is empty, that could be a 
> cheaper alternative to using a separate atomic counter, but it is also 
> slightly less reliable and has a higher chance of false positives.

Yeah, that was more or less what I was trying to say, but obviously
wasn't clear enough. I'll rephrase and send a proper new version for
this patchset.
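
[Editor's note: the separate-counter alternative discussed earlier in
the thread can be sketched in user space as follows. The names mirror
the patch's hb_waiters_inc()/hb_waiters_dec(); this is an illustration
under C11 atomics, not the kernel code:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* User-space sketch of the separate-counter alternative; the names
 * mirror the patch under discussion but this is only an illustration. */
struct hash_bucket {
	atomic_int waiters;	/* tasks queued (or about to queue) here */
};

static void hb_waiters_inc(struct hash_bucket *hb)
{
	/* A seq_cst RMW supplies the full barrier the kernel gets from
	 * smp_mb__after_atomic_inc() -- barrier (A) in the comment. */
	atomic_fetch_add(&hb->waiters, 1);
}

static void hb_waiters_dec(struct hash_bucket *hb)
{
	atomic_fetch_sub(&hb->waiters, 1);
}

/* Waker side: safe to skip the bucket lock when this returns false,
 * provided the waker issued its own full barrier (B) beforehand. */
static bool hb_waiters_pending(struct hash_bucket *hb)
{
	return atomic_load(&hb->waiters) != 0;
}
```

[Unlike the spin_is_locked() check, this counter tracks waiters exactly,
so it avoids the ticket-counter ambiguity raised above at the cost of an
extra atomic on the waiter path.]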

Thanks,
Davidlohr



end of thread, other threads:[~2014-01-02 11:37 UTC | newest]

Thread overview: 30+ messages
2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 1/4] futex: Misc cleanups Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 2/4] futex: Larger hash table Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 3/4] futex: Document ordering guarantees Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-19 19:25   ` Ingo Molnar
2013-12-19 19:29     ` Davidlohr Bueso
2013-12-19 19:42       ` Ingo Molnar
2013-12-19 22:35         ` Joe Perches
2013-12-20  9:11           ` Peter Zijlstra
2013-12-20 10:27             ` Ingo Molnar
2013-12-19 23:14   ` Linus Torvalds
2013-12-19 23:42     ` Davidlohr Bueso
2013-12-19 23:53       ` Linus Torvalds
2013-12-20  0:04         ` Linus Torvalds
2013-12-20  0:26           ` Darren Hart
2013-12-20  0:12         ` Linus Torvalds
2013-12-20  2:22         ` Davidlohr Bueso
2013-12-20  3:13           ` Linus Torvalds
2013-12-20  6:13             ` Davidlohr Bueso
2013-12-20  8:54             ` Peter Zijlstra
2013-12-20 19:30     ` Davidlohr Bueso
2013-12-20 19:54       ` Linus Torvalds
2013-12-21  5:57         ` Davidlohr Bueso
2013-12-21  1:36     ` Davidlohr Bueso
2013-12-21  2:34       ` Darren Hart
2013-12-21  3:08         ` Davidlohr Bueso
2013-12-25  3:13       ` Waiman Long
2014-01-02 11:37         ` Davidlohr Bueso
  -- strict thread matches above, loose matches on Subject: below --
2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-10 18:10   ` [PATCH v3 " Davidlohr Bueso
