From: Davidlohr Bueso <davidlohr@hp.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@kernel.org, dvhart@linux.intel.com, peterz@infradead.org,
	tglx@linutronix.de, efault@gmx.de, jeffm@suse.com,
	torvalds@linux-foundation.org, scott.norton@hp.com,
	tom.vaden@hp.com, aswin@hp.com, Waiman.Long@hp.com,
	jason.low2@hp.com, davidlohr@hp.com
Subject: [PATCH 3/5] futex: Larger hash table
Date: Fri, 22 Nov 2013 16:56:35 -0800
Message-ID: <1385168197-8612-4-git-send-email-davidlohr@hp.com>
In-Reply-To: <1385168197-8612-1-git-send-email-davidlohr@hp.com>

Currently, the futex global hash table suffers from its fixed, smallish
(by today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems that use many futexes are prone to a high number
of collisions, where these futexes hash to the same bucket and cause
extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
reality when multiple hb->locks reside on the same cacheline and
different futexes hash to adjacent buckets.
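
To make the collision argument concrete, here is a minimal userspace sketch
that counts how many synthetic keys land in an already occupied bucket for a
given table size. Everything in it is an assumption for illustration: mix32()
is a stand-in mixer (the MurmurHash3 finalizer), not the kernel's jhash2(),
and count_collisions() is a hypothetical helper, not part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in 32-bit mixer (MurmurHash3 finalizer); NOT the kernel's jhash2(). */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

/*
 * Hash nkeys synthetic keys (0..nkeys-1) into a table of nbuckets entries,
 * where nbuckets is a power of two, and return how many keys landed in a
 * bucket that was already occupied.
 */
static size_t count_collisions(size_t nkeys, size_t nbuckets)
{
	unsigned char *used;
	size_t i, collisions = 0;

	assert(nbuckets && (nbuckets & (nbuckets - 1)) == 0);
	used = calloc(nbuckets, 1);
	if (!used)
		abort();

	for (i = 0; i < nkeys; i++) {
		size_t b = mix32((uint32_t)i) & (nbuckets - 1);

		if (used[b])
			collisions++;
		used[b] = 1;
	}
	free(used);
	return collisions;
}
```

With 4096 keys, count_collisions(4096, 256) cannot be lower than 3840 by the
pigeonhole principle, no matter how good the hash is, whereas a larger table
such as count_collisions(4096, 32768) leaves room for far fewer.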

This patch keeps the current static size of 16 entries for small systems
(CONFIG_BASE_SMALL); otherwise it uses 256 * ncpus, rounded up to the next
power of 2. Note that this number of CPUs accounts for all CPUs that can
ever be available in the system, taking into consideration things like
hotplugging. While making the hash table larger imposes extra overhead at
bootup, this is a one-time cost and does not overshadow the benefits
of this patch.
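
The sizing rule above can be sketched in userspace as follows. This is only
an illustration under stated assumptions: roundup_pow2() is a stand-in for
the kernel's roundup_pow_of_two(), and futex_table_size() and bucket_index()
are hypothetical names that do not exist in kernel/futex.c.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for roundup_pow_of_two(); n must be non-zero. */
static uint64_t roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	assert(n != 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* 256 buckets per possible CPU, rounded up to the next power of two. */
static uint64_t futex_table_size(unsigned int possible_cpus)
{
	return roundup_pow2(256ull * possible_cpus);
}

/* With a power-of-two size, reducing the hash to an index is a single mask. */
static uint64_t bucket_index(uint32_t hash, uint64_t table_size)
{
	return hash & (table_size - 1);
}
```

For example, 80 possible CPUs give 256 * 80 = 20480, which rounds up to
32768 buckets. The power-of-two requirement is what lets hash_futex() use
"hash & (futex_hashsize - 1)" instead of a modulo.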

Also, similar to other core kernel components (pid, dcache, tcp), by using
alloc_large_system_hash() we benefit from its NUMA awareness, and thus the
table is distributed among the nodes instead of residing on a single one. We
pass 256 entries as this function's lower limit, so that even in the worst
case we still end up with the current table size.

For a custom microbenchmark that pounds on the uaddr hashing, making the wait
path fail at futex_wait_setup() with -EWOULDBLOCK for large numbers of futexes,
we can see the following benefits on an 80-core, 8-socket, 1TB server:

+---------+----------------------------------+------------------------------------------+----------+
| threads | baseline (ops/sec) [insns/cycle] | large hash table (ops/sec) [insns/cycle] | increase |
+---------+----------------------------------+------------------------------------------+----------+
|     512 | 34429    [0.07]                  | 255274    [0.48]                         | +641.45% |
|     256 | 65452    [0.07]                  | 443563    [0.41]                         | +577.69% |
|     128 | 125111   [0.07]                  | 742613    [0.33]                         | +493.56% |
|      80 | 203642   [0.09]                  | 1028147   [0.29]                         | +404.87% |
|      64 | 262944   [0.09]                  | 997300    [0.28]                         | +279.28% |
|      32 | 642390   [0.24]                  | 965996    [0.27]                         | +50.37%  |
+---------+----------------------------------+------------------------------------------+----------+
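
For reference, the "increase" column is the relative throughput gain over
the baseline. A minimal sketch of that arithmetic, where pct_increase() is a
hypothetical helper and not part of the benchmark itself:

```c
#include <assert.h>

/* Relative gain of 'patched' over 'baseline', in percent. */
static double pct_increase(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

For the 512-thread row, pct_increase(34429, 255274) comes out at roughly
641.45, matching the table.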

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: Tom Vaden <tom.vaden@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 kernel/futex.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0768c68..5fa9eb0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
 #include <linux/sched/rt.h>
 #include <linux/hugetlb.h>
 #include <linux/freezer.h>
+#include <linux/bootmem.h>
 
 #include <asm/futex.h>
 
@@ -70,7 +71,11 @@
 
 int __read_mostly futex_cmpxchg_enabled;
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
+#if CONFIG_BASE_SMALL
+static unsigned long futex_hashsize = 16;
+#else
+static unsigned long futex_hashsize;
+#endif
 
 /*
  * Futex flags used to encode options to functions and preserve them across
@@ -151,7 +156,11 @@ struct futex_hash_bucket {
 	struct plist_head chain;
 };
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if CONFIG_BASE_SMALL
+static struct futex_hash_bucket futex_queues[16]; /* must match futex_hashsize */
+#else
+static struct futex_hash_bucket *futex_queues;
+#endif
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +170,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	return &futex_queues[hash & (futex_hashsize - 1)];
 }
 
 /*
@@ -2715,7 +2724,14 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 static int __init futex_init(void)
 {
 	u32 curval;
-	int i;
+	unsigned long i;
+
+#if !CONFIG_BASE_SMALL
+	futex_hashsize = roundup_pow_of_two((256 * num_possible_cpus()));
+	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+					       futex_hashsize, 0, 0, NULL, NULL,
+					       256, futex_hashsize);
+#endif
 
 	/*
 	 * This will fail and we want it. Some arch implementations do
@@ -2730,7 +2746,7 @@ static int __init futex_init(void)
 	if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
 		futex_cmpxchg_enabled = 1;
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+	for (i = 0; i < futex_hashsize; i++) {
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
-- 
1.8.1.4
