From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Darren Hart <darren@dvhart.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
Michael Kerrisk <mtk.manpages@googlemail.com>,
Davidlohr Bueso <dave@stgolabs.net>, Chris Mason <clm@fb.com>,
"Carlos O'Donell" <carlos@redhat.com>,
Torvald Riegel <triegel@redhat.com>,
Eric Dumazet <edumazet@google.com>
Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes
Date: Sun, 03 Apr 2016 01:48:07 +0200 [thread overview]
Message-ID: <87zitbmv2g.fsf@rasmusvillemoes.dk> (raw)
In-Reply-To: <20160402110035.753145539@linutronix.de> (Thomas Gleixner's message of "Sat, 02 Apr 2016 11:09:18 -0000")
On Sat, Apr 02 2016, Thomas Gleixner <tglx@linutronix.de> wrote:
> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.
Hi,
A few minor comments inline below, and a question about the design:
How is an application supposed to handle it when the kernel fails to
achieve the no collision-goal? With any reasonable upper bound on the
size of the local hash table (which of course has to be there, whether
sysctl'ed or not), and regardless of the hashing scheme used, it
seems inevitable that someone is going to get -ENOSPC when trying to
attach. Moreover, since different threads can attach to different sets
of futexes, one thread may succesfully attach to a futex, while another
fails - the second thread is then permanently prevented from operating
on that futex (?).
Why not use some sort of tree instead? Or fall back to a traditional
chained hash table once we're no longer allowed to increase the table
size? Of course these have worse lookup performance, and maybe failing
the attach in the rare case is better than penalizing the common case,
but it would be nice to have some mention of this in the change log.
Alternatively [this is not really thought through], maybe one could move
the decision and the complexity to userspace: On succesful FUTEX_ATTACH,
return an index into a small per-task array of struct futex_state*. On
subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
this index somehow (either instead of uaddr, in which case the kernel
array would have to include this in addition to the futex_state pointer,
or by making uaddr actually point to a struct { int *futex_addr; int
attach_idx; }, or...) Then each thread would have to maintain a (futex
address => index) mapping, but that's more or less what the kernel
otherwise has to do.
> +
> +static unsigned int hash_prime(unsigned int size)
> +{
> + switch(size) {
> + case 16:
> + default: return 13;
> + case 32: return 31;
> + case 64: return 61;
> + case 128: return 127;
> + case 256: return 251;
> + case 512: return 509;
> + case 1024: return 1021;
> + case 2048: return 2039;
> + case 4096: return 4093;
> + }
> +}
> +
There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
so that anyone updating those would also look here, so we don't end up
using 13 out of 8192 slots... BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
!= {16,4096}) should do it.
> +static struct futex_cache *futex_alloc_cache(int cache_size)
> +{
> + struct futex_cache *tc;
> + size_t size;
> +
> + /* Allocate a new task cache */
> + size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
> + tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
> + if (tc) {
> + tc->hash_prime = hash_prime(cache_size);
> + tc->cache_size = cache_size;
> + }
> + return tc;
> +}
> +
> +static int
> +futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
> +{
> + unsigned long *newmap = tcnew->cache_map;
> + unsigned int prime = tcnew->hash_prime;
> + unsigned long *map = tc->cache_map;
> + unsigned int size = tc->cache_size;
> + unsigned int slot, newslot;
> +
> + slot = find_first_bit(map, size);
> + for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> + newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> + /*
> + * Paranoia. Rehashing to a larger cache should not result in
> + * collisions which did not exist in the small one.
> + */
This doesn't sound right when you're doing mod prime hashing; two
numbers can easily have the same remainder mod 31 while having distinct
remainders mod 13.
> + if (__test_and_set_bit(newslot, newmap))
> + return -ENOSPC;
> + /* Copy uaddr and futex state pointer */
> + tcnew->slots[newslot] = tc->slots[slot];
> + }
> + return 0;
> +}
> +
> +/**
> + * futex_get_task_cache_slot - Get a slot in the tasks local cache
> + *
> + * If the cache is not yet available it's allocated. If the existing cache is
> + * too small the cache is extended.
> + *
> + * Returns a valid slot or an error code
> + */
> +static int futex_get_task_cache_slot(u32 __user *uaddr)
> +{
> + struct futex_cache *tcnew, *tc = current->futex_cache;
> + unsigned int cache_size;
> + int slot;
> +
> + /* First caller allocates the initial cache */
> + if (!tc) {
> + tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
> + if (!tc)
> + return -ENOMEM;
> + current->futex_cache = tc;
> + slot = hash_local_futex(uaddr, tc->hash_prime);
> + return slot;
> + }
> +
> + slot = hash_local_futex(uaddr, tc->hash_prime);
> +
> + /* Check whether the slot is populated already */
> + if (!test_bit(slot, tc->cache_map))
> + return slot;
> +
> + /* Was this futex attached already ? */
> + if (tc->slots[slot].uaddr == uaddr)
> + return -EEXIST;
> +
> + cache_size = tc->cache_size;
> +retry:
> + /* Task has reached max cache size? */
> + if (cache_size >= TASK_CACHE_MAX_SIZE)
> + return -ENOSPC;
> +
> + cache_size *= 2;
> + tcnew = futex_alloc_cache(cache_size);
> + if (!tcnew)
> + return -ENOMEM;
> +
> + /*
> + * If the rehashing fails or the slot for uaddr is busy after
> + * rehashing, try with a larger cache.
> + */
> + slot = hash_local_futex(uaddr, tcnew->hash_prime);
> +
> + if (futex_rehash_task_cache(tc, tcnew) ||
> + test_bit(slot, tcnew->cache_map)) {
> + kfree(tcnew);
> + goto retry;
> + }
> +
> + /* Populate the new cache and return the slot number */
> + current->futex_cache = tcnew;
> + return slot;
> +}
I may be misreading it, but this seems to leak the old ->futex_cache?
Rasmus
next prev parent reply other threads:[~2016-04-02 23:48 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-02 11:09 [RFC patch 0/7] futex: Add support for attached futexes Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 1/7] futex: Provide helpers for hash bucket add/remove Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 2/7] futex: Add some more function commentry Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 3/7] futex: Make key init a helper function Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 4/7] futex: Add support for attached futexes Thomas Gleixner
2016-04-02 16:26 ` Peter Zijlstra
2016-04-02 18:01 ` Thomas Gleixner
2016-04-02 16:29 ` Peter Zijlstra
2016-04-03 9:59 ` Thomas Gleixner
2016-04-02 18:19 ` Andy Lutomirski
2016-04-03 9:57 ` Thomas Gleixner
2016-04-03 13:18 ` Andy Lutomirski
2016-04-03 15:56 ` Thomas Gleixner
2016-04-03 16:11 ` Andy Lutomirski
2016-04-02 23:48 ` Rasmus Villemoes [this message]
2016-04-03 10:05 ` Thomas Gleixner
2016-04-03 11:16 ` Ingo Molnar
2016-04-03 11:30 ` Linus Torvalds
2016-04-05 7:44 ` Torvald Riegel
2016-04-05 15:58 ` Carlos O'Donell
2016-04-02 11:09 ` [RFC patch 5/7] perf/bench/futex-hash: Support " Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 6/7] futex.2: Document attached mode Thomas Gleixner
2016-04-02 11:09 ` [RFC patch 7/7] [PATCH] glibc: nptl: Add support for attached pthread_mutexes Thomas Gleixner
2016-04-02 16:30 ` Peter Zijlstra
2016-04-02 16:32 ` Peter Zijlstra
2016-04-03 10:08 ` Thomas Gleixner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zitbmv2g.fsf@rasmusvillemoes.dk \
--to=linux@rasmusvillemoes.dk \
--cc=bigeasy@linutronix.de \
--cc=carlos@redhat.com \
--cc=clm@fb.com \
--cc=darren@dvhart.com \
--cc=dave@stgolabs.net \
--cc=edumazet@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=mtk.manpages@googlemail.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=triegel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox