From: Waiman Long <waiman.long@hp.com>
To: Davidlohr Bueso <davidlohr@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Ingo Molnar <mingo@redhat.com>,
Darren Hart <dvhart@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Paul McKenney <paulmck@linux.vnet.ibm.com>,
Mike Galbraith <efault@gmx.de>, Jeff Mahoney <jeffm@suse.com>,
Jason Low <jason.low2@hp.com>, Tom Vaden <tom.vaden@hp.com>,
"Norton, Scott J" <scott.norton@hp.com>,
"Chandramouleeswaran, Aswin" <aswin@hp.com>,
Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
Date: Tue, 24 Dec 2013 22:13:16 -0500 [thread overview]
Message-ID: <52BA4D4C.6070709@hp.com> (raw)
In-Reply-To: <1387589770.3119.2.camel@buesod1.americas.hpqcorp.net>
On 12/20/2013 08:36 PM, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
>> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso<davidlohr@hp.com> wrote:
>>> - increment the counter at queue_lock() as we always end up calling
>>> queue_me() which adds the element to the list. Upon any error,
>>> queue_unlock() is called for housekeeping, for which we decrement
>>> to mach the increment done in queue_lock().
>>>
>>> - decrement the counter at __unqueue_me() to reflect when an element is
>>> removed from the queue for wakeup related purposes.
>> I still hate this whole separate counter thing. It seems really annoying.
>>
>> If re-ordering things didn't work out, then why can't just the counter
>> we *already* have in the spinlock itself work as the counter? Your
>> counter update logic seems to basically match when you take the
>> spinlock anyway.
> So the following has passed all testing, just like the atomics variant.
> Thoughts?
>
> Thanks,
> Davidlohr
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fcc6850..c8c7ce5 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -73,19 +73,22 @@
> * Basic futex operation and ordering guarantees:
> *
> * The waiter reads the futex value in user space and calls
> - * futex_wait(). This function computes the hash bucket and acquires
> - * the hash bucket lock. After that it reads the futex user space value
> - * again and verifies that the data has not changed. If it has not
> - * changed it enqueues itself into the hash bucket, releases the hash
> + * futex_wait(). It computes the hash bucket and acquires the hash
> + * bucket lock. After that it reads the futex user space value again
> + * and verifies that the data has not changed. If it has not changed
> + * it enqueues itself into the hash bucket, releases the hash
> * bucket lock and schedules.
> *
> * The waker side modifies the user space value of the futex and calls
> - * futex_wake(). This functions computes the hash bucket and acquires
> - * the hash bucket lock. Then it looks for waiters on that futex in the
> - * hash bucket and wakes them.
> + * futex_wake(). It computes the hash bucket and acquires the hash
> + * bucket lock. Then it looks for waiters on that futex in the hash
> + * bucket and wakes them.
> *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> + * taking the hb spinlock can be avoided and simply return. In order for this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
> *
> * CPU 0 CPU 1
> * val = *futex;
> @@ -106,24 +109,50 @@
> * This would cause the waiter on CPU 0 to wait forever because it
> * missed the transition of the user space value from val to newval
> * and the waker did not find the waiter in the hash bucket queue.
> - * The spinlock serializes that:
> + *
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
> *
> * CPU 0 CPU 1
> * val = *futex;
> * sys_futex(WAIT, futex, val);
> * futex_wait(futex, val);
> - * lock(hash_bucket(futex));
> - * uval = *futex;
> - * *futex = newval;
> - * sys_futex(WAKE, futex);
> - * futex_wake(futex);
> - * lock(hash_bucket(futex));
> + *
> + * waiters++;
> + * mb(); (A)<-- paired with -.
> + * |
> + * lock(hash_bucket(futex)); |
> + * |
> + * uval = *futex; |
> + * | *futex = newval;
> + * | sys_futex(WAKE, futex);
> + * | futex_wake(futex);
> + * |
> + * `-------> mb(); (B)
Checking the state of the spinlock counter isn't the same as
incrementing a waiter count. So your pseudo code here is misleading. See
further explanation below.
> * if (uval == val)
> - * queue();
> + * queue();
> * unlock(hash_bucket(futex));
> - * schedule(); if (!queue_empty())
> - * wake_waiters(futex);
> - * unlock(hash_bucket(futex));
> + * schedule(); if (waiters)
> + * lock(hash_bucket(futex));
> + * wake_waiters(futex);
> + * unlock(hash_bucket(futex));
> + *
> + * Where (A) orders the waiters increment and the futex value read -- this
> + * is guaranteed by the head counter in the hb spinlock; and where (B)
> + * orders the write to futex and the waiters read.
> + *
> + * This yields the following case (where X:=waiters, Y:=futex):
> + *
> + * X = Y = 0
> + *
> + * w[X]=1 w[Y]=1
> + * MB MB
> + * r[Y]=y r[X]=x
> + *
> + * Which guarantees that x==0&& y==0 is impossible; which translates back into
> + * the guarantee that we cannot both miss the futex variable change and the
> + * enqueue.
> */
>
> int __read_mostly futex_cmpxchg_enabled;
> @@ -211,6 +240,35 @@ static unsigned long __read_mostly futex_hashsize;
>
> static struct futex_hash_bucket *futex_queues;
>
> +static inline void futex_get_mm(union futex_key *key)
> +{
> + atomic_inc(&key->private.mm->mm_count);
> +#ifdef CONFIG_SMP
> + /*
> + * Ensure futex_get_mm() implies a full barrier such that
> + * get_futex_key() implies a full barrier. This is relied upon
> + * as full barrier (B), see the ordering comment above.
> + */
> + smp_mb__after_atomic_inc();
> +#endif
> +}
> +
> +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> +{
> +#ifdef CONFIG_SMP
> + /*
> + * If the hash bucket is locked then we know the ticket counter
> + * is non-zero and thus there is at least one waiter in the queue.
> + */
> + if (spin_is_locked(&hb->lock))
> + return true;
> + smp_rmb(); /* Make sure we check the lock state first */
> + return !plist_head_empty(&hb->chain);
> +#else
> + return true;
> +#endif
> +}
The ticket spinlock counter is a cyclic counter that can cycle through 0
periodically. So the zero-ness of the counter has no relation to whether
it is locked or not. Your comment above is not correct. What
spin_is_locked() can tell you is whether one or more tasks are trying to
get into the critical section which can be a waiter (most likely) or a
waker. Coupled with checking if the list is empty, that could be a
cheaper alternative to using a separate atomic counter, but it is also
slightly less reliable and has a higher chance of false positive.
-Longman
next prev parent reply other threads:[~2013-12-25 3:13 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 1/4] futex: Misc cleanups Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 2/4] futex: Larger hash table Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 3/4] futex: Document ordering guarantees Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-19 19:25 ` Ingo Molnar
2013-12-19 19:29 ` Davidlohr Bueso
2013-12-19 19:42 ` Ingo Molnar
2013-12-19 22:35 ` Joe Perches
2013-12-20 9:11 ` Peter Zijlstra
2013-12-20 10:27 ` Ingo Molnar
2013-12-19 23:14 ` Linus Torvalds
2013-12-19 23:42 ` Davidlohr Bueso
2013-12-19 23:53 ` Linus Torvalds
2013-12-20 0:04 ` Linus Torvalds
2013-12-20 0:26 ` Darren Hart
2013-12-20 0:12 ` Linus Torvalds
2013-12-20 2:22 ` Davidlohr Bueso
2013-12-20 3:13 ` Linus Torvalds
2013-12-20 6:13 ` Davidlohr Bueso
2013-12-20 8:54 ` Peter Zijlstra
2013-12-20 19:30 ` Davidlohr Bueso
2013-12-20 19:54 ` Linus Torvalds
2013-12-21 5:57 ` Davidlohr Bueso
2013-12-21 1:36 ` Davidlohr Bueso
2013-12-21 2:34 ` Darren Hart
2013-12-21 3:08 ` Davidlohr Bueso
2013-12-25 3:13 ` Waiman Long [this message]
2014-01-02 11:37 ` Davidlohr Bueso
-- strict thread matches above, loose matches on Subject: below --
2013-12-03 9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-03 9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-10 18:10 ` [PATCH v3 " Davidlohr Bueso
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52BA4D4C.6070709@hp.com \
--to=waiman.long@hp.com \
--cc=aswin@hp.com \
--cc=davidlohr@hp.com \
--cc=dvhart@linux.intel.com \
--cc=efault@gmx.de \
--cc=jason.low2@hp.com \
--cc=jeffm@suse.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=mingo@redhat.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=scott.norton@hp.com \
--cc=tglx@linutronix.de \
--cc=tom.vaden@hp.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox