All of lore.kernel.org
 help / color / mirror / Atom feed
From: Waiman Long <waiman.long@hp.com>
To: Davidlohr Bueso <davidlohr@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Darren Hart <dvhart@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Mike Galbraith <efault@gmx.de>, Jeff Mahoney <jeffm@suse.com>,
	Jason Low <jason.low2@hp.com>, Tom Vaden <tom.vaden@hp.com>,
	"Norton, Scott J" <scott.norton@hp.com>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup
Date: Tue, 24 Dec 2013 22:13:16 -0500	[thread overview]
Message-ID: <52BA4D4C.6070709@hp.com> (raw)
In-Reply-To: <1387589770.3119.2.camel@buesod1.americas.hpqcorp.net>

On 12/20/2013 08:36 PM, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
>> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso<davidlohr@hp.com>  wrote:
>>> - increment the counter at queue_lock() as we always end up calling
>>>    queue_me() which adds the element to the list. Upon any error,
>>>    queue_unlock() is called for housekeeping, for which we decrement
>>>    to mach the increment done in queue_lock().
>>>
>>> - decrement the counter at __unqueue_me() to reflect when an element is
>>>    removed from the queue for wakeup related purposes.
>> I still hate this whole separate counter thing. It seems really annoying.
>>
>> If re-ordering things didn't work out, then why can't just the counter
>> we *already* have in the spinlock itself work as the counter? Your
>> counter update logic seems to basically match when you take the
>> spinlock anyway.
> So the following has passed all testing, just like the atomics variant.
> Thoughts?
>
> Thanks,
> Davidlohr
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fcc6850..c8c7ce5 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -73,19 +73,22 @@
>    * Basic futex operation and ordering guarantees:
>    *
>    * The waiter reads the futex value in user space and calls
> - * futex_wait(). This function computes the hash bucket and acquires
> - * the hash bucket lock. After that it reads the futex user space value
> - * again and verifies that the data has not changed. If it has not
> - * changed it enqueues itself into the hash bucket, releases the hash
> + * futex_wait(). It computes the hash bucket and acquires the hash
> + * bucket lock. After that it reads the futex user space value again
> + * and verifies that the data has not changed. If it has not changed
> + * it enqueues itself into the hash bucket, releases the hash
>    * bucket lock and schedules.
>    *
>    * The waker side modifies the user space value of the futex and calls
> - * futex_wake(). This functions computes the hash bucket and acquires
> - * the hash bucket lock. Then it looks for waiters on that futex in the
> - * hash bucket and wakes them.
> + * futex_wake(). It computes the hash bucket and acquires the hash
> + * bucket lock. Then it looks for waiters on that futex in the hash
> + * bucket and wakes them.
>    *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> + * taking the hb spinlock can be avoided and simply return. In order for this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
>    *
>    * CPU 0                               CPU 1
>    * val = *futex;
> @@ -106,24 +109,50 @@
>    * This would cause the waiter on CPU 0 to wait forever because it
>    * missed the transition of the user space value from val to newval
>    * and the waker did not find the waiter in the hash bucket queue.
> - * The spinlock serializes that:
> + *
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
>    *
>    * CPU 0                               CPU 1
>    * val = *futex;
>    * sys_futex(WAIT, futex, val);
>    *   futex_wait(futex, val);
> - *   lock(hash_bucket(futex));
> - *   uval = *futex;
> - *                                     *futex = newval;
> - *                                     sys_futex(WAKE, futex);
> - *                                       futex_wake(futex);
> - *                                       lock(hash_bucket(futex));
> + *
> + *   waiters++;
> + *   mb(); (A)<-- paired with -.
> + *                              |
> + *   lock(hash_bucket(futex));  |
> + *                              |
> + *   uval = *futex;             |
> + *                              |        *futex = newval;
> + *                              |        sys_futex(WAKE, futex);
> + *                              |          futex_wake(futex);
> + *                              |
> + *                              `------->    mb(); (B)

Checking the state of the spinlock counter isn't the same as 
incrementing a waiter count. So your pseudo code here is misleading. See 
further explanation below.

>    *   if (uval == val)
> - *      queue();
> + *     queue();
>    *     unlock(hash_bucket(futex));
> - *     schedule();                       if (!queue_empty())
> - *                                         wake_waiters(futex);
> - *                                       unlock(hash_bucket(futex));
> + *     schedule();                         if (waiters)
> + *                                           lock(hash_bucket(futex));
> + *                                           wake_waiters(futex);
> + *                                           unlock(hash_bucket(futex));
> + *
> + * Where (A) orders the waiters increment and the futex value read -- this
> + * is guaranteed by the head counter in the hb spinlock; and where (B)
> + * orders the write to futex and the waiters read.
> + *
> + * This yields the following case (where X:=waiters, Y:=futex):
> + *
> + *	X = Y = 0
> + *
> + *	w[X]=1		w[Y]=1
> + *	MB		MB
> + *	r[Y]=y		r[X]=x
> + *
> + * Which guarantees that x==0&&  y==0 is impossible; which translates back into
> + * the guarantee that we cannot both miss the futex variable change and the
> + * enqueue.
>    */
>
>   int __read_mostly futex_cmpxchg_enabled;
> @@ -211,6 +240,35 @@ static unsigned long __read_mostly futex_hashsize;
>
>   static struct futex_hash_bucket *futex_queues;
>
> +static inline void futex_get_mm(union futex_key *key)
> +{
> +	atomic_inc(&key->private.mm->mm_count);
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Ensure futex_get_mm() implies a full barrier such that
> +	 * get_futex_key() implies a full barrier. This is relied upon
> +	 * as full barrier (B), see the ordering comment above.
> +	 */
> +	smp_mb__after_atomic_inc();
> +#endif
> +}
> +
> +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If the hash bucket is locked then we know the ticket counter
> +	 * is non-zero and thus there is at least one waiter in the queue.
> +	 */
> +	if (spin_is_locked(&hb->lock))
> +		return true;
> +	smp_rmb(); /* Make sure we check the lock state first */
> +	return !plist_head_empty(&hb->chain);
> +#else
> +	return true;
> +#endif
> +}

The ticket spinlock counter is a cyclic counter that can cycle through 0 
periodically. So the zero-ness of the counter has no relation to whether 
it is locked or not.  Your comment above is not correct. What 
spin_is_locked() can tell you is whether one or more tasks are trying to 
get into the critical section which can be a waiter (most likely) or a 
waker. Coupled with checking if the list is empty, that could be a 
cheaper alternative to using a separate atomic counter, but it is also 
slightly less reliable and has a higher chance of false positive.

-Longman

  parent reply	other threads:[~2013-12-25  3:13 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-19 18:45 [PATCH v3 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 1/4] futex: Misc cleanups Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 2/4] futex: Larger hash table Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 3/4] futex: Document ordering guarantees Davidlohr Bueso
2013-12-19 18:45 ` [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-19 19:25   ` Ingo Molnar
2013-12-19 19:29     ` Davidlohr Bueso
2013-12-19 19:42       ` Ingo Molnar
2013-12-19 22:35         ` Joe Perches
2013-12-20  9:11           ` Peter Zijlstra
2013-12-20 10:27             ` Ingo Molnar
2013-12-19 23:14   ` Linus Torvalds
2013-12-19 23:42     ` Davidlohr Bueso
2013-12-19 23:53       ` Linus Torvalds
2013-12-20  0:04         ` Linus Torvalds
2013-12-20  0:26           ` Darren Hart
2013-12-20  0:12         ` Linus Torvalds
2013-12-20  2:22         ` Davidlohr Bueso
2013-12-20  3:13           ` Linus Torvalds
2013-12-20  6:13             ` Davidlohr Bueso
2013-12-20  8:54             ` Peter Zijlstra
2013-12-20 19:30     ` Davidlohr Bueso
2013-12-20 19:54       ` Linus Torvalds
2013-12-21  5:57         ` Davidlohr Bueso
2013-12-21  1:36     ` Davidlohr Bueso
2013-12-21  2:34       ` Darren Hart
2013-12-21  3:08         ` Davidlohr Bueso
2013-12-25  3:13       ` Waiman Long [this message]
2014-01-02 11:37         ` Davidlohr Bueso
  -- strict thread matches above, loose matches on Subject: below --
2013-12-03  9:45 [PATCH v2 0/4] futex: Wakeup optimizations Davidlohr Bueso
2013-12-03  9:45 ` [PATCH v2 4/4] futex: Avoid taking hb lock if nothing to wakeup Davidlohr Bueso
2013-12-10 18:10   ` [PATCH v3 " Davidlohr Bueso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52BA4D4C.6070709@hp.com \
    --to=waiman.long@hp.com \
    --cc=aswin@hp.com \
    --cc=davidlohr@hp.com \
    --cc=dvhart@linux.intel.com \
    --cc=efault@gmx.de \
    --cc=jason.low2@hp.com \
    --cc=jeffm@suse.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=scott.norton@hp.com \
    --cc=tglx@linutronix.de \
    --cc=tom.vaden@hp.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.