All of lore.kernel.org
 help / color / mirror / Atom feed
* Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
       [not found] <20041113164048.2f31a8dd.akpm@osdl.org>
@ 2004-11-14  9:00 ` Emergency Services Jamie Lokier
  2004-11-14  9:09   ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Emergency Services Jamie Lokier @ 2004-11-14  9:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Rusty Russell, Ingo Molnar, Hidetoshi Seto,
	bert hubert

Andrew Morton wrote:
> Jamie, if you're around: help!

Revert the patch which moves queue_me(); it's buggy.  It is a bug to
move queue_me() after get_user().

It fully explains the blocking threads in Apache and Evolution.

Even if it worked, the patch wouldn't have saved any time, as it's a
rare condition if the caller is using futex properly.

The patch below provides an explanation; I'd appreciate it being
applied.

    ---

>  - According to man of futex:
>      "If the futex was not equal to the expected value, the operation
>       returns -EWOULDBLOCK."
>    but now, here is no description about the rare case:
>      "returns 0 if the futex was not equal to the expected value, but
>       the process was woken by a FUTEX_WAKE call."
>    this behavior on rare case causes the hang which I found.

This case can occur, by design.

Bert, are you still updating the futex man pages?  (Or is anyone else?)

If you are, then:

The patch below might provide some text for use in the manual, but
even if you can't easily explain why, the possibility of FUTEX_WAIT
returning 0 and counting as a wakeup when the memory word doesn't
equal val should be mentioned.

I'd appreciate being added to the authors list while you're there,
thanks :)

I think the man page would be a little clearer if the various E
conditions (ETIMEDOUT etc.) were listed in the errors section (even
though they aren't errors).  Think about consitency with other man
pages which list EINTR and EAGAIN there.  Also, it would be consistent
to say EAGAIN instead of EWOULDBLOCK (they're synonyms in Linux
anyway, but other man pages use EAGAIN as it's the modern name for it).

The phrase "(or other spurious error)" should be removed as it's
actually a kernel bug (but not serious) for that to occur, and
no different to EINTR from other syscalls in that respect.

In the section for FUTEX_WAIT behaviour, you might explain what
"atomically verifies .. and sleeps awaiting FUTEX_WAKE" really means,
perhaps removing the word atomic.  It's not really
atomic-test-conditional-sleep, it's just carefully ordered.  (Though
it's equivalent to atomic-sleep-test followed by conditional-wake).

The difference is precisely that it may return 0 and count as a wakeup
even when the memory word doesn't match prior to the
effectively-atomic region.

    ---

Hidetoshi Seto's example (at the FTP URL with the patch) calls
pthread_cond_signal without mentioning a mutex.  That's the wrong way
to use pthread_cond_signal, as explained in the Glibc documentation.

Note that moving queue_me() after get_user() in futex_wake() does NOT
fix Hidetoshi's observed problem.

Just think about the same 4 threads in "[simulation]", but scheduled
in a slightly different sequence.  Especially, look at splitting up
the sequence _beteen_ get_user() and queue_me(), and _between_ "wake++
and updated futex val" and "FUTEX_WAKE: no one is in waitqueue / A is
in waitqueue".

The basic logical reason why Hidetoshi's patch doesn't fix anything is
that if the get_user() test is done before queue_me() in the kernel,
that is *exactly the same* as if userspace does the word test itself
just before calling FUTEX_WAIT and FUTEX_WAIT doesn't do any test at all.

In Hidetoshi's pseudo-code, the bug is in pthread_cond_signal: it
should test the return value of FUTEX_WAKE and increment the wake
variable conditionally, not unconditionally as it does.  Fix that, and
subsequent signals will wake B.  The reason B is not woken initially
is because mutexes aren't used.  These aren't futex bugs.

    ---

You're right about the double-down_read() problem - I hadn't realised
that could deadlock.  That will require a per-task flag to make the
fault handler not take the semaphore when the fault occurs in these
places.  But that's a separate bug, not addressed here.

    ---

That ->nqueued loop in FUTEX_CMP_REQUEUE is able to return -EAGAIN
even when the memory word does equal the argument - that's quite ugly.

That and the smp_mb() section look dubious.  They're a workaround to
simulate doing something inside the spinlocks, but that is different
to the ordering properties that FUTEX_WAIT offers.

I mention this because it's nearly the same problem as prompted this
patch: that FUTEX_WAIT isn't as atomic as some people think it is, and
most importantly, making it more atomic (by using the spinlocks) does
not fix design problems in the caller.

That suggests to me that the callers of FUTEX_CMP_REQUEUE, if they
depend on that ->nqueued / smb_mb() loop, then they may have a race
which will cause problems.  If they don't depend on it, then it
shouldn't be there.

In fact that whole primitive does not look very conceptually
convincing.  Some kind of requeue-and-test primtive makes sense, but
conceptually, it would make sense to be testing *uaddr2 at the same
time, but it doesn't.

    ---

Signed-off-by: Jamie Lokier <jamie@shareable.org>

Explain why futex waiters must queue the current thread before testing
the memory word, not after.  Consequently, futex_wait() can return 0
and count as a wakeup even if the memory word doesn't match the
value at the start of the syscall.

c.orig	2004-11-03 04:04:50.000000000 +0000
+++ linux-2.6.9/kernel/futex.c	2004-11-14 08:58:27.067607610 +0000
@@ -6,7 +6,7 @@
  *  (C) Copyright 2003 Red Hat Inc, All Rights Reserved
  *
  *  Removed page pinning, fix privately mapped COW pages and other cleanups
- *  (C) Copyright 2003 Jamie Lokier
+ *  (C) Copyright 2003, 2004 Jamie Lokier
  *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
@@ -489,9 +489,24 @@
 	queue_me(&q, -1, NULL);
 
 	/*
-	 * Access the page after the futex is queued.
+	 * Access the page AFTER the futex is queued.
+	 * Order is important:
+	 *
+	 *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
+	 *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
+	 *
+	 * The basic logical guarantee of a futex is that it blocks ONLY
+	 * if cond(var) is known to be true at the time of blocking, for
+	 * any cond.  If we queued after testing *uaddr, that would open
+	 * a race condition where we could block indefinitely with
+	 * cond(var) false, which would violate the guarantee.
+	 *
+	 * A consequence is that futex_wait() can return zero and absorb
+	 * a wakeup when *uaddr != val on entry to the syscall.  This is
+	 * rare, but normal.
+	 *
 	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up.
+	 * since we looked it up in get_futex_key.
 	 */
 	if (get_user(curval, (int __user *)uaddr) != 0) {
 		ret = -EFAULT;

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
  2004-11-14  9:00 ` Emergency Services Jamie Lokier
@ 2004-11-14  9:09   ` Andrew Morton
  2004-11-14  9:23     ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-11-14  9:09 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, rusty, mingo, seto.hidetoshi, ahu

Emergency Services Jamie Lokier <jamie@shareable.org> wrote:
>
> Revert the patch which moves queue_me(); it's buggy.  It is a bug to
>  move queue_me() after get_user().

yup.

>  It fully explains the blocking threads in Apache and Evolution.
> 
>  Even if it worked, the patch wouldn't have saved any time, as it's a
>  rare condition if the caller is using futex properly.

The patch wasn't supposed to optimise anything.  It fixed a bug which was
causing hangs.  See
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch

Or are you saying that userspace is buggy??

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
  2004-11-14  9:09   ` Andrew Morton
@ 2004-11-14  9:23     ` Jamie Lokier
  2004-11-14  9:50       ` bert hubert
  0 siblings, 1 reply; 7+ messages in thread
From: Jamie Lokier @ 2004-11-14  9:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, rusty, mingo, seto.hidetoshi, ahu

Andrew Morton wrote:
> The patch wasn't supposed to optimise anything.  It fixed a bug which was
> causing hangs.  See
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch
> 
> Or are you saying that userspace is buggy??

I haven't looked at the NPTL code, but that URL's pseudo-code is buggy.
The call to FUTEX_WAKE should be doing wake++ conditionally on the
return value, not unconditionally.

Also, the patch doesn't actually fix the described problem.

It may hide it in tests, but the race or a similar one is present in a
different execution order.

The real NPTL code is more complicated than described at that URL,
because real pthread_cond_wait takes a mutex argument as well.  The
bug report does not say how that is handled, and it is critically
important that the mutex and convar are updated concurrently in the
right way.

So I don't know if NPTL is buggy, but the pseudo-code given in the bug
report is (because of unconditional wake++), and so is the failure
example (because it doesn't use a mutex).

-- Jamie

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
  2004-11-14  9:23     ` Jamie Lokier
@ 2004-11-14  9:50       ` bert hubert
  2004-11-15 14:12         ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: bert hubert @ 2004-11-14  9:50 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andrew Morton, linux-kernel, rusty, mingo, seto.hidetoshi

On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote:

> So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> report is (because of unconditional wake++), and so is the failure
> example (because it doesn't use a mutex).

Please advise if 'Emergency Services''s update to the manpage is correct
(two levels up this message thread), if so, I can apply it and forward to
aeb.

Thanks.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
@ 2004-11-15  4:24 Chuck Ebbert
  2004-11-15  8:08 ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Chuck Ebbert @ 2004-11-15  4:24 UTC (permalink / raw)
  To: Emergency Services Jamie Lokier; +Cc: linux-kernel

On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:

>+       * The basic logical guarantee of a futex is that it blocks ONLY
>+       * if cond(var) is known to be true at the time of blocking, for
>+       * any cond.  If we queued after testing *uaddr, that would open
>+       * a race condition where we could block indefinitely with
>+       * cond(var) false, which would violate the guarantee.
>+       *
>+       * A consequence is that futex_wait() can return zero and absorb
>+       * a wakeup when *uaddr != val on entry to the syscall.  This is
>+       * rare, but normal.


   Why can't it absorb a wakeup and still return -EAGAIN when this happens?

   IOW why not apply this patch to the original code?

================================================================================
        return -EINTR;
 
  out_unqueue:
-       /* If we were woken (and unqueued), we succeeded, whatever. */
-       if (!unqueue_me(&q))
-               ret = 0;
+       unqueue_me(&q); /* ignore result from unqueue */
  out_release_sem:
        up_read(&current->mm->mmap_sem);
        return ret;
================================================================================

   ...and what is "Emergency Services", BTW?

--Chuck Ebbert  14-Nov-04  21:28:56

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
  2004-11-15  4:24 Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) Chuck Ebbert
@ 2004-11-15  8:08 ` Jamie Lokier
  0 siblings, 0 replies; 7+ messages in thread
From: Jamie Lokier @ 2004-11-15  8:08 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel

Chuck Ebbert wrote:
> On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:
> 
> >+       * The basic logical guarantee of a futex is that it blocks ONLY
> >+       * if cond(var) is known to be true at the time of blocking, for
> >+       * any cond.  If we queued after testing *uaddr, that would open
> >+       * a race condition where we could block indefinitely with
> >+       * cond(var) false, which would violate the guarantee.
> >+       *
> >+       * A consequence is that futex_wait() can return zero and absorb
> >+       * a wakeup when *uaddr != val on entry to the syscall.  This is
> >+       * rare, but normal.
> 
>    Why can't it absorb a wakeup and still return -EAGAIN when this happens?
>    IOW why not apply this patch to the original code?
> 
>   out_unqueue:
> -       /* If we were woken (and unqueued), we succeeded, whatever. */
> -       if (!unqueue_me(&q))
> -               ret = 0;
> +       unqueue_me(&q); /* ignore result from unqueue */
>   out_release_sem:
>         up_read(&current->mm->mmap_sem);
>         return ret;

Because the number of wakeups reported to FUTEX_WAKE must _exactly_
match the number of wakeups reported to FUTEX_WAIT.

They are like tokens, and for some data structures the return value
mustn't be lost or ignored, because that would break structure
invariants - such as the matching counters in the pthread condvars
which precipitated this thread.

>    ...and what is "Emergency Services", BTW?

My little joke, as I wouldn't have known about this if Andrew Morton
hadn't forwarded me the message asking about it (I've been away from l-k).

-- Jamie

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
  2004-11-14  9:50       ` bert hubert
@ 2004-11-15 14:12         ` Jamie Lokier
  0 siblings, 0 replies; 7+ messages in thread
From: Jamie Lokier @ 2004-11-15 14:12 UTC (permalink / raw)
  To: bert hubert, Andrew Morton, linux-kernel, rusty, mingo,
	seto.hidetoshi

bert hubert wrote:
> On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote:
> 
> > So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> > report is (because of unconditional wake++), and so is the failure
> > example (because it doesn't use a mutex).
> 
> Please advise if 'Emergency Services''s update to the manpage is correct
> (two levels up this message thread), if so, I can apply it and forward to
> aeb.

'Emergency Services' was me, if that's what you're asking.  I believe
the updates to be correct and I have studied the futex code quite a
lot.

Two more things for the man page.  You wrote:

     To reiterate, bare futexes are not intended as an easy to use
     abstraction for end-users.  Implementors are expected to be
     assembly literate and to have read the sources of the futex
     userspace library referenced below.

I agree they are not intended as an easy to use abstraction.  However,
users do not have to be assembly literate, in the sense that it is
possible to write code using futex which is architecture-indepedent.

For mutexes, architecture-dependent locked bus cycles are used, but
some code which uses futex is written in C using counters.
pthread_cond_signal/wait which started this thread is an example.  So
I suggest a change to read:

     To reiterate, bare futexes are not intended as an easy to use
     abstraction for end-users.  Implementors are expected to
     understand processor memory ordering, barriers and
     synchronisation, and to have read the sources of the futex
     userspace library referenced below.

Secondly, is it appropriate to add Ulrich Drepper's "Futexes Are
Tricky" paper to SEE ALSO?

     "Futexes Are Tricky", Ulrich Drepper, June 2004,
     http://people.redhat.com/drepper/futex.pdf

It's a very interesting paper, worth reading.  But note that Ulrich's
description of the FUTEX_WAIT operation in that paper is *wrong*:

     This means that the operation to wait on a futex is composed of
     getting the lock for the futex, checking the current value, if
     necessary adding the thread to the wait queue, and releasing the lock.

In fact, waiting does not get the lock for the futex.  It relies on
the ordering of (1) adding to the wait queue, (2) checking the current
value, and (3) removing from the wait queue if the value doesn't
match.  Among other things, this is necessary because checking the
current value cannot be done with a spinlock held.

The effect is very similar, but not exactly the same.

-- Jamie

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2004-11-15 14:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-11-15  4:24 Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) Chuck Ebbert
2004-11-15  8:08 ` Jamie Lokier
     [not found] <20041113164048.2f31a8dd.akpm@osdl.org>
2004-11-14  9:00 ` Emergency Services Jamie Lokier
2004-11-14  9:09   ` Andrew Morton
2004-11-14  9:23     ` Jamie Lokier
2004-11-14  9:50       ` bert hubert
2004-11-15 14:12         ` Jamie Lokier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.