* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
@ 2004-11-15 4:24 Chuck Ebbert
2004-11-15 8:08 ` Jamie Lokier
0 siblings, 1 reply; 7+ messages in thread
From: Chuck Ebbert @ 2004-11-15 4:24 UTC (permalink / raw)
To: Emergency Services Jamie Lokier; +Cc: linux-kernel
On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:
>+ * The basic logical guarantee of a futex is that it blocks ONLY
>+ * if cond(var) is known to be true at the time of blocking, for
>+ * any cond. If we queued after testing *uaddr, that would open
>+ * a race condition where we could block indefinitely with
>+ * cond(var) false, which would violate the guarantee.
>+ *
>+ * A consequence is that futex_wait() can return zero and absorb
>+ * a wakeup when *uaddr != val on entry to the syscall. This is
>+ * rare, but normal.
Why can't it absorb a wakeup and still return -EAGAIN when this happens?
IOW why not apply this patch to the original code?
================================================================================
return -EINTR;
out_unqueue:
- /* If we were woken (and unqueued), we succeeded, whatever. */
- if (!unqueue_me(&q))
- ret = 0;
+ unqueue_me(&q); /* ignore result from unqueue */
out_release_sem:
up_read(¤t->mm->mmap_sem);
return ret;
================================================================================
...and what is "Emergency Services", BTW?
--Chuck Ebbert 14-Nov-04 21:28:56
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) 2004-11-15 4:24 Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) Chuck Ebbert @ 2004-11-15 8:08 ` Jamie Lokier 0 siblings, 0 replies; 7+ messages in thread From: Jamie Lokier @ 2004-11-15 8:08 UTC (permalink / raw) To: Chuck Ebbert; +Cc: linux-kernel Chuck Ebbert wrote: > On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote: > > >+ * The basic logical guarantee of a futex is that it blocks ONLY > >+ * if cond(var) is known to be true at the time of blocking, for > >+ * any cond. If we queued after testing *uaddr, that would open > >+ * a race condition where we could block indefinitely with > >+ * cond(var) false, which would violate the guarantee. > >+ * > >+ * A consequence is that futex_wait() can return zero and absorb > >+ * a wakeup when *uaddr != val on entry to the syscall. This is > >+ * rare, but normal. > > Why can't it absorb a wakeup and still return -EAGAIN when this happens? > IOW why not apply this patch to the original code? > > out_unqueue: > - /* If we were woken (and unqueued), we succeeded, whatever. */ > - if (!unqueue_me(&q)) > - ret = 0; > + unqueue_me(&q); /* ignore result from unqueue */ > out_release_sem: > up_read(¤t->mm->mmap_sem); > return ret; Because the number of wakeups reported to FUTEX_WAKE must _exactly_ match the number of wakeups reported to FUTEX_WAIT. They are like tokens, and for some data structures the return value mustn't be lost or ignored, because that would break structure invariants - such as the matching counters in the pthread condvars which precipitated this thread. > ...and what is "Emergency Services", BTW? My little joke, as I wouldn't have known about this if Andrew Morton hadn't forwarded me the message asking about it (I've been away from l-k). -- Jamie ^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <20041113164048.2f31a8dd.akpm@osdl.org>]
* Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) [not found] <20041113164048.2f31a8dd.akpm@osdl.org> @ 2004-11-14 9:00 ` Emergency Services Jamie Lokier 2004-11-14 9:09 ` Andrew Morton 0 siblings, 1 reply; 7+ messages in thread From: Emergency Services Jamie Lokier @ 2004-11-14 9:00 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, Rusty Russell, Ingo Molnar, Hidetoshi Seto, bert hubert Andrew Morton wrote: > Jamie, if you're around: help! Revert the patch which moves queue_me(); it's buggy. It is a bug to move queue_me() after get_user(). It fully explains the blocking threads in Apache and Evolution. Even if it worked, the patch wouldn't have saved any time, as it's a rare condition if the caller is using futex properly. The patch below provides an explanation; I'd appreciate it being applied. --- > - According to man of futex: > "If the futex was not equal to the expected value, the operation > returns -EWOULDBLOCK." > but now, here is no description about the rare case: > "returns 0 if the futex was not equal to the expected value, but > the process was woken by a FUTEX_WAKE call." > this behavior on rare case causes the hang which I found. This case can occur, by design. Bert, are you still updating the futex man pages? (Or is anyone else?) If you are, then: The patch below might provide some text for use in the manual, but even if you can't easily explain why, the possibility of FUTEX_WAIT returning 0 and counting as a wakeup when the memory word doesn't equal val should be mentioned. I'd appreciate being added to the authors list while you're there, thanks :) I think the man page would be a little clearer if the various E conditions (ETIMEDOUT etc.) were listed in the errors section (even though they aren't errors). Think about consitency with other man pages which list EINTR and EAGAIN there. Also, it would be consistent to say EAGAIN instead of EWOULDBLOCK (they're synonyms in Linux anyway, but other man pages use EAGAIN as it's the modern name for it). The phrase "(or other spurious error)" should be removed as it's actually a kernel bug (but not serious) for that to occur, and no different to EINTR from other syscalls in that respect. In the section for FUTEX_WAIT behaviour, you might explain what "atomically verifies .. and sleeps awaiting FUTEX_WAKE" really means, perhaps removing the word atomic. It's not really atomic-test-conditional-sleep, it's just carefully ordered. (Though it's equivalent to atomic-sleep-test followed by conditional-wake). The difference is precisely that it may return 0 and count as a wakeup even when the memory word doesn't match prior to the effectively-atomic region. --- Hidetoshi Seto's example (at the FTP URL with the patch) calls pthread_cond_signal without mentioning a mutex. That's the wrong way to use pthread_cond_signal, as explained in the Glibc documentation. Note that moving queue_me() after get_user() in futex_wake() does NOT fix Hidetoshi's observed problem. Just think about the same 4 threads in "[simulation]", but scheduled in a slightly different sequence. Especially, look at splitting up the sequence _beteen_ get_user() and queue_me(), and _between_ "wake++ and updated futex val" and "FUTEX_WAKE: no one is in waitqueue / A is in waitqueue". The basic logical reason why Hidetoshi's patch doesn't fix anything is that if the get_user() test is done before queue_me() in the kernel, that is *exactly the same* as if userspace does the word test itself just before calling FUTEX_WAIT and FUTEX_WAIT doesn't do any test at all. In Hidetoshi's pseudo-code, the bug is in pthread_cond_signal: it should test the return value of FUTEX_WAKE and increment the wake variable conditionally, not unconditionally as it does. Fix that, and subsequent signals will wake B. The reason B is not woken initially is because mutexes aren't used. These aren't futex bugs. --- You're right about the double-down_read() problem - I hadn't realised that could deadlock. That will require a per-task flag to make the fault handler not take the semaphore when the fault occurs in these places. But that's a separate bug, not addressed here. --- That ->nqueued loop in FUTEX_CMP_REQUEUE is able to return -EAGAIN even when the memory word does equal the argument - that's quite ugly. That and the smp_mb() section look dubious. They're a workaround to simulate doing something inside the spinlocks, but that is different to the ordering properties that FUTEX_WAIT offers. I mention this because it's nearly the same problem as prompted this patch: that FUTEX_WAIT isn't as atomic as some people think it is, and most importantly, making it more atomic (by using the spinlocks) does not fix design problems in the caller. That suggests to me that the callers of FUTEX_CMP_REQUEUE, if they depend on that ->nqueued / smb_mb() loop, then they may have a race which will cause problems. If they don't depend on it, then it shouldn't be there. In fact that whole primitive does not look very conceptually convincing. Some kind of requeue-and-test primtive makes sense, but conceptually, it would make sense to be testing *uaddr2 at the same time, but it doesn't. --- Signed-off-by: Jamie Lokier <jamie@shareable.org> Explain why futex waiters must queue the current thread before testing the memory word, not after. Consequently, futex_wait() can return 0 and count as a wakeup even if the memory word doesn't match the value at the start of the syscall. c.orig 2004-11-03 04:04:50.000000000 +0000 +++ linux-2.6.9/kernel/futex.c 2004-11-14 08:58:27.067607610 +0000 @@ -6,7 +6,7 @@ * (C) Copyright 2003 Red Hat Inc, All Rights Reserved * * Removed page pinning, fix privately mapped COW pages and other cleanups - * (C) Copyright 2003 Jamie Lokier + * (C) Copyright 2003, 2004 Jamie Lokier * * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly * enough at me, Linus for the original (flawed) idea, Matthew @@ -489,9 +489,24 @@ queue_me(&q, -1, NULL); /* - * Access the page after the futex is queued. + * Access the page AFTER the futex is queued. + * Order is important: + * + * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val); + * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); } + * + * The basic logical guarantee of a futex is that it blocks ONLY + * if cond(var) is known to be true at the time of blocking, for + * any cond. If we queued after testing *uaddr, that would open + * a race condition where we could block indefinitely with + * cond(var) false, which would violate the guarantee. + * + * A consequence is that futex_wait() can return zero and absorb + * a wakeup when *uaddr != val on entry to the syscall. This is + * rare, but normal. + * * We hold the mmap semaphore, so the mapping cannot have changed - * since we looked it up. + * since we looked it up in get_futex_key. */ if (get_user(curval, (int __user *)uaddr) != 0) { ret = -EFAULT; ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) 2004-11-14 9:00 ` Emergency Services Jamie Lokier @ 2004-11-14 9:09 ` Andrew Morton 2004-11-14 9:23 ` Jamie Lokier 0 siblings, 1 reply; 7+ messages in thread From: Andrew Morton @ 2004-11-14 9:09 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel, rusty, mingo, seto.hidetoshi, ahu Emergency Services Jamie Lokier <jamie@shareable.org> wrote: > > Revert the patch which moves queue_me(); it's buggy. It is a bug to > move queue_me() after get_user(). yup. > It fully explains the blocking threads in Apache and Evolution. > > Even if it worked, the patch wouldn't have saved any time, as it's a > rare condition if the caller is using futex properly. The patch wasn't supposed to optimise anything. It fixed a bug which was causing hangs. See ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch Or are you saying that userspace is buggy?? ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) 2004-11-14 9:09 ` Andrew Morton @ 2004-11-14 9:23 ` Jamie Lokier 2004-11-14 9:50 ` bert hubert 0 siblings, 1 reply; 7+ messages in thread From: Jamie Lokier @ 2004-11-14 9:23 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, rusty, mingo, seto.hidetoshi, ahu Andrew Morton wrote: > The patch wasn't supposed to optimise anything. It fixed a bug which was > causing hangs. See > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch > > Or are you saying that userspace is buggy?? I haven't looked at the NPTL code, but that URL's pseudo-code is buggy. The call to FUTEX_WAKE should be doing wake++ conditionally on the return value, not unconditionally. Also, the patch doesn't actually fix the described problem. It may hide it in tests, but the race or a similar one is present in a different execution order. The real NPTL code is more complicated than described at that URL, because real pthread_cond_wait takes a mutex argument as well. The bug report does not say how that is handled, and it is critically important that the mutex and convar are updated concurrently in the right way. So I don't know if NPTL is buggy, but the pseudo-code given in the bug report is (because of unconditional wake++), and so is the failure example (because it doesn't use a mutex). -- Jamie ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) 2004-11-14 9:23 ` Jamie Lokier @ 2004-11-14 9:50 ` bert hubert 2004-11-15 14:12 ` Jamie Lokier 0 siblings, 1 reply; 7+ messages in thread From: bert hubert @ 2004-11-14 9:50 UTC (permalink / raw) To: Jamie Lokier; +Cc: Andrew Morton, linux-kernel, rusty, mingo, seto.hidetoshi On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote: > So I don't know if NPTL is buggy, but the pseudo-code given in the bug > report is (because of unconditional wake++), and so is the failure > example (because it doesn't use a mutex). Please advise if 'Emergency Services''s update to the manpage is correct (two levels up this message thread), if so, I can apply it and forward to aeb. Thanks. -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) 2004-11-14 9:50 ` bert hubert @ 2004-11-15 14:12 ` Jamie Lokier 0 siblings, 0 replies; 7+ messages in thread From: Jamie Lokier @ 2004-11-15 14:12 UTC (permalink / raw) To: bert hubert, Andrew Morton, linux-kernel, rusty, mingo, seto.hidetoshi bert hubert wrote: > On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote: > > > So I don't know if NPTL is buggy, but the pseudo-code given in the bug > > report is (because of unconditional wake++), and so is the failure > > example (because it doesn't use a mutex). > > Please advise if 'Emergency Services''s update to the manpage is correct > (two levels up this message thread), if so, I can apply it and forward to > aeb. 'Emergency Services' was me, if that's what you're asking. I believe the updates to be correct and I have studied the futex code quite a lot. Two more things for the man page. You wrote: To reiterate, bare futexes are not intended as an easy to use abstraction for end-users. Implementors are expected to be assembly literate and to have read the sources of the futex userspace library referenced below. I agree they are not intended as an easy to use abstraction. However, users do not have to be assembly literate, in the sense that it is possible to write code using futex which is architecture-indepedent. For mutexes, architecture-dependent locked bus cycles are used, but some code which uses futex is written in C using counters. pthread_cond_signal/wait which started this thread is an example. So I suggest a change to read: To reiterate, bare futexes are not intended as an easy to use abstraction for end-users. Implementors are expected to understand processor memory ordering, barriers and synchronisation, and to have read the sources of the futex userspace library referenced below. Secondly, is it appropriate to add Ulrich Drepper's "Futexes Are Tricky" paper to SEE ALSO? "Futexes Are Tricky", Ulrich Drepper, June 2004, http://people.redhat.com/drepper/futex.pdf It's a very interesting paper, worth reading. But note that Ulrich's description of the FUTEX_WAIT operation in that paper is *wrong*: This means that the operation to wait on a futex is composed of getting the lock for the futex, checking the current value, if necessary adding the thread to the wait queue, and releasing the lock. In fact, waiting does not get the lock for the futex. It relies on the ordering of (1) adding to the wait queue, (2) checking the current value, and (3) removing from the wait queue if the value doesn't match. Among other things, this is necessary because checking the current value cannot be done with a spinlock held. The effect is very similar, but not exactly the same. -- Jamie ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2004-11-15 14:14 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-11-15 4:24 Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) Chuck Ebbert
2004-11-15 8:08 ` Jamie Lokier
[not found] <20041113164048.2f31a8dd.akpm@osdl.org>
2004-11-14 9:00 ` Emergency Services Jamie Lokier
2004-11-14 9:09 ` Andrew Morton
2004-11-14 9:23 ` Jamie Lokier
2004-11-14 9:50 ` bert hubert
2004-11-15 14:12 ` Jamie Lokier
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.