* Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

From: Emergency Services Jamie Lokier @ 2004-11-14  9:00 UTC
To: Andrew Morton
Cc: linux-kernel, Rusty Russell, Ingo Molnar, Hidetoshi Seto, bert hubert

Andrew Morton wrote:
> Jamie, if you're around: help!

Revert the patch which moves queue_me(); it's buggy.  It is a bug to
move queue_me() after get_user().

The bug fully explains the blocked threads in Apache and Evolution.

Even if it worked, the patch wouldn't have saved any time, as it's a
rare condition when the caller is using futex properly.

The patch below provides an explanation; I'd appreciate it being applied.

---

> - According to the man page of futex:
>     "If the futex was not equal to the expected value, the operation
>      returns -EWOULDBLOCK."
>   but now there is no description of the rare case:
>     "returns 0 if the futex was not equal to the expected value, but
>      the process was woken by a FUTEX_WAKE call."
>   this behaviour in the rare case causes the hang which I found.

This case can occur, by design.

Bert, are you still updating the futex man pages?  (Or is anyone
else?)  If you are, then:

The patch below might provide some text for use in the manual, but
even if you can't easily explain why, the possibility of FUTEX_WAIT
returning 0 and counting as a wakeup when the memory word doesn't
equal val should be mentioned.

I'd appreciate being added to the authors list while you're there,
thanks :)

I think the man page would be a little clearer if the various E
conditions (ETIMEDOUT etc.) were listed in the errors section (even
though they aren't errors).  Think about consistency with other man
pages which list EINTR and EAGAIN there.

Also, it would be consistent to say EAGAIN instead of EWOULDBLOCK
(they're synonyms in Linux anyway, but other man pages use EAGAIN as
it's the modern name for it).

The phrase "(or other spurious error)" should be removed: it's
actually a kernel bug (though not a serious one) for that to occur,
no different from EINTR for other syscalls in that respect.

In the section on FUTEX_WAIT behaviour, you might explain what
"atomically verifies .. and sleeps awaiting FUTEX_WAKE" really means,
perhaps removing the word atomic.  It's not really
atomic-test-then-conditional-sleep; it's just carefully ordered.
(Though it's equivalent to atomic sleep-and-test followed by a
conditional wake.)  The difference is precisely that it may return 0
and count as a wakeup even when the memory word doesn't match just
before the effectively-atomic region.

---

Hidetoshi Seto's example (at the FTP URL with the patch) calls
pthread_cond_signal without mentioning a mutex.  That's the wrong way
to use pthread_cond_signal, as explained in the Glibc documentation.

Note that moving queue_me() after get_user() in futex_wait() does NOT
fix Hidetoshi's observed problem.  Just think about the same 4 threads
in "[simulation]", but scheduled in a slightly different sequence.  In
particular, look at splitting the sequence _between_ get_user() and
queue_me(), and _between_ "wake++ and updated futex val" and
"FUTEX_WAKE: no one is in waitqueue / A is in waitqueue".
The basic logical reason why Hidetoshi's patch doesn't fix anything is
that if the get_user() test is done before queue_me() in the kernel,
that is *exactly the same* as if userspace did the word test itself
just before calling FUTEX_WAIT, and FUTEX_WAIT did no test at all.

In Hidetoshi's pseudo-code, the bug is in pthread_cond_signal: it
should test the return value of FUTEX_WAKE and increment the wake
variable conditionally, not unconditionally as it does.  Fix that, and
subsequent signals will wake B.  The reason B is not woken initially
is that mutexes aren't used.  These aren't futex bugs.

---

You're right about the double-down_read() problem - I hadn't realised
that could deadlock.  That will require a per-task flag to make the
fault handler not take the semaphore when the fault occurs in these
places.  But that's a separate bug, not addressed here.

---

That ->nqueued loop in FUTEX_CMP_REQUEUE can return -EAGAIN even when
the memory word does equal the argument - that's quite ugly.

That and the smp_mb() section look dubious.  They're a workaround to
simulate doing something inside the spinlocks, but that is different
from the ordering properties that FUTEX_WAIT offers.

I mention this because it's nearly the same problem as prompted this
patch: FUTEX_WAIT isn't as atomic as some people think it is, and most
importantly, making it more atomic (by using the spinlocks) does not
fix design problems in the caller.

That suggests to me that if the callers of FUTEX_CMP_REQUEUE depend on
that ->nqueued / smp_mb() loop, they may have a race which will cause
problems.  If they don't depend on it, then it shouldn't be there.

In fact the whole primitive does not look conceptually convincing.
Some kind of requeue-and-test primitive makes sense, but conceptually
it ought to be testing *uaddr2 at the same time, and it doesn't.

---

Signed-off-by: Jamie Lokier <jamie@shareable.org>

Explain why futex waiters must queue the current thread before testing
the memory word, not after.  Consequently, futex_wait() can return 0
and count as a wakeup even if the memory word doesn't match the value
at the start of the syscall.

--- linux-2.6.9/kernel/futex.c.orig	2004-11-03 04:04:50.000000000 +0000
+++ linux-2.6.9/kernel/futex.c	2004-11-14 08:58:27.067607610 +0000
@@ -6,7 +6,7 @@
  *  (C) Copyright 2003 Red Hat Inc, All Rights Reserved
  *
  *  Removed page pinning, fix privately mapped COW pages and other cleanups
- *  (C) Copyright 2003 Jamie Lokier
+ *  (C) Copyright 2003, 2004 Jamie Lokier
  *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
@@ -489,9 +489,24 @@
 	queue_me(&q, -1, NULL);
 
 	/*
-	 * Access the page after the futex is queued.
+	 * Access the page AFTER the futex is queued.
+	 * Order is important:
+	 *
+	 *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
+	 *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
+	 *
+	 * The basic logical guarantee of a futex is that it blocks ONLY
+	 * if cond(var) is known to be true at the time of blocking, for
+	 * any cond.  If we queued after testing *uaddr, that would open
+	 * a race condition where we could block indefinitely with
+	 * cond(var) false, which would violate the guarantee.
+	 *
+	 * A consequence is that futex_wait() can return zero and absorb
+	 * a wakeup when *uaddr != val on entry to the syscall.  This is
+	 * rare, but normal.
+	 *
 	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up.
+	 * since we looked it up in get_futex_key.
 	 */
 	if (get_user(curval, (int __user *)uaddr) != 0) {
 		ret = -EFAULT;
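(Aside: a minimal sketch of the userspace protocol the patch comment
describes; the wrapper futex_wait_once() and the flag convention are
illustrative assumptions, not code from the thread.  Because
FUTEX_WAIT may return 0 without the flag being set - an absorbed
wakeup - the caller must re-check its condition in a loop:)

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex_wait_once(int *uaddr, int val)
    {
            /* Blocks only if *uaddr still equals val when the kernel
               tests it; may also return 0 on an absorbed wakeup. */
            return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL);
    }

    void wait_for_flag(volatile int *var)
    {
            for (;;) {
                    int val = *var;       /* read the word...        */
                    if (val != 0)         /* ...cond(val): flag set? */
                            break;
                    /* A spurious 0 return just goes round the loop. */
                    futex_wait_once((int *) var, val);
            }
    }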
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

From: Andrew Morton @ 2004-11-14  9:09 UTC
To: Jamie Lokier
Cc: linux-kernel, rusty, mingo, seto.hidetoshi, ahu

Emergency Services Jamie Lokier <jamie@shareable.org> wrote:
>
> Revert the patch which moves queue_me(); it's buggy.  It is a bug to
> move queue_me() after get_user().

yup.

> The bug fully explains the blocked threads in Apache and Evolution.
>
> Even if it worked, the patch wouldn't have saved any time, as it's a
> rare condition when the caller is using futex properly.

The patch wasn't supposed to optimise anything.  It fixed a bug which
was causing hangs.  See
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch

Or are you saying that userspace is buggy??
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

From: Jamie Lokier @ 2004-11-14  9:23 UTC
To: Andrew Morton
Cc: linux-kernel, rusty, mingo, seto.hidetoshi, ahu

Andrew Morton wrote:
> The patch wasn't supposed to optimise anything.  It fixed a bug which
> was causing hangs.  See
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch
>
> Or are you saying that userspace is buggy??

I haven't looked at the NPTL code, but that URL's pseudo-code is buggy.
The call to FUTEX_WAKE should be doing wake++ conditionally on the
return value, not unconditionally.

Also, the patch doesn't actually fix the described problem.  It may
hide it in tests, but the race, or a similar one, is present in a
different execution order.

The real NPTL code is more complicated than described at that URL,
because the real pthread_cond_wait takes a mutex argument as well.
The bug report does not say how that is handled, and it is critically
important that the mutex and condvar are updated concurrently in the
right way.

So I don't know if NPTL is buggy, but the pseudo-code given in the bug
report is (because of unconditional wake++), and so is the failure
example (because it doesn't use a mutex).

-- Jamie
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

From: bert hubert @ 2004-11-14  9:50 UTC
To: Jamie Lokier
Cc: Andrew Morton, linux-kernel, rusty, mingo, seto.hidetoshi

On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote:

> So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> report is (because of unconditional wake++), and so is the failure
> example (because it doesn't use a mutex).

Please advise whether 'Emergency Services''s update to the manpage is
correct (two levels up this message thread); if so, I can apply it and
forward it to aeb.

Thanks.

--
http://www.PowerDNS.com      Open source, database driven DNS Software
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

From: Jamie Lokier @ 2004-11-15 14:12 UTC
To: bert hubert, Andrew Morton, linux-kernel, rusty, mingo, seto.hidetoshi

bert hubert wrote:
> On Sun, Nov 14, 2004 at 09:23:08AM +0000, Jamie Lokier wrote:
>
> > So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> > report is (because of unconditional wake++), and so is the failure
> > example (because it doesn't use a mutex).
>
> Please advise whether 'Emergency Services''s update to the manpage is
> correct (two levels up this message thread); if so, I can apply it and
> forward it to aeb.

'Emergency Services' was me, if that's what you're asking.  I believe
the updates to be correct, and I have studied the futex code quite a
lot.

Two more things for the man page.  You wrote:

    To reiterate, bare futexes are not intended as an easy to use
    abstraction for end-users.  Implementors are expected to be
    assembly literate and to have read the sources of the futex
    userspace library referenced below.

I agree they are not intended as an easy to use abstraction.  However,
users do not have to be assembly literate, in the sense that it is
possible to write code using futex which is architecture-independent.
For mutexes, architecture-dependent locked bus cycles are used, but
some code which uses futex is written in C using counters.  The
pthread_cond_signal/wait which started this thread is an example.

So I suggest a change to read:

    To reiterate, bare futexes are not intended as an easy to use
    abstraction for end-users.  Implementors are expected to
    understand processor memory ordering, barriers and
    synchronisation, and to have read the sources of the futex
    userspace library referenced below.

Secondly, is it appropriate to add Ulrich Drepper's "Futexes Are
Tricky" paper to SEE ALSO?

    "Futexes Are Tricky", Ulrich Drepper, June 2004,
    http://people.redhat.com/drepper/futex.pdf

It's a very interesting paper, worth reading.  But note that Ulrich's
description of the FUTEX_WAIT operation in that paper is *wrong*:

    This means that the operation to wait on a futex is composed of
    getting the lock for the futex, checking the current value, if
    necessary adding the thread to the wait queue, and releasing the
    lock.

In fact, waiting does not get the lock for the futex.  It relies on
the ordering of (1) adding to the wait queue, (2) checking the current
value, and (3) removing from the wait queue if the value doesn't
match.  Among other things, this is necessary because checking the
current value cannot be done with a spinlock held.  The effect is very
similar, but not exactly the same.

-- Jamie
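(Aside: distilled into pseudo-code, the ordering described above - a
simplified sketch of the 2.6 futex_wait() shape, with locking and
error paths omitted, not the literal kernel source:)

    queue_me(&q);                 /* (1) add self to the hash-bucket queue */
    curval = *uaddr;              /* (2) read the word; no spinlock held   */
    if (curval != val) {
            if (!unqueue_me(&q))  /* (3) remove self; if already removed,  */
                    return 0;     /*     a FUTEX_WAKE raced in: report it  */
            return -EWOULDBLOCK;
    }
    schedule();                   /* sleep until FUTEX_WAKE, signal or timeout */
    if (!unqueue_me(&q))
            return 0;             /* woken by FUTEX_WAKE */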
* Re: Futex queue_me/get_user ordering

From: Hidetoshi Seto @ 2004-11-16  8:30 UTC
To: Jamie Lokier
Cc: bert hubert, Andrew Morton, linux-kernel, rusty, mingo

OMG...  Wait, wait...  Don't do anything.

I have to deeply apologize to all for my mistake.
If my understanding is correct, this bug is "2.4 futex" (RHEL3) *SPECIFIC*!!
I had swallowed the story that 2.6 futex has the same problem...

So I realize that 2.6 futex never behaves like this:

>> "returns 0 if the futex was not equal to the expected value, but
>>  the process was woken by a FUTEX_WAKE call."

Update of the manpage is now unnecessary, I think.

# First of all, I would appreciate it if you could read my old post:
#   "Kernel bug in futex_wait, cause application hang with NPTL"
#   http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

# Then, let's go on to the main subject.

Jamie Lokier wrote:
> In fact, waiting does not get the lock for the futex.  It relies on
> the ordering of (1) adding to the wait queue, (2) checking the current
> value, and (3) removing from the wait queue if the value doesn't
> match.  Among other things, this is necessary because checking the
> current value cannot be done with a spinlock held.

If my understanding is correct, 2.6 futex does not take any spinlocks,
but a semaphore:

[kernel/futex.c] (from 2.6, RHEL4b2)

 286 static int futex_wake(unsigned long uaddr, int nr_wake)
 287 {
   :
 294         down_read(&current->mm->mmap_sem);
   :
 306                         wake_futex(this);
   :
 314         up_read(&current->mm->mmap_sem);
 315         return ret;
 316 }
   :
 477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
 478 {
   :
 483         down_read(&current->mm->mmap_sem);
   :
 489         queue_me(&q, -1, NULL);
   :
 500         if (curval != val) {
 501                 ret = -EWOULDBLOCK;
 502                 goto out_unqueue;
 503         }
   :
 509         up_read(&current->mm->mmap_sem);
   :
 528         time = schedule_timeout(time);
   :
 536         /* If we were woken (and unqueued), we succeeded, whatever. */
 537         if (!unqueue_me(&q))
 538                 return 0;
 539         if (time == 0)
 540                 return -ETIMEDOUT;
 541         /* A spurious wakeup should never happen. */
 542         WARN_ON(!signal_pending(current));
 543         return -EINTR;
 544
 545 out_unqueue:
 546         /* If we were woken (and unqueued), we succeeded, whatever. */
 547         if (!unqueue_me(&q))
 548                 ret = 0;
 549 out_release_sem:
 550         up_read(&current->mm->mmap_sem);
 551         return ret;
 552 }

This semaphore prevents a waiter which is temporarily queued (to check
the val) from being the target of a wakeup.

So my "[simulation]" is wrong if it runs on 2.6, since wake_Y is never
able to touch the queue while wait_A is in the queue having the val
checked.

(If it is not possible that there are threads which go around with the
same futex/condvar but each have a different mmap_sem,) 2.6 futex is
quite good.
# Next, let's look at the 2.4 futex:

[kernel/futex.c] (from 2.4, RHEL3U2)

 154 static inline int futex_wake(unsigned long uaddr, int offset, int num)
 155 {
   :
 160         lock_futex_mm();
   :
 176                         wake_up_all(&this->waiters);
   :
 185         unlock_futex_mm();
   :
 188         return ret;
 189 }
   :
 310 static inline int futex_wait(unsigned long uaddr,
 311                              int offset,
 312                              int val,
 313                              unsigned long time)
 314 {
   :
 323         lock_futex_mm();
   :
 330         __queue_me(&q, page, uaddr, offset, -1, NULL);
   :
 342         if (curval != val) {
 343                 unlock_futex_mm();
 344                 ret = -EWOULDBLOCK;
 345                 goto out;
 346         }
   :
 357         unlock_futex_mm();
 358         time = schedule_timeout(time);
   :
 365         if (time == 0) {
 366                 ret = -ETIMEDOUT;
 367                 goto out;
 368         }
 369         if (signal_pending(current))
 370                 ret = -EINTR;
 371 out:
 372         /* Were we woken up anyway? */
 373         if (!unqueue_me(&q))
 374                 ret = 0;
 375         put_page(q.page);
 376
 377         return ret;
   :
 383 }

2.4 futex uses spinlocks:

  74 static inline void lock_futex_mm(void)
  75 {
  76         spin_lock(&current->mm->page_table_lock);
  77         spin_lock(&vcache_lock);
  78         spin_lock(&futex_lock);
  79 }
  80
  81 static inline void unlock_futex_mm(void)
  82 {
  83         spin_unlock(&futex_lock);
  84         spin_unlock(&vcache_lock);
  85         spin_unlock(&current->mm->page_table_lock);
  86 }

However, these spinlocks fail to prevent temporarily queued waiters
from wakeups, because the spinlocks are released *before*
unqueue_me(&q) (lines 343 & 373).
So this failure allows wake_Y to touch the queue while wait_A is in it.

Of course, as you know, this brings the bug which I have mentioned.

(I don't know how many distributions carry the 2.4 futex, but)
At least 2.4 futex in RHEL3U2 is buggy.

# I regret that I could not notice this fact earlier.
# I'm sorry...  I hope you'll accept my apology.

Thanks,
H.Seto
* Re: Futex queue_me/get_user ordering

From: Jamie Lokier @ 2004-11-16 14:58 UTC
To: Hidetoshi Seto
Cc: bert hubert, Andrew Morton, linux-kernel, rusty, mingo

Hidetoshi Seto wrote:
> I have to deeply apologize to all for my mistake.
> If my understanding is correct, this bug is "2.4 futex" (RHEL3) *SPECIFIC*!!
> I had swallowed the story that 2.6 futex has the same problem...

Wrong, 2.6 has the same behaviour!

> So I realize that 2.6 futex never behaves like this:
> >> "returns 0 if the futex was not equal to the expected value, but
> >>  the process was woken by a FUTEX_WAKE call."
>
> Update of the manpage is now unnecessary, I think.

It is necessary.

> First of all, I would appreciate it if you could read my old post:
>   "Kernel bug in futex_wait, cause application hang with NPTL"
>   http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

> If my understanding is correct, 2.6 futex does not take any spinlocks,
> but a semaphore:
>
>  286 static int futex_wake(unsigned long uaddr, int nr_wake)
>    :
>  294         down_read(&current->mm->mmap_sem);
>
>  477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
>    :
>  483         down_read(&current->mm->mmap_sem);

> This semaphore prevents a waiter which is temporarily queued (to check
> the val) from being the target of a wakeup.

No, because it's a read-write semaphore, and we do "down_read" on it,
which is a shared lock.  It does not prevent concurrent wake and wait
operations!

The only reason we use this semaphore is to block against vma-changing
operations (like mmap) while we look up the futex key and memory word.

> (If it is not possible that there are threads which go around with the
> same futex/condvar but each have a different mmap_sem,)

Actually it is possible, with process-shared condvars, but it's
irrelevant because down_read doesn't prevent concurrent wakes and
waits.

[About 2.4 futex in RHEL3U2, which takes spinlocks instead]:

> However, these spinlocks fail to prevent temporarily queued waiters
> from wakeups, because the spinlocks are released *before*
> unqueue_me(&q) (lines 343 & 373).
> So this failure allows wake_Y to touch the queue while wait_A is in it.

This order is necessary, because it's not safe to call get_user()
while holding any spinlocks.  It is not a bug in RHEL.

> At least 2.4 futex in RHEL3U2 is buggy.

I don't think it is, because I think the behaviour you'll see with
RHEL3U2 is no different from 2.6, just slower ;)

-- Jamie
* Re: Futex queue_me/get_user ordering

From: Hidetoshi Seto @ 2004-11-18  1:29 UTC
To: Jamie Lokier
Cc: bert hubert, Andrew Morton, linux-kernel, rusty, mingo

Jamie Lokier wrote:
> Hidetoshi Seto wrote:
>
>> I have to deeply apologize to all for my mistake.
>> If my understanding is correct, this bug is "2.4 futex" (RHEL3) *SPECIFIC*!!
>> I had swallowed the story that 2.6 futex has the same problem...
>
> Wrong, 2.6 has the same behaviour!
>
>> So I realize that 2.6 futex never behaves like this:
>>
>>>> "returns 0 if the futex was not equal to the expected value, but
>>>>  the process was woken by a FUTEX_WAKE call."
>>
>> Update of the manpage is now unnecessary, I think.
>
> It is necessary.
>
>> First of all, I would appreciate it if you could read my old post:
>>   "Kernel bug in futex_wait, cause application hang with NPTL"
>>   http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html
>
>> If my understanding is correct, 2.6 futex does not take any spinlocks,
>> but a semaphore:
>>
>>  286 static int futex_wake(unsigned long uaddr, int nr_wake)
>>    :
>>  294         down_read(&current->mm->mmap_sem);
>>
>>  477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
>>    :
>>  483         down_read(&current->mm->mmap_sem);
>
>> This semaphore prevents a waiter which is temporarily queued (to check
>> the val) from being the target of a wakeup.
>
> No, because it's a read-write semaphore, and we do "down_read" on it,
> which is a shared lock.  It does not prevent concurrent wake and wait
> operations!

Aha, yes.  You are right.

> [About 2.4 futex in RHEL3U2, which takes spinlocks instead]:
>
>> However, these spinlocks fail to prevent temporarily queued waiters
>> from wakeups, because the spinlocks are released *before*
>> unqueue_me(&q) (lines 343 & 373).
>> So this failure allows wake_Y to touch the queue while wait_A is in it.
>
> This order is necessary, because it's not safe to call get_user()
> while holding any spinlocks.  It is not a bug in RHEL.

I think 2.4 is fixable.  My original patch for 2.4 was:

/*----- patch begin -----*/

diff -Naur linux-2.4.21-EL3_org/kernel/futex.c linux-2.4.21-EL3/kernel/futex.c
--- linux-2.4.21-EL3_org/kernel/futex.c	2004-08-25 19:47:35.418632860 +0900
+++ linux-2.4.21-EL3/kernel/futex.c	2004-08-25 19:48:32.505546224 +0900
@@ -297,14 +297,20 @@
 	spin_lock(&vcache_lock);
 	spin_lock(&futex_lock);
 
+	ret = __unqueue_me(q);
+	spin_unlock(&futex_lock);
+	spin_unlock(&vcache_lock);
+	return ret;
+}
+
+static inline int __unqueue_me(struct futex_q *q)
+{
 	if (!list_empty(&q->list)) {
 		list_del(&q->list);
 		__detach_vcache(&q->vcache);
-		ret = 1;
+		return 1;
 	}
-	spin_unlock(&futex_lock);
-	spin_unlock(&vcache_lock);
-	return ret;
+	return 0;
 }
 
 static inline int futex_wait(unsigned long uaddr,
@@ -333,13 +339,18 @@
 	 * Page is pinned, but may no longer be in this address space.
 	 * It cannot schedule, so we access it with the spinlock held.
 	 */
-	if (!access_ok(VERIFY_READ, uaddr, 4))
-		goto out_fault;
+	if (!access_ok(VERIFY_READ, uaddr, 4)) {
+		__unqueue_me(&q);
+		unlock_futex_mm();
+		ret = -EFAULT;
+		goto out;
+	}
 	kaddr = kmap_atomic(page, KM_USER0);
 	curval = *(int*)(kaddr + offset);
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (curval != val) {
+		__unqueue_me(&q);
 		unlock_futex_mm();
 		ret = -EWOULDBLOCK;
 		goto out;
@@ -364,22 +375,18 @@
 	 */
 	if (time == 0) {
 		ret = -ETIMEDOUT;
-		goto out;
+		goto out_wait;
 	}
 	if (signal_pending(current))
 		ret = -EINTR;
-out:
+out_wait:
 	/* Were we woken up anyway? */
 	if (!unqueue_me(&q))
 		ret = 0;
+out:
 	put_page(q.page);
 
 	return ret;
-
-out_fault:
-	unlock_futex_mm();
-	ret = -EFAULT;
-	goto out;
 }
 
 long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout,

/*----- patch end -----*/

This patch just reorders the old code in the fault path:

	if (fault) {
		unlock(futex);
		ret = -ERRVAR;
		unqueue();
		put_page();
		return ret;
	}

into the new one:

	if (fault) {
		unqueue_in_lock();
		unlock(futex);
		ret = -ERRVAR;
		put_page();
		return ret;
	}

It protects the temporarily queued thread from wakes, doesn't it?

If this works, it could be said that we can fix the 2.6 futex with a
spinlock... but it will be slow, slow...

Thanks,
H.Seto
* Re: Futex queue_me/get_user ordering

From: Hidetoshi Seto @ 2004-11-15  0:58 UTC
To: Jamie Lokier, mingo
Cc: Andrew Morton, linux-kernel, rusty, ahu

Jamie Lokier wrote:
> Andrew Morton wrote:
>
>> The patch wasn't supposed to optimise anything.  It fixed a bug which
>> was causing hangs.  See
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc1/2.6.10-rc1-mm5/broken-out/futex_wait-fix.patch
>>
>> Or are you saying that userspace is buggy??
>
> I haven't looked at the NPTL code, but that URL's pseudo-code is buggy.
> The call to FUTEX_WAKE should be doing wake++ conditionally on the
> return value, not unconditionally.

(snip)

> So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> report is (because of unconditional wake++), and so is the failure
> example (because it doesn't use a mutex).
>
> -- Jamie

from glibc-2.3.3 (RHEL4b2):

  31 int
  32 __pthread_cond_signal (cond)
  33      pthread_cond_t *cond;
  34 {
  35   /* Make sure we are alone.  */
  36   lll_mutex_lock (cond->__data.__lock);
  37
  38   /* Are there any waiters to be woken?  */
  39   if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
  40     {
  41       /* Yes.  Mark one of them as woken.  */
  42       ++cond->__data.__wakeup_seq;
  43       ++cond->__data.__futex;
  44
  45       /* Wake one.  */
  46       lll_futex_wake (&cond->__data.__futex, 1);
  47     }
  48
  49   /* We are done.  */
  50   lll_mutex_unlock (cond->__data.__lock);
  51
  52   return 0;
  53 }

Ingo, is this buggy?

We should start again with a question:
Is this a kernel bug or an NPTL bug?

Thanks,
H.Seto
* Re: Futex queue_me/get_user ordering

From: Jamie Lokier @ 2004-11-15  2:01 UTC
To: Hidetoshi Seto
Cc: mingo, Andrew Morton, linux-kernel, rusty, ahu

Hidetoshi Seto wrote:
> > So I don't know if NPTL is buggy, but the pseudo-code given in the bug
> > report is (because of unconditional wake++), and so is the failure
> > example (because it doesn't use a mutex).
>
> from glibc-2.3.3 (RHEL4b2):
>
>   31 int
>   32 __pthread_cond_signal (cond)
>   33      pthread_cond_t *cond;
>   34 {
>   35   /* Make sure we are alone.  */
>   36   lll_mutex_lock (cond->__data.__lock);
>   37
>   38   /* Are there any waiters to be woken?  */
>   39   if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
>   40     {
>   41       /* Yes.  Mark one of them as woken.  */
>   42       ++cond->__data.__wakeup_seq;
>   43       ++cond->__data.__futex;
>   44
>   45       /* Wake one.  */
>   46       lll_futex_wake (&cond->__data.__futex, 1);
>   47     }
>   48
>   49   /* We are done.  */
>   50   lll_mutex_unlock (cond->__data.__lock);
>   51
>   52   return 0;
>   53 }
>
> Ingo, is this buggy?
>
> We should start again with a question:
> Is this a kernel bug or an NPTL bug?

Third possibility: your test is buggy.  Do you actually use a mutex in
your test when you call pthread_cond_wait, and does the waker hold it
when it calls pthread_cond_signal?

If you don't use a mutex as you are supposed to with condvars, then it
might not be a kernel or NPTL bug.  I'm not sure whether POSIX defines
the behaviour when you use condvars without a mutex.

If you do use a mutex (and you just didn't mention it), then the code
above is not enough to decide if there's an NPTL bug.  We need to look
at pthread_cond_wait as well, to see how it does the "atomic" wait and
mutex release.

-- Jamie
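(Aside: for contrast, the canonical usage Jamie is asking about - the
standard POSIX condvar idiom, not code from the test in question:)

    /* Waiter: hold the mutex around the predicate check and the wait. */
    pthread_mutex_lock(&mtx);
    while (!predicate)
            pthread_cond_wait(&cond, &mtx);  /* atomically unlocks + waits */
    pthread_mutex_unlock(&mtx);

    /* Waker: update the predicate and signal with the mutex held. */
    pthread_mutex_lock(&mtx);
    predicate = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mtx);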
* Re: Futex queue_me/get_user ordering

From: Hidetoshi Seto @ 2004-11-15  3:06 UTC
To: Jamie Lokier
Cc: mingo, Andrew Morton, linux-kernel, rusty, ahu

Jamie Lokier wrote:
> Third possibility: your test is buggy.  Do you actually use a mutex in
> your test when you call pthread_cond_wait, and does the waker hold it
> when it calls pthread_cond_signal?
>
> If you don't use a mutex as you are supposed to with condvars, then it
> might not be a kernel or NPTL bug.  I'm not sure whether POSIX defines
> the behaviour when you use condvars without a mutex.
>
> If you do use a mutex (and you just didn't mention it), then the code
> above is not enough to decide if there's an NPTL bug.  We need to look
> at pthread_cond_wait as well, to see how it does the "atomic" wait and
> mutex release.
>
> -- Jamie

Now I'm asking our test team about that.

Again, from glibc-2.3.3 (RHEL4b2):

[nptl/sysdeps/pthread/pthread_cond_wait.c]

  85 int
  86 __pthread_cond_wait (cond, mutex)
  87      pthread_cond_t *cond;
  88      pthread_mutex_t *mutex;
  89 {
  90   struct _pthread_cleanup_buffer buffer;
  91   struct _condvar_cleanup_buffer cbuffer;
  92   int err;
  93
  94   /* Make sure we are alone.  */
  95   lll_mutex_lock (cond->__data.__lock);
  96
  97   /* Now we can release the mutex.  */
  98   err = __pthread_mutex_unlock_usercnt (mutex, 0);
  99   if (__builtin_expect (err, 0))
 100     {
 101       lll_mutex_unlock (cond->__data.__lock);
 102       return err;
 103     }
 104
 105   /* We have one new user of the condvar.  */
 106   ++cond->__data.__total_seq;
 107   ++cond->__data.__futex;
 108   cond->__data.__nwaiters += 1 << COND_CLOCK_BITS;
 109
 110   /* Remember the mutex we are using here.  If there is already a
 111      different address store this is a bad user bug.  Do not store
 112      anything for pshared condvars.  */
 113   if (cond->__data.__mutex != (void *) ~0l)
 114     cond->__data.__mutex = mutex;
 115
 116   /* Prepare structure passed to cancellation handler.  */
 117   cbuffer.cond = cond;
 118   cbuffer.mutex = mutex;
 119
 120   /* Before we block we enable cancellation.  Therefore we have to
 121      install a cancellation handler.  */
 122   __pthread_cleanup_push (&buffer, __condvar_cleanup, &cbuffer);
 123
 124   /* The current values of the wakeup counter.  The "woken" counter
 125      must exceed this value.  */
 126   unsigned long long int val;
 127   unsigned long long int seq;
 128   val = seq = cond->__data.__wakeup_seq;
 129   /* Remember the broadcast counter.  */
 130   cbuffer.bc_seq = cond->__data.__broadcast_seq;
 131
 132   do
 133     {
 134       unsigned int futex_val = cond->__data.__futex;
 135
 136       /* Prepare to wait.  Release the condvar futex.  */
 137       lll_mutex_unlock (cond->__data.__lock);
 138
 139       /* Enable asynchronous cancellation.  Required by the standard.  */
 140       cbuffer.oldtype = __pthread_enable_asynccancel ();
 141
 142       /* Wait until woken by signal or broadcast.  */
 143       lll_futex_wait (&cond->__data.__futex, futex_val);
 144
 145       /* Disable asynchronous cancellation.  */
 146       __pthread_disable_asynccancel (cbuffer.oldtype);
 147
 148       /* We are going to look at shared data again, so get the lock.  */
 149       lll_mutex_lock (cond->__data.__lock);
 150
 151       /* If a broadcast happened, we are done.  */
 152       if (cbuffer.bc_seq != cond->__data.__broadcast_seq)
 153         goto bc_out;
 154
 155       /* Check whether we are eligible for wakeup.  */
 156       val = cond->__data.__wakeup_seq;
 157     }
 158   while (val == seq || cond->__data.__woken_seq == val);
 159
 160   /* Another thread woken up.  */
 161   ++cond->__data.__woken_seq;
 162
 163  bc_out:
 164
 165   cond->__data.__nwaiters -= 1 << COND_CLOCK_BITS;
 166
 167   /* If pthread_cond_destroy was called on this variable already,
 168      notify the pthread_cond_destroy caller all waiters have left
 169      and it can be successfully destroyed.  */
 170   if (cond->__data.__total_seq == -1ULL
 171       && cond->__data.__nwaiters < (1 << COND_CLOCK_BITS))
 172     lll_futex_wake (&cond->__data.__nwaiters, 1);
 173
 174   /* We are done with the condvar.  */
 175   lll_mutex_unlock (cond->__data.__lock);
 176
 177   /* The cancellation handling is back to normal, remove the handler.  */
 178   __pthread_cleanup_pop (&buffer, 0);
 179
 180   /* Get the mutex before returning.  */
 181   return __pthread_mutex_cond_lock (mutex);
 182 }

I'm not sure, but it seems that the pseudo-code could be:

(mutex must be locked before calling pthread_cond_wait.)

-A01 pthread_cond_wait {
+A01 pthread_cond_wait (futex,mutex) {
+A0*	mutex_unlock(mutex);
 A02	timeout = 0;
 A03	lock(counters);
 A04	total++;
 A05	val = get_from(futex);
 A06	unlock(counters);
 A07
 A08	sys_futex(futex, FUTEX_WAIT, val, timeout);
 A09
 A10	lock(counters);
 A11	woken++;
 A12	unlock(counters);
+A1*	mutex_lock(mutex);
 A13 }

(and it's better to rename the variable "futex" to "cond".)

Is it possible that NPTL shuts the window between mutex_unlock() and
the actual queueing in futex_wait?

Thanks,
H.Seto
* Re: Futex queue_me/get_user ordering

From: Jamie Lokier @ 2004-11-15 13:22 UTC
To: Hidetoshi Seto
Cc: mingo, Andrew Morton, linux-kernel, rusty, ahu

Hidetoshi Seto wrote:
> I'm not sure, but it seems that the pseudo-code could be:
>
> (mutex must be locked before calling pthread_cond_wait.)
> -A01 pthread_cond_wait {
> +A01 pthread_cond_wait (futex,mutex) {
> +A0*	mutex_unlock(mutex);
>  A02	timeout = 0;
>  A03	lock(counters);

No, it is:

> -A01 pthread_cond_wait {
> +A01 pthread_cond_wait (futex,mutex) {
>  A02	timeout = 0;
>  A03	lock(counters);
> +A0*	mutex_unlock(mutex);

An important difference!

However, I must humbly apologise.  Having studied your failure case
more, I see that the problems you observe can occur even if you do
take the mutex properly.

Consider 4 threads, doing these in parallel (the same as your example
but explicitly mentioning the mutex):

	wait_A: lock mutex; pthread_cond_wait(cond, mutex); unlock mutex
	wake_X: lock mutex; pthread_cond_signal(cond); unlock mutex
	wait_B: lock mutex; pthread_cond_wait(cond, mutex); unlock mutex
	wake_Y: lock mutex; pthread_cond_signal(cond); unlock mutex

Then with the code you have posted, it is possible to see the sequence
of events which you described.  The observed problems are:

   1. A lost wakeup.

      wait_A is woken, but wait_B is not, even though the second
      pthread_cond_signal is "observably" after wait_B.

      The operation order is observable in the sense that wait_B could
      update the data structure which is protected by cond+mutex, and
      wake_Y could read that update prior to deciding to signal.

      _Logically_, what happens is that wait_A is woken by wake_X, but
      it does not immediately re-acquire the mutex.  In this time
      window, wait_B and wake_Y both run, and then wait_A acquires the
      mutex.  During this window, wait_A is able to absorb the second
      signal.

      It's not clear to me if POSIX requires wait_B to be signalled or
      not in this case.

   2. Future lost wakeups.

      Future calls to pthread_cond_signal(cond) fail to wake wait_B,
      even much later, because cond's NPTL data structure is
      inconsistent.  Its invariant is broken.

      This is a bug in NPTL and it's easy to fix.  Never increment wake
      unconditionally.  Instead, increment it conditionally when (a)
      FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns
      -EAGAIN.

Both these problems are possible, in exactly the way you described,
even if you do take the mutex properly.

Unfortunately, the kernel patch you tried does not fix these lost
wakeups (in addition to other problems it causes).

This is the sequence which fails (I've added numbers):

> 1. wait_A: calls pthread_cond_wait:
>      total++, prepare to call FUTEX_WAIT with val=0.
>      # status: [1/0/0] (0) queue={}(empty) #
>
> 2. wake_X: calls pthread_cond_signal:
>      no one in waitqueue, just wake++ and update futex val.
>      # status: [1/1/0] (1) queue={}(empty) #
>
> 3. wait_B: calls pthread_cond_wait:
>      total++, prepare to call FUTEX_WAIT with val=1.
>      # status: [2/1/0] (1) queue={}(empty) #
>
> 4. wait_A: calls FUTEX_WAIT with val=0:
>      after queueing, compare val.  0!=1 ... this should be blocked...
>      # status: [2/1/0] (1) queue={A} #
>
> 5. wait_B: calls FUTEX_WAIT with val=1:
>      after queueing, compare val.  1==1 ... OK, let's schedule()...
>      # status: [2/1/0] (1) queue={A,B} (B=sleeping) #
>
> 6. wake_Y: calls pthread_cond_signal:
>      A is in waitqueue ... dequeue A, wake++ and update futex val.
>      # status: [2/2/0] (2) queue={B} (B=sleeping) #
>
> 7. wait_A: end of FUTEX_WAIT with val=0:
>      try to dequeue but already dequeued, return anyway.
>      # status: [2/2/0] (2) queue={B} (B=sleeping) #
>
> 8. wait_A: end of pthread_cond_wait:
>      woken++.
>      # status: [2/2/1] (2) queue={B} (B=sleeping) #
>
> This is a bug:
>      wait_A: wakeup
>      wait_B: sleeping
>      wake_X: wake A
>      wake_Y: wake A again
>
> if a subsequent wake_Z tries to wake B:
>
> wake_Z: calls pthread_cond_signal:
>      since total==wake, do nothing.
>      # status: [2/2/1] (2) queue={B} (B=sleeping) #
>
> If wait_C comes, B becomes able to be woken, but C...

With your kernel patch, the above sequence is prevented.
Unfortunately, a very similar sequence shows the same problems.  I
think the reason you do not see them in testing is that the timing is
changed.

This is the sequence, very similar to your example, which fails in the
same way with your kernel patch:

  1. wait_A: calls pthread_cond_wait:
       total++, prepare to call FUTEX_WAIT with val=0.
+      calls FUTEX_WAIT with val=0.
+      inside futex_wait(), kernel compares val.  0==0, not yet queued.
       # status: [1/0/0] (0) queue={}(empty) #

  2. wake_X: calls pthread_cond_signal:
       no one in waitqueue, just wake++ and update futex val.
       # status: [1/1/0] (1) queue={}(empty) #

  3. wait_B: calls pthread_cond_wait:
       total++, prepare to call FUTEX_WAIT with val=1.
       # status: [2/1/0] (1) queue={}(empty) #

- 4. wait_A: calls FUTEX_WAIT with val=0:
-      after queueing, compare val.  0!=1 ... this should be blocked...
+ 4. wait_A: inside futex_wait(), already compared val, and will block:
+      calls queue_me()... then schedule()...
       # status: [2/1/0] (1) queue={A} #

  5. wait_B: calls FUTEX_WAIT with val=1:
       after queueing, compare val.  1==1 ... OK, let's schedule()...
       # status: [2/1/0] (1) queue={A,B} (B=sleeping) #

  6. wake_Y: calls pthread_cond_signal:
       A is in waitqueue ... dequeue A, wake++ and update futex val.
       # status: [2/2/0] (2) queue={B} (B=sleeping) #

  7. wait_A: end of FUTEX_WAIT with val=0:
-      try to dequeue but already dequeued, return anyway.
+      woken, return.
       # status: [2/2/0] (2) queue={B} (B=sleeping) #

  8. wait_A: end of pthread_cond_wait:
       woken++.
       # status: [2/2/1] (2) queue={B} (B=sleeping) #

I hope that explains why this is not fixed by changing the order of
operations in the kernel.

The problem of a wakeup being lost during many concurrent operations
is not easy to fix.  However, the most important property is that at
least one waiter is running and has the mutex at the end of all the
concurrent operations.  That property is satisfied.  So first it is
important to know if this specific lost wakeup is really a bug, or if
it is acceptable POSIX behaviour.

The problem of multiple future wakeups being lost is easy to fix in
NPTL, by updating the state conditionally on the return values from
FUTEX_WAKE and FUTEX_WAIT instead of ignoring the return values.

-- Jamie
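(Aside: in the pseudo-code notation used earlier in the thread, the
conditional accounting Jamie proposes would look roughly like this.
This is a sketch of the proposal only - whether it is actually
sufficient is exactly what Jakub questions below:)

    /* pthread_cond_signal: count only wakeups that reached a waiter. */
    lock(counters);
    if (total > wake) {
            futex_val++;                       /* still publish the change  */
            if (sys_futex(&futex_val, FUTEX_WAKE, 1) > 0)
                    wake++;                    /* (a) conditional on return */
    }
    unlock(counters);

    /* pthread_cond_wait: -EAGAIN means a signal raced in before we
       could sleep; account for the wakeup we effectively consumed. */
    ret = sys_futex(&futex_val, FUTEX_WAIT, val);
    lock(counters);
    if (ret == -EAGAIN)
            wake++;                            /* (b) per the suggestion */
    woken++;
    unlock(counters);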
* Re: Futex queue_me/get_user ordering

From: Jakub Jelinek @ 2004-11-17  8:47 UTC
To: Jamie Lokier
Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, drepper

On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
>    1. A lost wakeup.
>
>       wait_A is woken, but wait_B is not, even though the second
>       pthread_cond_signal is "observably" after wait_B.
>
>       The operation order is observable in the sense that wait_B could
>       update the data structure which is protected by cond+mutex, and
>       wake_Y could read that update prior to deciding to signal.
>
>       _Logically_, what happens is that wait_A is woken by wake_X, but
>       it does not immediately re-acquire the mutex.  In this time
>       window, wait_B and wake_Y both run, and then wait_A acquires the
>       mutex.  During this window, wait_A is able to absorb the second
>       signal.
>
>       It's not clear to me if POSIX requires wait_B to be signalled or
>       not in this case.
>
>    2. Future lost wakeups.
>
>       Future calls to pthread_cond_signal(cond) fail to wake wait_B,
>       even much later, because cond's NPTL data structure is
>       inconsistent.  Its invariant is broken.
>
>       This is a bug in NPTL and it's easy to fix.  Never increment wake
>       unconditionally.  Instead, increment it conditionally when (a)
>       FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns
>       -EAGAIN.

If you think it is fixable in userland, please write at least the
pseudo-code that you believe should work.  We have spent quite a lot
of time on that code and don't believe this is solvable in userland.
E.g. the futex IMHO must be incremented before FUTEX_WAKE, as
otherwise the woken tasks wouldn't see the effect.

I believe the only place this is solvable is the kernel, by ensuring
atomicity (i.e. making the queue-the-task-iff-curval == expected_val
operation atomic wrt. futex_wake/futex_requeue in other tasks).

In the RHEL3 2.4.x backport this is easy, as a spinlock is held around
the user access (the page is first pinned, then the lock taken, then
the value compared (that access is guaranteed to be non-blocking) and,
if equal, the task queued; then unlocked and unpinned).

In 2.6.x this is harder if the kernel cannot allow some spinlock to be
taken while doing the user access, but I guess the kernel needs to
cope with this, e.g. by queueing the task early but marking it as only
tentatively queued.  If futex_wake sees such a bit, it would wait
until futex_wait notifies it that it has decided whether that one
should be queued or not.  Or something else, whatever, as long as the
right semantics is ensured.

Just FYI, the current pseudo-code is (not mentioning the cancellation
stuff here, nor the code/data to deal with pthread_cond_destroy
semantics, timedwait and pshared condvars):

typedef struct {
	int lock, futex;
	uint64_t total_seq, wakeup_seq, woken_seq;
	void *mutex;
	uint32_t broadcast_seq;
} pthread_cond_t;

pthread_cond_signal (cond)
{
	mutex_lock (lock);
	if (total_seq > wakeup_seq) {
		++wakeup_seq, ++futex;
		futex (&futex, FUTEX_WAKE, 1);
	}
	mutex_unlock (lock);
}

pthread_cond_wait (cond, mtx)
{
	mutex_lock (lock);
	mutex_unlock (mtx->lock);
	++total_seq;
	++futex;
	mutex = mtx;
	bc_seq = broadcast_seq;
	seq = wakeup_seq;
	do {
		val = futex;
		mutex_unlock (lock);
		futex (&futex, FUTEX_WAIT, val);
		mutex_lock (lock);
		if (bc_seq != broadcast_seq)
			goto out;
	} while (wakeup_seq == seq || woken_seq == wakeup_seq);
	++woken_seq;
out:
	mutex_unlock (lock);
	mutex_lock (mtx->lock);
}

pthread_cond_broadcast (cond)
{
	mutex_lock (lock);
	if (total_seq > wakeup_seq) {
		woken_seq = wakeup_seq = total_seq;
		futex = 2 * total_seq;
		++broadcast_seq;
		val = futex;
		mutex_unlock (lock);
		if (futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX,
			   &mutex->lock, val) < 0)
			futex (&futex, FUTEX_WAKE, INT_MAX);
		return;
	}
	mutex_unlock (lock);
}

	Jakub
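(Aside: the broadcast path above relies on the FUTEX_CMP_REQUEUE
contract; the pseudo-code below is a paraphrase of the kernel-side
behaviour, not the real implementation:)

    /* futex(uaddr, FUTEX_CMP_REQUEUE, nr_wake, nr_requeue, uaddr2, val) */
    if (*uaddr != val)
            return -EAGAIN;  /* word changed underneath the caller, who
                                retries - above, by falling back to a
                                plain FUTEX_WAKE of INT_MAX waiters */
    /* Otherwise: wake up to nr_wake waiters queued on uaddr, and move
       up to nr_requeue of the remaining waiters onto uaddr2's queue
       (here the mutex), so they are woken one at a time by mutex
       unlocks instead of all at once - avoiding a thundering herd. */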
* Re: Futex queue_me/get_user ordering

From: Hidetoshi Seto @ 2004-11-18  2:10 UTC
To: Jakub Jelinek
Cc: Jamie Lokier, mingo, Andrew Morton, linux-kernel, rusty, ahu, drepper

Jakub Jelinek wrote:
> On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
>
>>    1. A lost wakeup.
>>
>>       wait_A is woken, but wait_B is not, even though the second
>>       pthread_cond_signal is "observably" after wait_B.
>>
>>       The operation order is observable in the sense that wait_B could
>>       update the data structure which is protected by cond+mutex, and
>>       wake_Y could read that update prior to deciding to signal.
>>
>>       _Logically_, what happens is that wait_A is woken by wake_X, but
>>       it does not immediately re-acquire the mutex.  In this time
>>       window, wait_B and wake_Y both run, and then wait_A acquires the
>>       mutex.  During this window, wait_A is able to absorb the second
>>       signal.
>>
>>       It's not clear to me if POSIX requires wait_B to be signalled or
>>       not in this case.
>>
>>    2. Future lost wakeups.
>>
>>       Future calls to pthread_cond_signal(cond) fail to wake wait_B,
>>       even much later, because cond's NPTL data structure is
>>       inconsistent.  Its invariant is broken.
>>
>>       This is a bug in NPTL and it's easy to fix.  Never increment wake
>>       unconditionally.  Instead, increment it conditionally when (a)
>>       FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns
>>       -EAGAIN.
>
> If you think it is fixable in userland, please write at least the
> pseudo-code that you believe should work.  We have spent quite a lot
> of time on that code and don't believe this is solvable in userland.
> E.g. the futex IMHO must be incremented before FUTEX_WAKE, as
> otherwise the woken tasks wouldn't see the effect.
>
> I believe the only place this is solvable is the kernel, by ensuring
> atomicity (i.e. making the queue-the-task-iff-curval == expected_val
> operation atomic wrt. futex_wake/futex_requeue in other tasks).

I agree.  I think this is a kernel problem.

Even if it is possible to avoid this problem by tricks in userland, I
think it is ugly that threads whose val does not match could be woken,
i.e.:

>>>> "returns 0 if the futex was not equal to the expected value, but
>>>>  the process was woken by a FUTEX_WAKE call."

So still, the update of the manpage is unnecessary, I think.

Thanks,
H.Seto
* Re: Futex queue_me/get_user ordering

From: Jamie Lokier @ 2004-11-18  7:20 UTC
To: Jakub Jelinek
Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, drepper

Jakub Jelinek wrote:
> On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
> >    1. A lost wakeup.
> >
> >       wait_A is woken, but wait_B is not, even though the second
> >       pthread_cond_signal is "observably" after wait_B.
> >
> >       The operation order is observable in the sense that wait_B could
> >       update the data structure which is protected by cond+mutex, and
> >       wake_Y could read that update prior to deciding to signal.
> >
> >       _Logically_, what happens is that wait_A is woken by wake_X, but
> >       it does not immediately re-acquire the mutex.  In this time
> >       window, wait_B and wake_Y both run, and then wait_A acquires the
> >       mutex.  During this window, wait_A is able to absorb the second
> >       signal.
> >
> >       It's not clear to me if POSIX requires wait_B to be signalled or
> >       not in this case.
> >
> >    2. Future lost wakeups.
> >
> >       Future calls to pthread_cond_signal(cond) fail to wake wait_B,
> >       even much later, because cond's NPTL data structure is
> >       inconsistent.  Its invariant is broken.
> >
> >       This is a bug in NPTL and it's easy to fix.  Never increment wake
> >       unconditionally.  Instead, increment it conditionally when (a)
> >       FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns
> >       -EAGAIN.
>
> If you think it is fixable in userland, please write at least the
> pseudo-code that you believe should work.  We have spent quite a lot
> of time on that code and don't believe this is solvable in userland.
> E.g. the futex IMHO must be incremented before FUTEX_WAKE, as
> otherwise the woken tasks wouldn't see the effect.

Do you have an answer for whether the behaviour of (a) is a bug or
not?  I don't know if it's a bug, or if that part of NPTL behaviour is
acceptable under POSIX.  Even if it's acceptable, you might decide
it's not acceptable quality to do that.

That answer affects my answer.

> I believe the only place this is solvable is the kernel, by ensuring
> atomicity (i.e. making the queue-the-task-iff-curval == expected_val
> operation atomic wrt. futex_wake/futex_requeue in other tasks).

I think it's solvable in userspace.

I have a solution, but I'm tired and will send it tomorrow.  This is
just to let you know I'm looking at the problem.

-- Jamie
* Re: Futex queue_me/get_user ordering

From: Jakub Jelinek @ 2004-11-18 19:47 UTC
To: Jamie Lokier
Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu,
    Ulrich Drepper, Roland McGrath

On Thu, Nov 18, 2004 at 07:20:58AM +0000, Jamie Lokier wrote:
> Do you have an answer for whether the behaviour of (a) is a bug or
> not?  I don't know if it's a bug, or if that part of NPTL behaviour is
> acceptable under POSIX.  Even if it's acceptable, you might decide
> it's not acceptable quality to do that.

Not sure what you mean by (a) there, so assuming you meant 1.

If pthread_cond_{signal,broadcast} is called with the condvar's
associated mutex held, then the standard is pretty clear about when a
thread is considered blocked in pthread_cond_*wait on the condvar, as
releasing the mutex and getting blocked on the condvar in
pthread_cond_*wait shall be observed as atomic by other threads.  If
pthread_cond_{signal,broadcast} is called without the mutex held, it
is not that clear.

Anyway, pthread_cond_signal is supposed to wake at least one thread
blocked in pthread_cond_*wait (if there are any).

The scenario described in futex_wait-fix.patch IMHO can happen even if
all calls to pthread_cond_signal are done with the mutex held around
them, i.e. (threads A, B, X and Y; state shown as
[total/wakeup/woken] (futex val) {waitqueue}):

  A: pthread_mutex_lock (&mtx);
  A: pthread_cond_wait (&cv, &mtx);
  A: - mtx release *)
  A:   total++                                  [1/0/0] (0) {}
  X: pthread_mutex_lock (&mtx);
  X: pthread_cond_signal (&cv);
  X: - wake++                                   [1/1/0] (1) {}
  X:   FUTEX_WAKE, 1 (returns, nothing is queued)
  X: pthread_mutex_unlock (&mtx);
  B: pthread_mutex_lock (&mtx);
  B: pthread_cond_wait (&cv, &mtx);
  B: - mtx release *)
  B:   total++                                  [2/1/0] (1) {}
  A:   FUTEX_WAIT, 0
  A:   queue_me                                 [2/1/0] (1) {A}
  A:   0 != 1
  B:   FUTEX_WAIT, 1
  B:   queue_me                                 [2/1/0] (1) {A,B}
  B:   1 == 1
  Y: pthread_mutex_lock (&mtx);
  Y: pthread_cond_signal (&cv);
  Y: - wake++                                   [2/2/0] (2) {A,B}
  Y:   FUTEX_WAKE, 1 (unqueues incorrectly A)   [2/2/0] (2) {B}
  Y: pthread_mutex_unlock (&mtx);
  A:   try to dequeue, but already dequeued;
  A:   would normally return EWOULDBLOCK here,
  A:   but as unqueue_me failed, returns 0
  A:   woken++                                  [2/2/1] (2) {B}
  B:   schedule_timeout (forever)
  A: - mtx reacquire
  A: pthread_cond_wait returns
  A: pthread_mutex_unlock (&mtx);
  -------------------
  B: the code would like to do pthread_mutex_unlock (&mtx)
     and pthread_exit here, but never reaches there.

Now, if at this point say A pthread_join's B, Y pthread_join's A and X
pthread_join's Y, the program should eventually finish, as B must have
been woken up according to the standard.

Whether the signal in X means pthread_cond_wait in A returning first
or pthread_cond_wait in B returning first is, I believe, not defined
unless a special scheduling policy is used, as both A and B are
supposed to contend for the mtx lock.  But I believe both A and B must
be awakened, assuming no other thread attempts to acquire mtx
afterwards.

*) therefore other threads that acquire mtx can now consider A blocked
   on the condvar

	Jakub
* Re: Futex queue_me/get_user ordering 2004-11-18 19:47 ` Jakub Jelinek @ 2005-03-17 10:26 ` Jakub Jelinek 2005-03-17 15:20 ` Jamie Lokier 0 siblings, 1 reply; 28+ messages in thread From: Jakub Jelinek @ 2005-03-17 10:26 UTC (permalink / raw) To: Jamie Lokier Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, Ulrich Drepper, Roland McGrath On Thu, Nov 18, 2004 at 02:47:26PM -0500, Jakub Jelinek wrote: > The scenario described in futex_wait-fix.patch IMHO can happen even > if all calls to pthread_cond_signal are done with mutex held around it, i.e. > A B X Y > pthread_mutex_lock (&mtx); > pthread_cond_wait (&cv, &mtx); > - mtx release *) > total++ [1/0/0] (0) {} > pthread_mutex_lock (&mtx); > pthread_cond_signal (&cv); > - wake++ [1/1/0] (1) {} > FUTEX_WAKE, 1 (returns, nothing is queued) > pthread_mutex_unlock (&mtx); > pthread_mutex_lock (&mtx); > pthread_cond_wait (&cv, &mtx); > - mtx release *) > total++ [2/1/0] (1) {} > FUTEX_WAIT, 0 > queue_me [2/1/0] (1) {A} > 0 != 1 > FUTEX_WAIT, 1 > queue_me [2/1/0] (1) {A,B} > 1 == 1 > pthread_mutex_lock (&mtx); > pthread_cond_signal (&cv); > - wake++ [2/2/0] (2) {A,B} > FUTEX_WAKE, 1 (unqueues incorrectly A) > [2/2/0] (2) {B} > pthread_mutex_unlock (&mtx); > try to dequeue but already dequeued > would normally return EWOULDBLOCK here > but as unqueue_me failed, returns 0 > woken++ [2/2/1] (2) {B} > schedule_timeout (forever) > - mtx reacquire > pthread_cond_wait returns > pthread_mutex_unlock (&mtx); > > ------------------- > the code would like to say pthread_mutex_unlock (&mtx); > and pthread_exit here, but never reaches there. ... http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html Your argument in November was that you don't want to slow down the kernel and that userland must be able to cope with the non-atomicity of futex syscall. But with the recent changes to futex.c I think kernel can ensure atomicity for free. With get_futex_value_locked doing the user access in_atomic () and repeating if that failed, I think it would be just a matter of something as in the patch below (totally untested though). It would simplify requeue implementation (getting rid of the nqueued field), as well as never enqueue a futex in futex_wait until the *uaddr == val uaccess check has shown it should be enqueued. And I don't think the kernel will be any slower because of that, in the common case where get_futex_value_locked does not cause a mm fault (userland typically accessed that memory a few cycles before the syscall), the futex_wait change is just about doing first half of queue_me before the user access and second half after it. --- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.000000000 -0500 +++ linux-2.6.11/kernel/futex.c 2005-03-17 05:13:45.000000000 -0500 @@ -97,7 +97,6 @@ struct futex_q { */ struct futex_hash_bucket { spinlock_t lock; - unsigned int nqueued; struct list_head chain; }; @@ -265,7 +264,6 @@ static inline int get_futex_value_locked inc_preempt_count(); ret = __copy_from_user_inatomic(dest, from, sizeof(int)); dec_preempt_count(); - preempt_check_resched(); return ret ? 
-EFAULT : 0; } @@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u struct list_head *head1; struct futex_q *this, *next; int ret, drop_count = 0; - unsigned int nqueued; retry: down_read(¤t->mm->mmap_sem); @@ -354,23 +351,24 @@ static int futex_requeue(unsigned long u bh1 = hash_futex(&key1); bh2 = hash_futex(&key2); - nqueued = bh1->nqueued; + if (bh1 < bh2) + spin_lock(&bh1->lock); + spin_lock(&bh2->lock); + if (bh1 > bh2) + spin_lock(&bh1->lock); + if (likely(valp != NULL)) { int curval; - /* In order to avoid doing get_user while - holding bh1->lock and bh2->lock, nqueued - (monotonically increasing field) must be first - read, then *uaddr1 fetched from userland and - after acquiring lock nqueued field compared with - the stored value. The smp_mb () below - makes sure that bh1->nqueued is read from memory - before *uaddr1. */ - smp_mb(); - ret = get_futex_value_locked(&curval, (int __user *)uaddr1); if (unlikely(ret)) { + spin_unlock(&bh1->lock); + if (bh1 != bh2) + spin_unlock(&bh2->lock); + + preempt_check_resched(); + /* If we would have faulted, release mmap_sem, fault * it in and start all over again. */ @@ -385,21 +383,10 @@ static int futex_requeue(unsigned long u } if (curval != *valp) { ret = -EAGAIN; - goto out; + goto out_unlock; } } - if (bh1 < bh2) - spin_lock(&bh1->lock); - spin_lock(&bh2->lock); - if (bh1 > bh2) - spin_lock(&bh1->lock); - - if (unlikely(nqueued != bh1->nqueued && valp != NULL)) { - ret = -EAGAIN; - goto out_unlock; - } - head1 = &bh1->chain; list_for_each_entry_safe(this, next, head1, list) { if (!match_futex (&this->key, &key1)) @@ -435,13 +422,9 @@ out: return ret; } -/* - * queue_me and unqueue_me must be called as a pair, each - * exactly once. They are called with the hashed spinlock held. - */ - /* The key must be already stored in q->key. */ -static void queue_me(struct futex_q *q, int fd, struct file *filp) +static inline struct futex_hash_bucket * +queue_lock(struct futex_q *q, int fd, struct file *filp) { struct futex_hash_bucket *bh; @@ -455,11 +438,35 @@ static void queue_me(struct futex_q *q, q->lock_ptr = &bh->lock; spin_lock(&bh->lock); - bh->nqueued++; + return bh; +} + +static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh) +{ list_add_tail(&q->list, &bh->chain); spin_unlock(&bh->lock); } +static inline void +queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh) +{ + spin_unlock(&bh->lock); + drop_key_refs(&q->key); +} + +/* + * queue_me and unqueue_me must be called as a pair, each + * exactly once. They are called with the hashed spinlock held. + */ + +/* The key must be already stored in q->key. */ +static void queue_me(struct futex_q *q, int fd, struct file *filp) +{ + struct futex_hash_bucket *bh; + bh = queue_lock(q, fd, filp); + __queue_me(q, bh); +} + /* Return 1 if we were still queued (ie. 0 means we were woken) */ static int unqueue_me(struct futex_q *q) { @@ -503,6 +510,7 @@ static int futex_wait(unsigned long uadd DECLARE_WAITQUEUE(wait, current); int ret, curval; struct futex_q q; + struct futex_hash_bucket *bh; retry: down_read(¤t->mm->mmap_sem); @@ -511,7 +519,7 @@ static int futex_wait(unsigned long uadd if (unlikely(ret != 0)) goto out_release_sem; - queue_me(&q, -1, NULL); + bh = queue_lock(&q, -1, NULL); /* * Access the page AFTER the futex is queued. 
@@ -537,14 +545,15 @@ static int futex_wait(unsigned long uadd ret = get_futex_value_locked(&curval, (int __user *)uaddr); if (unlikely(ret)) { + queue_unlock(&q, bh); + + preempt_check_resched(); + /* If we would have faulted, release mmap_sem, fault it in and * start all over again. */ up_read(¤t->mm->mmap_sem); - if (!unqueue_me(&q)) /* There's a chance we got woken already */ - return 0; - ret = get_user(curval, (int __user *)uaddr); if (!ret) @@ -553,9 +562,15 @@ static int futex_wait(unsigned long uadd } if (curval != val) { ret = -EWOULDBLOCK; - goto out_unqueue; + queue_unlock(&q, bh); + preempt_check_resched(); + goto out_release_sem; } + /* Only actually queue if *uaddr contained val. */ + __queue_me(&q, bh); + preempt_check_resched(); + /* * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. Jakub ^ permalink raw reply [flat|nested] 28+ messages in thread
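[ The bh1/bh2 double spin_lock in the patch above uses the standard address-ordering discipline to avoid ABBA deadlock between two hash buckets. A minimal userspace sketch of the same discipline, using pthread mutexes in place of kernel spinlocks; the lock_two/unlock_two names are illustrative, not from the patch:

    #include <pthread.h>

    /* Lock two mutexes without risking ABBA deadlock: always take the
     * lower-addressed one first, and take only one lock when both
     * pointers alias the same object (the bh1 == bh2 case above,
     * where both futexes hash to the same bucket). */
    static void lock_two(pthread_mutex_t *a, pthread_mutex_t *b)
    {
            if (a == b) {
                    pthread_mutex_lock(a);
                    return;
            }
            if (a < b) {
                    pthread_mutex_lock(a);
                    pthread_mutex_lock(b);
            } else {
                    pthread_mutex_lock(b);
                    pthread_mutex_lock(a);
            }
    }

    static void unlock_two(pthread_mutex_t *a, pthread_mutex_t *b)
    {
            pthread_mutex_unlock(a);
            if (a != b)
                    pthread_mutex_unlock(b);
    }

Because every caller takes the two locks in the same global order (by address), no cycle of lock-waiters can form. ]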
* Re: Futex queue_me/get_user ordering
  2005-03-17 10:26 ` Jakub Jelinek
@ 2005-03-17 15:20 ` Jamie Lokier
  2005-03-17 15:55   ` Jakub Jelinek
  2005-03-18 16:53   ` Jakub Jelinek
  0 siblings, 2 replies; 28+ messages in thread
From: Jamie Lokier @ 2005-03-17 15:20 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, Ulrich Drepper, Roland McGrath, Scott Snyder

Jakub Jelinek wrote:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html
>
> Your argument in November was that you don't want to slow down the
> kernel and that userland must be able to cope with the
> non-atomicity of futex syscall.

Those were two of them. But my other main concern is conceptual.

Right now, a futex_wait call is roughly equivalent to add_wait_queue, which is quite versatile. It means anything you can do with one futex, you can extend to multiple futexes (e.g. waiting on more than one lock), and you can do asynchronously (e.g. futex_wait can be implemented in userspace as futex_fd[1] + poll[2], and therefore things like poll-driven state machines where one of the state machines wants to wait on a lock are possible).

[1] Ulrich was mistaken in his paper to say futex_fd needs to check a word to be useful; userspace is supposed to check the word after futex_fd and before polling or waiting on it. This is more useful because it extends to multiple futexes.

[2] Actually it can't right now because of a flaw in futex_fd's poll function, but that could be fixed. The _principle_ is sound.

If you change futex_wait to be "atomic", and then have userspace locks which _depend_ on that atomicity, it becomes impossible to wait on multiple of those locks, or make poll-driven state machines which can wait on those locks.

There are applications and libraries which use futex, not just for threading but for things like database locks in files. You can do userspace threading and simulate most blocking system calls by making them non-blocking and using poll().

(I'm not saying anything against NPTL by this, by the way - NPTL is a very good general purpose library - but there are occasions when an application wants to do its own equivalent of simulated blocking system calls for one reason or another. My favourite is research into inter-thread JIT-optimisation in an environment like valgrind.)

Right now, in principle, futex_wait is among the system calls which can be simulated by making it non-blocking (= futex_fd) and using poll()[2]. Which means programs using futex themselves can be subject to interesting thread optimisations by code which knows nothing about the program (similar to valgrind..)

If you change futex_wait to be "atomic", then it would be _impossible_ to take some random 3rd-party library which is using that futex_wait, and convert its blocking system calls to use poll-driven state machines instead.

I think taking that away would be a great conceptual loss. It's not a _huge_ loss, but considering it's only Glibc which is demanding this, and futexes have another property, token-passing, which Glibc could be using instead - why not use it?

That said, let's look at your patch.
> It would simplify requeue implementation (getting rid of the nqueued > field), The change to FUTEX_REQUEUE2 is an improvement :) nqueued is an abomination, like the rest of FUTEX_REQUEUE2 :) > @@ -265,7 +264,6 @@ static inline int get_futex_value_locked > inc_preempt_count(); > ret = __copy_from_user_inatomic(dest, from, sizeof(int)); > dec_preempt_count(); > - preempt_check_resched(); > > return ret ? -EFAULT : 0; > } inc_preempt_count() and dec_preempt_count() aren't needed, as preemption is disabled by the queue spinlocks. So get_futex_value_locked isn't needed any more: with the spinlocks held, __get_user will do. > [numerous instances of...] > + preempt_check_resched(); Not required. The spin unlocks will do this. > But with the recent changes to futex.c I think kernel can ensure > atomicity for free. I agree it would probably not slow the kernel, but I would _strongly_ prefer that Glibc were fixed to use the token-passing property, if Glibc is the driving intention behind this patch - instead of this becoming a semantic that application-level users of futex (like database and IPC libraries) come to depend on and which can't be decomposed into a multiple-waiting form. (I admit that the kernel code does look nicer with get_futex_value_locked gone, though). By the way, do you know of Scott Snyder's recent work on fixing Glibc in this way? He bumped into one of Glibc's currently broken corner cases, fixed it (according to the algorithm I gave in November), and reported that it works fine with the fix. -- Jamie ^ permalink raw reply [flat|nested] 28+ messages in thread
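[ For concreteness, here is roughly what the futex_fd[1] + poll[2] emulation of futex_wait would look like, modulo the poll-function flaw noted in [2]. A sketch only: glibc exports no futex() wrapper, so one is rolled by hand here, and FUTEX_FD's val argument is the signal number to deliver on wakeup (0 for none). Note the word is checked AFTER FUTEX_FD creates the descriptor, for the same queue-before-test reason as in the kernel:

    #define _GNU_SOURCE
    #include <poll.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    static long futex(int *uaddr, int op, int val)
    {
            return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Emulated FUTEX_WAIT: returns 0 on wakeup or non-match, -1 on error. */
    static int futex_wait_emulated(volatile int *uaddr, int val)
    {
            struct pollfd pfd;
            int fd = futex((int *)uaddr, FUTEX_FD, 0);  /* queue first */

            if (fd < 0)
                    return -1;

            /* Test the word only after the fd is queued; a FUTEX_WAKE
             * between this test and poll() makes the fd readable, so
             * no wakeup is lost. */
            if (*uaddr != val) {
                    close(fd);
                    return 0;
            }

            pfd.fd = fd;
            pfd.events = POLLIN;
            poll(&pfd, 1, -1);      /* the step broken by the flaw in [2] */
            close(fd);
            return 0;
    }
]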
* Re: Futex queue_me/get_user ordering
  2005-03-17 15:20 ` Jamie Lokier
@ 2005-03-17 15:55 ` Jakub Jelinek
  2005-03-18 17:00   ` Ingo Molnar
  1 sibling, 1 reply; 28+ messages in thread
From: Jakub Jelinek @ 2005-03-17 15:55 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, Ulrich Drepper, Roland McGrath, Scott Snyder

On Thu, Mar 17, 2005 at 03:20:31PM +0000, Jamie Lokier wrote:
> If you change futex_wait to be "atomic", and then have userspace locks
> which _depend_ on that atomicity, it becomes impossible to wait on
> multiple of those locks, or make poll-driven state machines which can
> wait on those locks.

The futex man pages that have been around for years (certainly since mid 2002) certainly don't document FUTEX_WAIT as a token-passing operation, but as an atomic operation. Say http://www.icewalkers.com/Linux/ManPages/futex-2.html

  FUTEX_WAIT
         This operation atomically verifies that the futex address
         still contains the value given, and sleeps awaiting
         FUTEX_WAKE on this futex address. If the timeout argument is
         non-NULL, its contents describe the maximum duration of the
         wait, which is infinite otherwise. For futex(4), this call
         is executed if decrementing the count gave a negative value
         (indicating contention), and will sleep until another
         process releases the futex and executes the FUTEX_WAKE
         operation.

  RETURN VALUE
         FUTEX_WAIT
                Returns 0 if the process was woken by a FUTEX_WAKE
                call. In case of timeout, ETIMEDOUT is returned. If
                the futex was not equal to the expected value, the
                operation returns EWOULDBLOCK. Signals (or other
                spurious wakeups) cause FUTEX_WAIT to return EINTR.

so there very well might be programs other than glibc that depend on this behaviour. Given that in most cases the race is not hit every day (after all, we have been living with it for several years), they probably wouldn't know there is a problem like that.

> You can do userspace threading and simulate most blocking system calls
> by making them non-blocking and using poll().

Sure, but then you need to write your own locking as well and can just use the token passing property of futexes there.

> It's not a _huge_ loss, but considering it's only Glibc which is
> demanding this, and futexes have another property, token-passing, which
> Glibc could be using instead - why not use it?

Because that requires requeue being done with the cv lock held, which means an extra context switch.

> > @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
> > 	inc_preempt_count();
> > 	ret = __copy_from_user_inatomic(dest, from, sizeof(int));
> > 	dec_preempt_count();
> > -	preempt_check_resched();
> >
> > 	return ret ? -EFAULT : 0;
> > }
>
> inc_preempt_count() and dec_preempt_count() aren't needed, as
> preemption is disabled by the queue spinlocks. So
> get_futex_value_locked isn't needed any more: with the spinlocks held,
> __get_user will do.

They aren't needed if CONFIG_PREEMPT. But with !CONFIG_PREEMPT, they are IMHO still needed, as spin_lock/spin_unlock call preempt_{disable,enable}, which is a nop if !CONFIG_PREEMPT. __get_user can't be used though, it should be __get_user_inatomic (or __copy_from_user_inatomic if the former doesn't exist).

> > [numerous instances of...]
> > + preempt_check_resched();
>
> Not required. The spin unlocks will do this.

True, preempt_check_resched() is a nop if !CONFIG_PREEMPT and for CONFIG_PREEMPT spin_unlock will handle it. Will remove them from the patch.
> > But with the recent changes to futex.c I think kernel can ensure > > atomicity for free. > > I agree it would probably not slow the kernel, but I would _strongly_ > prefer that Glibc were fixed to use the token-passing property, if > Glibc is the driving intention behind this patch - instead of this > becoming a semantic that application-level users of futex (like > database and IPC libraries) come to depend on and which can't be > decomposed into a multiple-waiting form. > > (I admit that the kernel code does look nicer with > get_futex_value_locked gone, though). > > By the way, do you know of Scott Snyder's recent work on fixing Glibc > in this way? He bumped into one of Glibc's currently broken corner > cases, fixed it (according to the algorithm I gave in November), and > reported that it works fine with the fix. I certainly haven't seen his patch. Jakub ^ permalink raw reply [flat|nested] 28+ messages in thread
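[ To make Jakub's !CONFIG_PREEMPT point concrete: with preemption compiled out, spin_lock() does not touch the preempt count, so the page fault handler's in_atomic() test would not know a spinlock is held. The explicit count manipulation is what makes a fault here fail fast instead of sleeping. This is the helper from the patch, recopied with comments added:

    static inline int get_futex_value_locked(int *dest, int __user *from)
    {
            int ret;

            /* Even on !CONFIG_PREEMPT, raise the preempt count so that
             * in_atomic() is true inside the fault handler: a fault on
             * this access must make the copy fail immediately, never
             * sleep on mmap_sem, because we hold a futex hash bucket
             * spinlock.  The caller drops the lock and retries with a
             * normal, faulting get_user() instead. */
            inc_preempt_count();
            ret = __copy_from_user_inatomic(dest, from, sizeof(int));
            dec_preempt_count();

            return ret ? -EFAULT : 0;
    }
]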
* Re: Futex queue_me/get_user ordering 2005-03-17 15:55 ` Jakub Jelinek @ 2005-03-18 17:00 ` Ingo Molnar 2005-03-21 2:55 ` Jamie Lokier 0 siblings, 1 reply; 28+ messages in thread From: Ingo Molnar @ 2005-03-18 17:00 UTC (permalink / raw) To: Jakub Jelinek Cc: Jamie Lokier, Hidetoshi Seto, Andrew Morton, linux-kernel, rusty, ahu, Ulrich Drepper, Roland McGrath, Scott Snyder * Jakub Jelinek <jakub@redhat.com> wrote: > The futex man pages that have been around for years (certainly since > mid 2002) certainly don't document FUTEX_WAIT as token passing > operation, but as atomic operation: > > Say http://www.icewalkers.com/Linux/ManPages/futex-2.html besides this documented-behavior argument, i dont think futexes should be degraded into waitqueues - in fact, to solve some of the known performance problems the opposite will have to happen: e.g. i believe that in the future we'll need to enable the kernel-side futex code to actually modify the futex variable. I.e. atomicity of the read in FUTEX_WAIT is an absolute must, and is only the first step. [ the double-context-switch problem in cond_signal() that Jamie mentioned is precisely one such case: pthread semantics force us that the wakeup of the wakee _must_ happen while still holding the internal lock. So we cannot just delay the wakeup to outside the glibc critical section. This double context-switch could be avoided if the 'release internal lock and wake up wakee' operation could be done atomically within the kernel. (A sane default 'userspace unlock' operation on a machine word could be defined .. e.g. decrement-to-zero.) ] so i'm very much in favor of your patch - it fixes a real bug and is also the right step forward. We'll need more locking code in the kernel to remove fundamental limitations of userspace (such as no ability to control preemption), not less. i've tested your latest patch (from today) on x86 and it boots/works fine with Fedora userspace, where futexes do get utilized, and ran a few tests as well. (Andrew - might make sense to include in the next -mm so that we get some feel of stability, while the conceptual discussion continues?) Ingo -- this patch makes FUTEX_WAIT atomic again. Signed-off-by: Jakub Jelinek <jakub@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> --- linux/kernel/futex.c.orig +++ linux/kernel/futex.c @@ -97,7 +97,6 @@ struct futex_q { */ struct futex_hash_bucket { spinlock_t lock; - unsigned int nqueued; struct list_head chain; }; @@ -265,7 +264,6 @@ static inline int get_futex_value_locked inc_preempt_count(); ret = __copy_from_user_inatomic(dest, from, sizeof(int)); dec_preempt_count(); - preempt_check_resched(); return ret ? -EFAULT : 0; } @@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u struct list_head *head1; struct futex_q *this, *next; int ret, drop_count = 0; - unsigned int nqueued; retry: down_read(¤t->mm->mmap_sem); @@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u bh1 = hash_futex(&key1); bh2 = hash_futex(&key2); - nqueued = bh1->nqueued; + if (bh1 < bh2) + spin_lock(&bh1->lock); + spin_lock(&bh2->lock); + if (bh1 > bh2) + spin_lock(&bh1->lock); + if (likely(valp != NULL)) { int curval; - /* In order to avoid doing get_user while - holding bh1->lock and bh2->lock, nqueued - (monotonically increasing field) must be first - read, then *uaddr1 fetched from userland and - after acquiring lock nqueued field compared with - the stored value. The smp_mb () below - makes sure that bh1->nqueued is read from memory - before *uaddr1. 
*/ - smp_mb(); - ret = get_futex_value_locked(&curval, (int __user *)uaddr1); if (unlikely(ret)) { + spin_unlock(&bh1->lock); + if (bh1 != bh2) + spin_unlock(&bh2->lock); + /* If we would have faulted, release mmap_sem, fault * it in and start all over again. */ @@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u } if (curval != *valp) { ret = -EAGAIN; - goto out; + goto out_unlock; } } - if (bh1 < bh2) - spin_lock(&bh1->lock); - spin_lock(&bh2->lock); - if (bh1 > bh2) - spin_lock(&bh1->lock); - - if (unlikely(nqueued != bh1->nqueued && valp != NULL)) { - ret = -EAGAIN; - goto out_unlock; - } - head1 = &bh1->chain; list_for_each_entry_safe(this, next, head1, list) { if (!match_futex (&this->key, &key1)) @@ -435,13 +420,9 @@ out: return ret; } -/* - * queue_me and unqueue_me must be called as a pair, each - * exactly once. They are called with the hashed spinlock held. - */ - /* The key must be already stored in q->key. */ -static void queue_me(struct futex_q *q, int fd, struct file *filp) +static inline struct futex_hash_bucket * +queue_lock(struct futex_q *q, int fd, struct file *filp) { struct futex_hash_bucket *bh; @@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q, q->lock_ptr = &bh->lock; spin_lock(&bh->lock); - bh->nqueued++; + return bh; +} + +static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh) +{ list_add_tail(&q->list, &bh->chain); spin_unlock(&bh->lock); } +static inline void +queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh) +{ + spin_unlock(&bh->lock); + drop_key_refs(&q->key); +} + +/* + * queue_me and unqueue_me must be called as a pair, each + * exactly once. They are called with the hashed spinlock held. + */ + +/* The key must be already stored in q->key. */ +static void queue_me(struct futex_q *q, int fd, struct file *filp) +{ + struct futex_hash_bucket *bh; + bh = queue_lock(q, fd, filp); + __queue_me(q, bh); +} + /* Return 1 if we were still queued (ie. 0 means we were woken) */ static int unqueue_me(struct futex_q *q) { @@ -503,6 +508,7 @@ static int futex_wait(unsigned long uadd DECLARE_WAITQUEUE(wait, current); int ret, curval; struct futex_q q; + struct futex_hash_bucket *bh; retry: down_read(¤t->mm->mmap_sem); @@ -511,7 +517,7 @@ static int futex_wait(unsigned long uadd if (unlikely(ret != 0)) goto out_release_sem; - queue_me(&q, -1, NULL); + bh = queue_lock(&q, -1, NULL); /* * Access the page AFTER the futex is queued. @@ -537,14 +543,13 @@ static int futex_wait(unsigned long uadd ret = get_futex_value_locked(&curval, (int __user *)uaddr); if (unlikely(ret)) { + queue_unlock(&q, bh); + /* If we would have faulted, release mmap_sem, fault it in and * start all over again. */ up_read(¤t->mm->mmap_sem); - if (!unqueue_me(&q)) /* There's a chance we got woken already */ - return 0; - ret = get_user(curval, (int __user *)uaddr); if (!ret) @@ -553,9 +558,13 @@ static int futex_wait(unsigned long uadd } if (curval != val) { ret = -EWOULDBLOCK; - goto out_unqueue; + queue_unlock(&q, bh); + goto out_release_sem; } + /* Only actually queue if *uaddr contained val. */ + __queue_me(&q, bh); + /* * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. @@ -596,10 +605,6 @@ static int futex_wait(unsigned long uadd * have handled it for us already. */ return -EINTR; - out_unqueue: - /* If we were woken (and unqueued), we succeeded, whatever. 
*/ - if (!unqueue_me(&q)) - ret = 0; out_release_sem: up_read(¤t->mm->mmap_sem); return ret; ^ permalink raw reply [flat|nested] 28+ messages in thread
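[ For readers wondering what a sane default "userspace unlock" on a machine word looks like in practice: the well-known three-state futex mutex (0 = unlocked, 1 = locked, 2 = locked with possible waiters), essentially the design analysed in Ulrich Drepper's "Futexes are tricky", unlocks with exactly such a decrement. A sketch under those assumptions, using GCC __sync builtins for the atomics and a hand-rolled futex() wrapper:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    static long futex(int *uaddr, int op, int val)
    {
            return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void futex_mutex_lock(int *m)
    {
            int c = __sync_val_compare_and_swap(m, 0, 1);

            if (c != 0) {
                    if (c != 2)
                            c = __sync_lock_test_and_set(m, 2); /* xchg */
                    while (c != 0) {
                            /* Sleep only while the word still reads 2;
                             * FUTEX_WAIT rechecks after queueing. */
                            futex(m, FUTEX_WAIT, 2);
                            c = __sync_lock_test_and_set(m, 2);
                    }
            }
    }

    static void futex_mutex_unlock(int *m)
    {
            /* The decrement-style unlock: only if the old value shows
             * possible waiters do we pay for a FUTEX_WAKE syscall. */
            if (__sync_fetch_and_sub(m, 1) != 1) {
                    *m = 0;
                    futex(m, FUTEX_WAKE, 1);
            }
    }
]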
* Re: Futex queue_me/get_user ordering 2005-03-18 17:00 ` Ingo Molnar @ 2005-03-21 2:55 ` Jamie Lokier 0 siblings, 0 replies; 28+ messages in thread From: Jamie Lokier @ 2005-03-21 2:55 UTC (permalink / raw) To: Ingo Molnar Cc: Jakub Jelinek, Hidetoshi Seto, Andrew Morton, linux-kernel, rusty, ahu, Ulrich Drepper, Roland McGrath, Scott Snyder Ingo Molnar wrote: > > * Jakub Jelinek <jakub@redhat.com> wrote: > > > The futex man pages that have been around for years (certainly since > > mid 2002) certainly don't document FUTEX_WAIT as token passing > > operation, but as atomic operation: > > > > Say http://www.icewalkers.com/Linux/ManPages/futex-2.html > > besides this documented-behavior argument, i dont think futexes should > be degraded into waitqueues I give in... Depending on atomicity makes it impossible for an application, which is linked with NPTL and Glibc, to write an NPTL-compatible "wait on two locks" function. I'm not saying that's a very clean thing to want, but it's a conceptual loss and I'm disappointed I seem to be the only one noticing it. On the other hand, I was mistaken to think it makes it impossible to write an emulation of synchronous futex() in terms of asynchronous futex().* In fact it makes it impossible to do so using the existing FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added somewhere down the line. * - The reason you would do this is if you were writing userspace-threading for any reason, and you had to include an emulation of synchronous futex() in terms of async futex because there are some libraries which might run on top of the userspace-threading which use futex in an application-dependent way. > - in fact, to solve some of the known > performance problems the opposite will have to happen: e.g. i believe > that in the future we'll need to enable the kernel-side futex code to > actually modify the futex variable. I.e. atomicity of the read in > FUTEX_WAIT is an absolute must, and is only the first step. Some of those performance problems can be solved already by better use of FUTEX_REQUEUE instead of FUTEX_WAKE. > [ the double-context-switch problem in cond_signal() that Jamie > mentioned is precisely one such case: pthread semantics force us that > the wakeup of the wakee _must_ happen while still holding the internal > lock. So we cannot just delay the wakeup to outside the glibc critical > section. This double context-switch could be avoided if the 'release > internal lock and wake up wakee' operation could be done atomically > within the kernel. (A sane default 'userspace unlock' operation on a > machine word could be defined .. e.g. decrement-to-zero.) ] Did you not see the solution I gave last November, using FUTEX_REQUEUE? See: http://lkml.org/lkml/2004/11/29/201 I spent a _lot_ of time figuring it out but everyone was too busy to confirm that it worked. It would improve performance in a number of cases. I hope that it does not get ignored yet again. There _may_ be cases where more complex futex operations are needed, but we should try the better algorithms that use the existing futex operations before adding new ones. -- Jamie ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Futex queue_me/get_user ordering 2005-03-17 15:20 ` Jamie Lokier 2005-03-17 15:55 ` Jakub Jelinek @ 2005-03-18 16:53 ` Jakub Jelinek 1 sibling, 0 replies; 28+ messages in thread From: Jakub Jelinek @ 2005-03-18 16:53 UTC (permalink / raw) To: Jamie Lokier Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, Ulrich Drepper, Roland McGrath, Scott Snyder On Thu, Mar 17, 2005 at 03:20:31PM +0000, Jamie Lokier wrote: > > [numerous instances of...] > > + preempt_check_resched(); > > Not required. The spin unlocks will do this. Here is updated patch with those removed (all of them are preceeded by spin_unlock) and out_unqueue label and following unused code removed too. --- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.000000000 -0500 +++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.000000000 -0500 @@ -97,7 +97,6 @@ struct futex_q { */ struct futex_hash_bucket { spinlock_t lock; - unsigned int nqueued; struct list_head chain; }; @@ -265,7 +264,6 @@ static inline int get_futex_value_locked inc_preempt_count(); ret = __copy_from_user_inatomic(dest, from, sizeof(int)); dec_preempt_count(); - preempt_check_resched(); return ret ? -EFAULT : 0; } @@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u struct list_head *head1; struct futex_q *this, *next; int ret, drop_count = 0; - unsigned int nqueued; retry: down_read(¤t->mm->mmap_sem); @@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u bh1 = hash_futex(&key1); bh2 = hash_futex(&key2); - nqueued = bh1->nqueued; + if (bh1 < bh2) + spin_lock(&bh1->lock); + spin_lock(&bh2->lock); + if (bh1 > bh2) + spin_lock(&bh1->lock); + if (likely(valp != NULL)) { int curval; - /* In order to avoid doing get_user while - holding bh1->lock and bh2->lock, nqueued - (monotonically increasing field) must be first - read, then *uaddr1 fetched from userland and - after acquiring lock nqueued field compared with - the stored value. The smp_mb () below - makes sure that bh1->nqueued is read from memory - before *uaddr1. */ - smp_mb(); - ret = get_futex_value_locked(&curval, (int __user *)uaddr1); if (unlikely(ret)) { + spin_unlock(&bh1->lock); + if (bh1 != bh2) + spin_unlock(&bh2->lock); + /* If we would have faulted, release mmap_sem, fault * it in and start all over again. */ @@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u } if (curval != *valp) { ret = -EAGAIN; - goto out; + goto out_unlock; } } - if (bh1 < bh2) - spin_lock(&bh1->lock); - spin_lock(&bh2->lock); - if (bh1 > bh2) - spin_lock(&bh1->lock); - - if (unlikely(nqueued != bh1->nqueued && valp != NULL)) { - ret = -EAGAIN; - goto out_unlock; - } - head1 = &bh1->chain; list_for_each_entry_safe(this, next, head1, list) { if (!match_futex (&this->key, &key1)) @@ -435,13 +420,9 @@ out: return ret; } -/* - * queue_me and unqueue_me must be called as a pair, each - * exactly once. They are called with the hashed spinlock held. - */ - /* The key must be already stored in q->key. 
*/ -static void queue_me(struct futex_q *q, int fd, struct file *filp) +static inline struct futex_hash_bucket * +queue_lock(struct futex_q *q, int fd, struct file *filp) { struct futex_hash_bucket *bh; @@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q, q->lock_ptr = &bh->lock; spin_lock(&bh->lock); - bh->nqueued++; + return bh; +} + +static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh) +{ list_add_tail(&q->list, &bh->chain); spin_unlock(&bh->lock); } +static inline void +queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh) +{ + spin_unlock(&bh->lock); + drop_key_refs(&q->key); +} + +/* + * queue_me and unqueue_me must be called as a pair, each + * exactly once. They are called with the hashed spinlock held. + */ + +/* The key must be already stored in q->key. */ +static void queue_me(struct futex_q *q, int fd, struct file *filp) +{ + struct futex_hash_bucket *bh; + bh = queue_lock(q, fd, filp); + __queue_me(q, bh); +} + /* Return 1 if we were still queued (ie. 0 means we were woken) */ static int unqueue_me(struct futex_q *q) { @@ -503,6 +508,7 @@ static int futex_wait(unsigned long uadd DECLARE_WAITQUEUE(wait, current); int ret, curval; struct futex_q q; + struct futex_hash_bucket *bh; retry: down_read(¤t->mm->mmap_sem); @@ -511,7 +517,7 @@ static int futex_wait(unsigned long uadd if (unlikely(ret != 0)) goto out_release_sem; - queue_me(&q, -1, NULL); + bh = queue_lock(&q, -1, NULL); /* * Access the page AFTER the futex is queued. @@ -537,14 +543,13 @@ static int futex_wait(unsigned long uadd ret = get_futex_value_locked(&curval, (int __user *)uaddr); if (unlikely(ret)) { + queue_unlock(&q, bh); + /* If we would have faulted, release mmap_sem, fault it in and * start all over again. */ up_read(¤t->mm->mmap_sem); - if (!unqueue_me(&q)) /* There's a chance we got woken already */ - return 0; - ret = get_user(curval, (int __user *)uaddr); if (!ret) @@ -553,9 +558,13 @@ static int futex_wait(unsigned long uadd } if (curval != val) { ret = -EWOULDBLOCK; - goto out_unqueue; + queue_unlock(&q, bh); + goto out_release_sem; } + /* Only actually queue if *uaddr contained val. */ + __queue_me(&q, bh); + /* * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. @@ -596,10 +605,6 @@ static int futex_wait(unsigned long uadd * have handled it for us already. */ return -EINTR; - out_unqueue: - /* If we were woken (and unqueued), we succeeded, whatever. */ - if (!unqueue_me(&q)) - ret = 0; out_release_sem: up_read(¤t->mm->mmap_sem); return ret; Jakub ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Futex queue_me/get_user ordering
  2004-11-17  8:47 ` Jakub Jelinek
  2004-11-18  2:10   ` Hidetoshi Seto
  2004-11-18  7:20   ` Jamie Lokier
@ 2004-11-26 17:06   ` Jamie Lokier
  2004-11-28 17:36     ` Joe Seigh
  2004-11-29 11:24     ` Jakub Jelinek
  2 siblings, 2 replies; 28+ messages in thread
From: Jamie Lokier @ 2004-11-26 17:06 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, drepper

I've looked at the lost-wakeups problem with NPTL condition variables and 2.6 futex, with the help of Jakub's finely presented pseudo-code. Unless I've made a mistake, it is fixable in userspace.

[ It might be more efficient to fix it in kernel space - on the other hand, doing so might also make kernel futexes slower. In general, I prefer if the kernel futex semantics can be as "loose" as possible to minimise the locking they are absolutely required to do. Who knows, we might come up with an algorithm that uses even less cross-CPU traffic in the kernel, if the semantics permit it. However, I appreciate that a more "atomic" kernel semantic is easier to understand, and it is possible to implement that if it is really worth doing. I would like to see benchmarks proving it doesn't slow down normal futex stress tests though. It might not be slower at all. ]

Ok. Userspace solutions first.

Logically, waiters have four states: Awake, About to sleep, Sleeping and Drowsy. These don't correspond to places in the code; they are just logical states for the purpose of reasoning.

Waiters go to sleep through a sequence, from Awake to About to sleep, then to Sleeping. This is prompted by the call to pthread_cond_wait. Waking up is prompted by passing around WAKE tokens.

The combined operation "futex++" followed by FUTEX_WAKE is always done as an ordered sequence, which we'll call offering a WAKE token. That operation offers a WAKE token to all waiters, and if there exists any single waiter in a state that will consume the token, that waiter consumes the token and transitions immediately to Awake.

The waker offering a WAKE token knows if a waiter accepts the token that it offers. A waiter knows if it accepts a token. Tokens are conserved exactly (like energy and momentum). This is important.

In the Sleeping state, waiters are woken by consuming a WAKE token, as soon as one becomes available.

In the About to sleep state, two transitions are possible. If time passes with no WAKE tokens, they become Sleeping. If a WAKE token is offered, they do not consume it, but they transition to a state called Drowsy instead.

In the Drowsy state, time can pass and it will transition to Awake. However, it can also accept a WAKE token in that state. This is optional: if a token is offered, it might not accept it. This is different from Sleeping, where if a token is offered it will definitely accept it.
These are all the transitions of a waiter:

    Awake -> About to sleep     [Called pthread_cond_wait]
    About to sleep -> Sleeping  [Time passes]
    About to sleep -> Drowsy    [Tickled by WAKE token but did not accept it]
    Sleeping -> Awake           [Accept one WAKE token - guaranteed to accept]
    Drowsy -> Awake             [Time passes]
    Drowsy -> Awake             [Accept one WAKE token - may refuse]

    +--------------+    time passes    +----------+
    |About to sleep| ----------------> | Sleeping |
    +--------------+                   +----------+
                 |                          |
      tickled by |                          |
   token but did |                          | WAKE token
   not accept it |                          | (guaranteed to accept)
                 V       time passes        V
          +----------+ ------------------> +---------+
          |  Drowsy  |                     |  Awake  |
          +----------+ ------------------> +---------+
                       WAKE token (may refuse)

The states actually correspond to the following real events. The condvar private mutex ensures that reading the futex value occurs before it is incremented:

    About to sleep == starting from mutex release by the waiter, until
      whichever comes first of FUTEX_WAKE and queue_me

    Sleeping == if FUTEX_WAKE comes after queue_me, this state begins
      at queue_me

    Drowsy == if FUTEX_WAKE comes before queue_me, the FUTEX_WAKE
      event is called "tickled by token" and this is the moment when
      Drowsy begins

    Awake == if FUTEX_WAKE comes before queue_me, Awake begins at
      unqueue_me or a subsequent FUTEX_WAKE, whichever comes first
      (these are the two transitions from Drowsy). If FUTEX_WAKE
      comes after queue_me, Awake begins at the moment of FUTEX_WAKE
      (this is the transition from Sleeping).

On Mon, Nov 15, 2004 at 01:22:18PM +0000, Jamie Lokier wrote:
> 2. Future lost wakeups.
>
>    Future calls to pthread_cond_signal(cond) fail to wake wait_B,
>    even much later, because cond's NPTL data structure is
>    inconsistent. Its invariant is broken.
>
>    This is a bug in NPTL and it's easy to fix. Never increment wake
>    unconditionally. Instead, increment it conditionally when (a)
>    FUTEX_WAKE returns 1, and also (b) when FUTEX_WAIT returns -EAGAIN.

This is easy to solve.

The key invariant which breaks is that (total_seq - wakeup_seq) is equal to the number of waiters which are effectively blocked. This corresponds to the states "Sleeping" and "About to sleep".

pthread_cond_signal checks (total_seq - wakeup_seq), and if it's > 0, increments wakeup_seq. To maintain the invariant, at the same time (i.e. inside the mutex), it offers a WAKE token (this is the operational sequence futex++ followed by FUTEX_WAKE). This is supposed to make one waiter in "About to sleep" or "Sleeping" transition to another state.

When there is only one waiter, this works. When there are two or more waiters, this fails because one of them can be "Drowsy". That's not one of the states counted in (total_seq - wakeup_seq), but it might accept the WAKE token, causing the attempt to decrease the number in "About to sleep" and "Sleeping" to fail. After the invariant is broken, no amount of calling pthread_cond_signal will wake up all waiters.

Now, a waker cannot detect which state ("Sleeping" or "Drowsy") accepted the token. A woken waiter cannot detect it either. Therefore the solution to this invariant _must_ involve not distinguishing those states.

The solution to maintaining the invariant is to include "Drowsy" in the states counted by (total_seq - wakeup_seq). This means that wakeup_seq must not be incremented by the waker if FUTEX_WAKE reports the WAKE token is not accepted ("About to sleep" -> "Drowsy", it's still in the counted set).
wakeup_seq must also be incremented by the waiter if FUTEX_WAIT reports that it did _not_ receive a token ("Drowsy" -> "Awake"), as this means the counted set has changed but this has not yet been reflected in wakeup_seq.

This still fails to wake up some waiters transiently (see later), but it solves this particular problem of the long term invariant breaking - this is the more serious problem.

Here's the implementation. You'll notice that we do something significant: we look at the return value of futex operations. That's why they return a value! :)

    pthread_cond_signal (cond)
    {
      mutex_lock (lock);
      if (total_seq > wakeup_seq) {
    <<<<<
        ++wakeup_seq, ++futex;
        futex (&futex, FUTEX_WAKE, 1);
    =====
        ++futex;
        wakeup_seq += futex (&futex, FUTEX_WAKE, 1);
    >>>>>
      }
      mutex_unlock (lock);
    }

    pthread_cond_wait (cond, mtx)
    {
      mutex_lock (lock);
      mutex_unlock (mtx->lock);
      ++total_seq;
      ++futex;
      mutex = mtx;
      bc_seq = broadcast_seq;
      seq = wakeup_seq;
      do {
        val = futex;
        mutex_unlock (lock);
    <<<<<
        futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
    =====
        result = futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
        if (result < 0)
          wakeup_seq++;
    >>>>>
        if (bc_seq != broadcast_seq)
          goto out;
      } while (wakeup_seq == seq || woken_seq == wakeup_seq);
      ++woken_seq;
    out:
      mutex_unlock (lock);
      mutex_lock (mtx->lock);
    }

(Thanks for the helpful pseudo-code, btw.)

Jakub Jelinek wrote:
> E.g. the futex IMHO must be incremented before FUTEX_WAKE, as otherwise
> the woken tasks wouldn't see the effect.

futex must be incremented before FUTEX_WAKE, but wakeup_seq does not have to be incremented before FUTEX_WAKE - the private mutex means that it can be incremented after.

> 1. A lost wakeup.
>
>    wait_A is woken, but wait_B is not, even though the second
>    pthread_cond_signal is "observably" after wait_B.
>
>    The operation order is observable in the sense that wait_B could
>    update the data structure which is protected by cond+mutex, and
>    wake_Y could read that update prior to deciding to signal.
>
>    _Logically_, what happens is that wait_A is woken by wake_X, but
>    it does not immediately re-acquire the mutex. In this time
>    window, wait_B and wake_Y both run, and then wait_A acquires the
>    mutex. During this window, wait_A is able to absorb the second
>    signal.
>
>    It's not clear to me if POSIX requires wait_B to be signalled or
>    not in this case.

Ok, I have seen it written and it makes sense that two signals should result in both waiters woken in this case. I think that's a reasonable expectation.

Using those logical states, this lost wakeup occurs because wait_A is woken by wake_X, entering the "Drowsy" state, and then it accepts a WAKE token from wake_Y, to become "Awake". Accepting a WAKE token in the "Drowsy" state prevents wait_B from accepting it.

In extreme cases, there can be a large number of threads in the "Drowsy" state, absorbing a lot of wakeups together.

There are several ways to fix this (the 6th is my favourite):

1. In the kernel, make the FUTEX_WAIT test-and-queue operation
   effectively atomic w.r.t. FUTEX_WAKE by more exclusive locks, as
   you have requested.

   Effect: Prevents the "Drowsy" state from accepting WAKE tokens.

2. Subtler: In the kernel, lock FUTEX_WAIT test-and-queue operations
   w.r.t. _other_ FUTEX_WAIT operations on the same futex, but not
   exclusive w.r.t. FUTEX_WAKE operations.

   Effect: Does not prevent "Drowsy" from accepting WAKE tokens, but
   does prevent any "Sleeping" states from existing at the same time,
   so "Drowsy" never steals WAKE tokens.
   To be more precise, just the region from get_user to unqueue_me
   needs to be locked w.r.t. other FUTEX_WAITs, but explaining this
   requires a more complicated state machine. This one is too subtle
   to be allowed, imho. Can you imagine the man page trying to
   explain it?

3. Related to the above, but purely userspace. Lock a second private
   mutex around each call to FUTEX_WAIT. At first sight this looks
   like it would be a performance killer, but it's not totally
   obvious whether it would be:

   <<<<<
       result = futex (&futex, FUTEX_WAIT, val);
   =====
       mutex_lock (lock2);
       result = futex (&futex, FUTEX_WAIT, val);
       mutex_unlock (lock2);
   >>>>>

4. A combination of low-impact kernel and userspace changes. In the
   kernel, change the return value of FUTEX_WAIT to report when the
   futex word didn't match but it received a wakeup anyway.

   Effect: Allows the waiter to detect that it absorbed a WAKE token
   in the "Drowsy" state, implying that it was maybe needed by
   another waiter, so it should re-transmit that token by calling
   FUTEX_WAKE.

   The kernel code change is trivial and has no performance impact on
   futexes in general, e.g. as used for mutexes, but here it might
   lead to redundant extra system calls in some cases.

   This strategy has a subtle behavioural quirk, which might be a
   flaw, I'm not sure, which is described at the end of answer 5
   below.

   Kernel change looks like:

    out_unqueue:
       /* If we were woken (and unqueued), we succeeded, whatever. */
       if (!unqueue_me(&q))
   <<<<<
           ret = 0;
   =====
           ret = 1;
   >>>>>

   Userspace change looks like:

       result = futex (&futex, FUTEX_WAIT, val);
       mutex_lock (lock);
       if (result < 0)
           wakeup_seq++;
   <<<<<
   =====
       else if (result > 0)
           wakeup_seq += futex (&futex, FUTEX_WAKE, 1);
   >>>>>

5. Like 4, but in the kernel. We change the kernel to _always_
   retransmit a wakeup if it's received by the unqueue_me() in the
   word-didn't-match branch.

   Effect: In the "Drowsy" state, a waiter may accept a WAKE token
   but then it will offer it again, so they are never lost from
   "Sleeping" states.

   NOTE: This is NOT equivalent to changing the kernel to do
   test-and-queue atomically. With this change, a FUTEX_WAKE
   operation can return to userspace _before_ the final destination
   of the WAKE token decides to begin FUTEX_WAIT.

   This will result in spurious extra wakeups, erring too far the
   other way, because of the difference from atomicity described in
   the preceding paragraph.

   Therefore, I don't like this. It would fix the NPTL condition
   variables, but introduces a few new problems:

   - It violates conservation of WAKE tokens (like energy and
     momentum), which some other futex-using code may depend on -
     unless the return value from FUTEX_WAIT is changed to report 1
     when it receives a token or 2 when it forwards it successfully.

   - Some spurious wakeups at times when a wakeup is not required.

   - No logical benefit over doing it in userspace, but it would take
     away flexibility if the kernel always did it.

6. Like 4, but this requires no kernel change, just userspace.
   Another counter is used to detect when retransmission is needed:

    pthread_cond_signal (cond)
    {
      mutex_lock (lock);
      if (total_seq > wakeup_seq) {
    <<<<<
        ++wakeup_seq, ++futex;
        futex (&futex, FUTEX_WAKE, 1);
    =====
        ++futex;
        ++missing;
        result = futex (&futex, FUTEX_WAKE, missing);
        wakeup_seq += result;
        missing -= result;
    >>>>>
      }
      mutex_unlock (lock);
    }

    pthread_cond_wait (cond, mtx)
    {
      mutex_lock (lock);
      mutex_unlock (mtx->lock);
      ++total_seq;
      ++futex;
      mutex = mtx;
      bc_seq = broadcast_seq;
      seq = wakeup_seq;
      do {
        val = futex;
        mutex_unlock (lock);
    <<<<<
        futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
    =====
        result = futex (&futex, FUTEX_WAIT, val);
        mutex_lock (lock);
        if (result < 0) {
          ++wakeup_seq;
          --missing;
        }
        if (missing) {
          result = futex (&futex, FUTEX_WAKE, missing);
          wakeup_seq += result;
          missing -= result;
        }
    >>>>>
        if (bc_seq != broadcast_seq)
          goto out;
      } while (wakeup_seq == seq || woken_seq == wakeup_seq);
      ++woken_seq;
    out:
      mutex_unlock (lock);
      mutex_lock (mtx->lock);
    }

NOTE: The difference in 5 between kernel atomic wakeups and kernel forwarded wakeups being observable has an analogous form in userspace pthreads condition variables, with any of the 4, 5 or 6 implementations. That is, anything that works by forwarding wakeups.

If an application calls pthread_cond_signal, then that returns, and then the application calls pthread_cond_wait, forwarded wakeups could result in that wait being woken by the signal which logically preceded it. This happens because the wake is "in flight", so to speak. It would also result in a different wait, queued earlier than the pthread_cond_signal call, not being woken because this one is woken in its place. The total number woken is fine.

The same thing can occur with solutions 4, 5 and 6. Those spuriously delayed wakeups may or may not be a problem. They are observable, so a program's behaviour could be written to depend on them not occurring. However, that's a pretty subtle thing to depend on - not the sort of thing programs using condvars would normally do.

This time I _really_ have no idea if that would be forbidden by POSIX or not. I suspect some implementations of condvar work a bit like queued signals or queued messages: pthread_cond_signal can return while the signal itself is still in flight, and the signal may be delivered to a subsequently starting wait, within a time window. Then again, maybe they aren't.

> If you think it is fixable in userland, please write at least the pseudo
> code that you believe should work. We have spent quite a lot of time
> on that code and don't believe this is solvable in userland.

I hope I have presented and explained the userland-only solutions. Out of all of the above, solution 6 looks most promising to me. I'm still having a think about the wakeup ordering issues mentioned at the end, though.

> I believe the only place this is solvable in is the kernel, by ensuring
> atomicity (i.e. queuing task iff curval == expected_val operation atomic
> wrt. futex_wake/futex_requeue in other tasks). In the RHEL3 2.4.x backport
> this is easy, as spinlock is held around the user access (the page is first
> pinned, then lock taken, then value compared (but that is guaranteed to
> be non-blocking) and if equal queued, then unlocked and unpinned.
> In 2.6.x this is harder if the kernel cannot allow some spinlock to be
> taken while doing user access, but I guess the kernel needs to cope
> with this, e.g. by queueing the task early but marking it as maybe queued
> only. If futex_wake sees such a bit, it would wait until futex_wait
> notifies it it has decided whether that one should be queued or not.
> Or something else, whatever, as long as the right semantics is ensured.

> Just FYI, current pseudo code is (not mentioning cancellation stuff here,
> code/data to deal with pthread_cond_destroy semantics, timedwait and
> pshared condvars):
>
> typedef struct { int lock, futex; uint64_t total_seq, wakeup_seq, woken_seq;
>   void *mutex; uint32_t broadcast_seq; } pthread_cond_t;

A few questions:

1. Why are total_seq and so on 64 bit quantities?

   The comparison problem on overflow is solvable by changing
   (total_seq > wakeup_seq) to (int32_t) (total_seq - wakeup_seq) > 0,
   just like the kernel does with jiffies.

   If you imagine the number of waiters to exceed 2^31, you have
   bigger problems, because:

2. futex is 32 bits and can overflow. If a waiter blocks, and a waker
   is then called 2^32 times in succession before the waiter can
   schedule again, the waiter will remain blocked after the waker
   returns.

   This is unlikely, except where it's done deliberately
   (e.g. SIGSTOP/CONT), and it's a bug and it only needs two threads!
   It could perhaps be used for denial of service.

3. Why is futex incremented in pthread_cond_wait?
   I don't see the reason for it.

4. In pthread_cond_broadcast, why is the mutex_unlock(lock) dropped
   before calling FUTEX_CMP_REQUEUE? Wouldn't it be better to drop
   the lock just after, in which case FUTEX_REQUEUE would be fine?

   pthread_cond_signal has no problem with holding the lock across
   FUTEX_WAKE, and I do not see any reason why that would be
   different for pthread_cond_broadcast.

> pthread_cond_broadcast (cond)
> {
>   mutex_lock (lock);
>   if (total_seq > wakeup_seq) {
>     woken_seq = wakeup_seq = total_seq;
>     futex = 2 * total_seq;
>     ++broadcast_seq;
>     val = futex;
>     mutex_unlock (lock);
>     if (futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX, &mutex->lock, val) < 0)
>       futex (&futex, FUTEX_WAKE, INT_MAX);
>     return;
>   }
>   mutex_unlock (lock);
> }

-- Jamie

^ permalink raw reply [flat|nested] 28+ messages in thread
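[ A sketch of the wraparound-safe comparison question 1 refers to, in the style of the kernel's time_after(); the name seq_after is illustrative:

    #include <stdint.h>

    /* True iff sequence number a is logically after b, tolerant of
     * 32-bit wraparound, provided the two never drift 2^31 or more
     * apart.  That holds here because (total_seq - wakeup_seq) is
     * bounded by the number of waiters, itself far below 2^31. */
    static inline int seq_after(uint32_t a, uint32_t b)
    {
            return (int32_t)(a - b) > 0;
    }
]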
* Re: Futex queue_me/get_user ordering 2004-11-26 17:06 ` Jamie Lokier @ 2004-11-28 17:36 ` Joe Seigh 2004-11-29 11:24 ` Jakub Jelinek 1 sibling, 0 replies; 28+ messages in thread From: Joe Seigh @ 2004-11-28 17:36 UTC (permalink / raw) To: linux-kernel Jamie Lokier wrote: > > I've looked at the problem of lost-wakeups problem with NPTL condition > variables and 2.6 futex, with the help of Jakub's finely presented > pseudo-code. Unless I've made a mistake, it is fixable in userspace. > > [ It might be more efficient to fix it in kernel space - on the other > hand, doing so might also make kernel futexes slower. In general, I > prefer if the kernel futex semantics can be as "loose" as possible > to minimise the locking they are absolutely required to do. Who > knows, we might come up with an algorithm that uses even less > cross-CPU traffic in the kernel, if the semantics permit it. > However, I appreciate that a more "atomic" kernel semantic is easier > to understand, and it is possible to implement that if it is really > worth doing. I would like to see benchmarks proving it doesn't slow > down normal futex stress tests though. It might not be slower at all. ] [...] > 5. Like 4, but in the kernel. We change the kernel to _always_ > retransmit a wakeup if it's received by the unqueue_me() in the > word-didn't-match branch. > > Effect: In the "Drowsy" state, a waiter may accept a WAKE token > but then it will offer it again so they are never lost from > "Sleeping" states. > > NOTE: This is NOT equivalent to changing the kernel to do > test-and-queue atomically. With this change, a FUTEX_WAKE > operation can return to userspace _before_ the final > destination of the WAKE token decides to begin FUTEX_WAIT. > > This will result in spurious extra wakeups, erring too far the > other way, because of the difference from atomicity described > in the preceding paragraph. > > Therefore, I don't like this. It would fix the NPTL condition > variables, but introduces two new problems: > > - It violates conservation of WAKE tokens (like energy and > momentum), which some other futex-using code may depend > on - unless the return value from FUTEX_WAIT is changed > to report 1 when it receives a token or 2 when it > forwards it successfully. > > - Some spurious wakeups at times when a wakeup is not > required. > > - No logical benefit over doing it in userspace, but > would take away flexibility if kernel always did it. > I think this is similar to a solution that I proposed elsewhere. You wake up some other thread, if any, waiting on the futex. This breaks what you call WAKE tokens but wait morphing with FUTEX_CMP_REQUEUE does that already as far as I can tell. A FUTEX_WAIT that has been requeued onto another futex could return EINTR instead of zero (one of the reasons you can't loop on EINTR's in the cond wait code). I did an alternate lock-free implementation of pthread condition variables with a work around of sorts for that futex wake preemption problem I mentioned earlier. I get a 3x to 200x performance improvement depending on what you are doing. So naturally I would be interested in a solution that doesn't require a userspace bottleneck. Joe Seigh ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Futex queue_me/get_user ordering 2004-11-26 17:06 ` Jamie Lokier 2004-11-28 17:36 ` Joe Seigh @ 2004-11-29 11:24 ` Jakub Jelinek 2004-11-29 21:50 ` Jamie Lokier 1 sibling, 1 reply; 28+ messages in thread From: Jakub Jelinek @ 2004-11-29 11:24 UTC (permalink / raw) To: Jamie Lokier Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, drepper On Fri, Nov 26, 2004 at 05:06:49PM +0000, Jamie Lokier wrote: Let's start with the questions: > A few questions: > > 1. Why are total_seq and so on 64 bit quantities? > > The comparison problem on overflow is solvable by changing > (total_seq > wakeup_seq) to (int32_t) (total_seq - > wakeup_seq) > 0, just like the kernel does with jiffies. > > If you imagine the number of waiters to exceed 2^31, you have > bigger problems, because: > > 2. futex is 32 bits and can overflow. If a waiter blocks, then > a waker is called 2^32 times in succession before the waiter > can schedule again, the waiter will remain blocked after the > waker returns. > > This is unlikely, except where it's done deliberately > (e.g. SIGSTOP/CONT), and it's a bug and it only needs two > threads! It could perhaps be used for denial of service. The only problem with the 32-bit overflow is if you get scheduled away in between releasing the CV's internal lock, i.e. lll_mutex_unlock (cond->__data.__lock); and if (get_user(curval, (int __user *)uaddr) != 0) { in kernel and don't get scheduled again for enough time to reach this place within 2^31 pthread_cond_{*wait,signal,broadcast} calls. There are no things on the userland side that would block and in kernel the only place you can block is down_read on mm's mmap_sem (but if the writer lock is held that long, other pthread_cond_* calls couldn't get in either) or the short term spinlocks on the hash bucket. SIGSTOP/SIGCONT affect the whole process, so unless you are talking about process shared condvars, these signals aren't going to help you in exploiting it. But, once you get past that point, current NPTL doesn't care if 2^31 or more other cv calls happen, it uses the 64-bit vars to determine what to do and they are big enough that overflows on them are just assumed not to happen. And only past that point the thread is blocked in longer-term waiting. > 3. Why is futex incremented in pthread_cond_wait? > I don't see the reason for it. See https://www.redhat.com/archives/phil-list/2004-May/msg00023.html https://www.redhat.com/archives/phil-list/2004-May/msg00022.html __data.__futex increases in pthread_cond_{signal,broadcast} are so that pthread_cond_*wait detects pthread_cond_{signal,broadcast} that happened in between releasing of internal cv lock in the *wait and being queued on the futex's wait queue. __data.__futex increases in pthread_cond_*wait are so that FUTEX_CMP_REQUEUE in pthread_cond_broadcast detects pthread_cond_*wait that happened in between releasing the internal lock in *broadcast and test in FUTEX_CMP_REQUEUE. > 4. In pthread_cond_broadcast, why is the mutex_unlock(lock) > dropped before calling FUTEX_CMP_REQUEUE? Wouldn't it be > better to drop the lock just after, in which case > FUTEX_REQUEUE would be fine? > > pthread_cond_signal has no problem with holding the lock > across FUTEX_WAKE, and I do not see any reason why that would > be different for pthread_cond_broadcast. Holding the internal lock over requeue kills performance of broadcast, if you hold the internal lock over the requeue, all the threads you wake up will block on the internal lock anyway. 
Jakub ^ permalink raw reply [flat|nested] 28+ messages in thread
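[ As background to the __futex / FUTEX_CMP_REQUEUE interplay Jakub describes: the 6-argument futex syscall packs nr_requeue into the timeout slot and the expected value into the last argument. A hypothetical wrapper (glibc exports none) showing the marshalling:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    /* Wake up to nr_wake waiters on uaddr and move up to nr_requeue
     * more onto uaddr2, but only if *uaddr still equals val3 when the
     * kernel looks; otherwise the call fails with EAGAIN and the
     * caller retries.  That recheck is exactly what the __futex
     * increments in pthread_cond_*wait are there to trip. */
    static long futex_cmp_requeue(int *uaddr, int nr_wake, int nr_requeue,
                                  int *uaddr2, int val3)
    {
            return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE, nr_wake,
                           (unsigned long)nr_requeue, uaddr2, val3);
    }
]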
* Re: Futex queue_me/get_user ordering
  2004-11-29 11:24 ` Jakub Jelinek
@ 2004-11-29 21:50 ` Jamie Lokier
  0 siblings, 0 replies; 28+ messages in thread
From: Jamie Lokier @ 2004-11-29 21:50 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Hidetoshi Seto, mingo, Andrew Morton, linux-kernel, rusty, ahu, drepper

Jakub Jelinek wrote:
> > 2. futex is 32 bits and can overflow. If a waiter blocks, and a waker
> >    is then called 2^32 times in succession before the waiter can
> >    schedule again, the waiter will remain blocked after the waker
> >    returns.
> >
> >    This is unlikely, except where it's done deliberately
> >    (e.g. SIGSTOP/CONT), and it's a bug and it only needs two threads!
> >    It could perhaps be used for denial of service.
>
> The only problem with the 32-bit overflow is if you get scheduled
> away in between releasing the CV's internal lock, i.e.
>   lll_mutex_unlock (cond->__data.__lock);
> and
>   if (get_user(curval, (int __user *)uaddr) != 0) {
> in kernel, and don't get scheduled again for enough time to reach
> this place within 2^31 pthread_cond_{*wait,signal,broadcast} calls.

Yes.

> There are no things on the userland side that would block, and
> in kernel the only place you can block is down_read on mm's mmap_sem
> (but if the writer lock is held that long, other pthread_cond_*
> calls couldn't get in either) or the short term spinlocks on the hash
> bucket. SIGSTOP/SIGCONT affect the whole process, so unless you are
> talking about process shared condvars, these signals aren't going to help
> you in exploiting it.

I agree, it is a difficult exploit, and the only consequence is that a thread hangs. I thought it worth mentioning only because Ulrich brings up a very similar 2^32 issue in "Futexes are tricky".

> But, once you get past that point, current NPTL doesn't care if 2^31 or
> more other cv calls happen, it uses the 64-bit vars to determine what to
> do and they are big enough that overflows on them are just assumed not to
> happen. And only past that point the thread is blocked in longer-term
> waiting.

About those 64-bit vars: don't the invariants guarantee the following?

    total_seq - wakeup_seq < number of waiters

The number of waiters is surely bounded by 2^31 (pid space), so 32-bit vars would be enough for sure, and using wraparound-safe comparisons (like time_after() in the kernel) would be strictly correct. I'm just offering an optimisation here: less memory, smaller code.

> > 3. Why is futex incremented in pthread_cond_wait?
> >    I don't see the reason for it.

I figured this out in a dream at the same time as you were writing this message! Then I woke and thought "doh!". Yes, it's pretty clear you must increment futex if the broadcast unlocks before requeuing.

> See
> https://www.redhat.com/archives/phil-list/2004-May/msg00023.html
> https://www.redhat.com/archives/phil-list/2004-May/msg00022.html

Examples of problems due to broadcast unlocking before requeueing, and the necessary fixes.

> > 4. In pthread_cond_broadcast, why is the mutex_unlock(lock)
> >    dropped before calling FUTEX_CMP_REQUEUE? Wouldn't it be
> >    better to drop the lock just after, in which case
> >    FUTEX_REQUEUE would be fine?
> >
> >    pthread_cond_signal has no problem with holding the lock
> >    across FUTEX_WAKE, and I do not see any reason why that would
> >    be different for pthread_cond_broadcast.
>
> Holding the internal lock over requeue kills performance of broadcast,
> if you hold the internal lock over the requeue, all the threads you
> wake up will block on the internal lock anyway.

Let's take a closer look.
Do you mean broadcast of process-shared condvars? When a process-local broadcast requeues, it doesn't wake up lots of threads; it wakes exactly one thread. When a process-shared broadcast requeues, it wakes every waiter (because it doesn't know the address of the mutex). First the process-local case. There are potentially 2 redundant context switches when signalling, and there would be potentially 2 when broadcasting process-local _if_ the lock were released after the requeue: - switch to the thread just woken (#1 redundant switch) - it tries to get the mutex and fails - switch back to the signal/broadcast thread (#2 redundant switch) - signaller/broadcaster releases mutex - switch to the thread just woken (this is not redundant) I thought this was what you meant, at first, and I wondered why spend so much effort fixing it for broadcast and not for signal. Surely signal is as important. Then I realised you might mean process-shared wakeups being slow because broadcast cannot requeue in that case. Still, the earlier thought revealed a neat solution to those 2 potential context switches that also fixes process-shared broadcast, while retaining the lock over requeue. This is worth a look because I think it may turn out to be faster for the common process-local cases too - precisely because it prevents the potential 2 context switches after pthread_cond_signal. (Some messages indicate that has been observed sometimes). I'll explain with code. There may be mistakes, but hopefully the principle is conveyed. Something to watch out for is that FUTEX_REQUEUE is used to requeue to &lock _and_ &mutex->lock in this code. pthread_cond_signal (cond) { mutex_lock (lock); if (total_seq > wakeup_seq) { - ++wakeup_seq, ++futex; - futex (&futex, FUTEX_WAKE, 1); + ++futex; + if (futex (&futex, FUTEX_REQUEUE, 0, 1, &lock) > 0) { + ++wakeup_seq; + lock = WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE; + } } mutex_unlock (lock); } pthread_cond_broadcast (cond) { mutex_lock (lock); if (total_seq > wakeup_seq) { - woken_seq = wakeup_seq = total_seq; - futex = 2 * total_seq; - ++broadcast_seq; - val = futex; - mutex_unlock (lock); - if (process_shared || futex (&futex, FUTEX_CMP_REQUEUE, 1, INT_MAX, - &mutex->lock, val) < 0) - futex (&futex, FUTEX_WAKE, INT_MAX); - return; + count = total_seq - wakeup_seq; + ++futex; + if (process_shared) { + count = futex (&futex, FUTEX_REQUEUE, 0, count, &lock); + wakeup_seq += count; + if (count > 0) + lock = WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE; + } else if (futex (&futex, FUTEX_REQUEUE, 0, 1, &lock) > 0) { + count = futex (&futex, FUTEX_REQUEUE, 0, count - 1, &mutex->lock); + wakeup_seq += count + 1; + lock = WHATEVER_MAKES_UNLOCK_CALL_FUTEX_WAKE; + } } mutex_unlock (lock); } pthread_cond_wait (cond, mtx) { mutex_lock (lock); mutex_unlock (mtx->lock); ++total_seq; - ++futex; mutex = mtx; bc_seq = broadcast_seq; seq = wakeup_seq; do { val = futex; mutex_unlock (lock); - futex (&futex, FUTEX_WAIT, val); - mutex_lock (lock); - if (bc_seq != broadcast_seq) - goto out; + result = futex (&futex, FUTEX_WAIT, val); + mutex_lock (lock); + if (result < 0 && wakeup_seq < total_seq) + wakeup_seq++; } while (wakeup_seq == seq || woken_seq == wakeup_seq); ++woken_seq; - out: mutex_unlock (lock); mutex_lock (mtx->lock); } (By the way, there's a further optimisation not shown for process-shared broadcast: if wait is called with a mutex in the same page as the condvar, the offset within that page is valid for computing the mutex address in the process-shared broadcast, so it can requeue to the mutex 
(By the way, there's a further optimisation not shown for
process-shared broadcast: if wait is called with a mutex in the same
page as the condvar, the offset within that page is valid for
computing the mutex address in the process-shared broadcast, so it can
requeue to the mutex in that case.)

-- Jamie
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
@ 2004-11-15 4:24 Chuck Ebbert
2004-11-15 8:08 ` Jamie Lokier
0 siblings, 1 reply; 28+ messages in thread
From: Chuck Ebbert @ 2004-11-15 4:24 UTC (permalink / raw)
To: Emergency Services Jamie Lokier; +Cc: linux-kernel
On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:
>+ * The basic logical guarantee of a futex is that it blocks ONLY
>+ * if cond(var) is known to be true at the time of blocking, for
>+ * any cond. If we queued after testing *uaddr, that would open
>+ * a race condition where we could block indefinitely with
>+ * cond(var) false, which would violate the guarantee.
>+ *
>+ * A consequence is that futex_wait() can return zero and absorb
>+ * a wakeup when *uaddr != val on entry to the syscall. This is
>+ * rare, but normal.
Why can't it absorb a wakeup and still return -EAGAIN when this happens?
IOW why not apply this patch to the original code?
================================================================================
return -EINTR;
out_unqueue:
- /* If we were woken (and unqueued), we succeeded, whatever. */
- if (!unqueue_me(&q))
- ret = 0;
+ unqueue_me(&q); /* ignore result from unqueue */
out_release_sem:
	up_read(&current->mm->mmap_sem);
return ret;
================================================================================
...and what is "Emergency Services", BTW?
--Chuck Ebbert 14-Nov-04 21:28:56
* Re: Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])
  2004-11-15  4:24 Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u]) Chuck Ebbert
@ 2004-11-15  8:08 ` Jamie Lokier
  0 siblings, 0 replies; 28+ messages in thread
From: Jamie Lokier @ 2004-11-15 8:08 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel

Chuck Ebbert wrote:
> On Sun, 14 Nov 2004 at 09:00:23 +0000 Emergency Services Jamie Lokier wrote:
>
> >+ * The basic logical guarantee of a futex is that it blocks ONLY
> >+ * if cond(var) is known to be true at the time of blocking, for
> >+ * any cond.  If we queued after testing *uaddr, that would open
> >+ * a race condition where we could block indefinitely with
> >+ * cond(var) false, which would violate the guarantee.
> >+ *
> >+ * A consequence is that futex_wait() can return zero and absorb
> >+ * a wakeup when *uaddr != val on entry to the syscall.  This is
> >+ * rare, but normal.
>
> Why can't it absorb a wakeup and still return -EAGAIN when this happens?
> IOW why not apply this patch to the original code?
>
>  out_unqueue:
> -	/* If we were woken (and unqueued), we succeeded, whatever. */
> -	if (!unqueue_me(&q))
> -		ret = 0;
> +	unqueue_me(&q);	/* ignore result from unqueue */
>  out_release_sem:
> 	up_read(&current->mm->mmap_sem);
> 	return ret;

Because the number of wakeups reported to callers of FUTEX_WAKE must
_exactly_ match the number of wakeups reported to callers of
FUTEX_WAIT.  They are like tokens, and for some data structures the
return value mustn't be lost or ignored, because that would break
structure invariants - such as the matching counters in the pthread
condvars which precipitated this thread.
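To make the token picture concrete, here is a toy sketch; the names
are illustrative, it is not the NPTL code:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int ftx;              /* the futex word */
    static int tokens_consumed;  /* must track the sum of FUTEX_WAKE
                                    return values over time */

    static void wait_one (int expected)
    {
        /* A 0 return means we were dequeued by FUTEX_WAKE, i.e. we
         * consumed a wakeup token, even if ftx != expected when the
         * syscall was entered. */
        if (syscall (SYS_futex, &ftx, FUTEX_WAIT, expected,
                     NULL, NULL, 0) == 0)
            ++tokens_consumed;   /* under the structure's own lock in
                                    real code */
        /* If the kernel instead reported -EAGAIN for a waiter it had
         * already dequeued, the waker's FUTEX_WAKE return would count
         * a token that no waiter ever accounts for, and the two
         * sides' counters would drift apart. */
    }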
> ...and what is "Emergency Services", BTW?

My little joke, as I wouldn't have known about this if Andrew Morton
hadn't forwarded me the message asking about it (I've been away from
l-k).

-- Jamie