* ERESTARTSYS escaping from sem_wait with RTLinux patch
@ 2009-10-10 9:09 Blaise Gassend
2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend
2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
0 siblings, 2 replies; 11+ messages in thread
From: Blaise Gassend @ 2009-10-10 9:09 UTC (permalink / raw)
To: linux-kernel; +Cc: Jeremy Leibs
[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]
The attached python program, in which 500 threads spin with microsecond
sleeps, crashes with a "sem_wait: Unknown error 512" (conditions
described below). This appears to be due to an ERESTARTSYS generated
from futex_wait escaping to user space (libc). My understanding is that
this should never happen and I am trying to track down what is going on.
Questions that would help me make progress:
-------------------------------------------
1) Where is the ERESTARTSYS being prevented from getting to user space?
The only likely place I see for preventing ERESTARTSYS from escaping to
user space is in arch/*/kernel/signal*.c. However, I don't see how the
code there is being called if there no signal pending. Is that a path
for ERESTARTSYS to escape from the kernel?
The following comment in kernel/futex.h in futex_wait makes me wonder if
two threads are getting marked as ERESTARTSYS. The first one to leave
the kernel processes the signal and restarts. The second one doesn't
have a signal to handle, so it returns to user space without getting
into signal*.c and wreaks havoc.
(...)
/*
* We expect signal_pending(current), but another thread may
* have handled it for us already.
*/
if (!abs_time)
return -ERESTARTSYS;
(...)
2) Why would this be happening only with RT kernels?
3) Any suggestions on the best place to patch/workaround this?
My understanding is that if I was to treat ERESTARTSYS as an EAGAIN,
most applications would be perfectly happy. Would bad things happen if I
replaced the ERESTARTSYS in futex_wait with an EAGAIN?
Crash conditions:
-----------------
- RTLinux only.
- More cores seems to make things worse. Lots of crashes on a dual-quad
core machine. None observed yet on dual core. At least one crash on a
dual-quad core when run with "taskset -c 1"
- Various versions, including 2.6.29.6-rt23, and whatever the latest was
earlier today.
- Seen on both ia64 and x86
- Ubuntu hardy and jaunty
- Sometimes hapens within 2 seconds on a dual quad-core machine, other
times will go for up to 30 minutes to an hour without crashing. I
suspect a dependence on system activity, but haven't noticed an obvious
pattern.
- Time to crash appears to drop fast with more CPU cores.
[-- Attachment #2: threadprocs8.py --]
[-- Type: text/x-python, Size: 222 bytes --]
import threading
import time
exiting = False
def spin():
while not exiting:
time.sleep(0.000001)
for i in range(0,500):
threading.Thread(target=spin).start()
try:
spin()
finally:
exiting = True
^ permalink raw reply [flat|nested] 11+ messages in thread* ERESTARTSYS escaping from sem_wait with Preempt-RT 2009-10-10 9:09 ERESTARTSYS escaping from sem_wait with RTLinux patch Blaise Gassend @ 2009-10-10 16:40 ` Blaise Gassend 2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner 1 sibling, 0 replies; 11+ messages in thread From: Blaise Gassend @ 2009-10-10 16:40 UTC (permalink / raw) To: linux-kernel Apologies, the message below is for Preempt-RT, not RTLinux. The original message with full details can be found here: http://lkml.org/lkml/2009/10/10/36 On Sat, 2009-10-10 at 02:09 -0700, Blaise Gassend wrote: > The attached python program, in which 500 threads spin with microsecond > sleeps, crashes with a "sem_wait: Unknown error 512" (conditions > described below). This appears to be due to an ERESTARTSYS generated > from futex_wait escaping to user space (libc). My understanding is that > this should never happen and I am trying to track down what is going on. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-10 9:09 ERESTARTSYS escaping from sem_wait with RTLinux patch Blaise Gassend 2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend @ 2009-10-10 17:59 ` Thomas Gleixner 2009-10-10 19:08 ` Jeremy Leibs 2009-10-11 5:48 ` Jeremy Leibs 1 sibling, 2 replies; 11+ messages in thread From: Thomas Gleixner @ 2009-10-10 17:59 UTC (permalink / raw) To: Blaise Gassend; +Cc: LKML, Jeremy Leibs, Darren Hart, Peter Zijlstra Blaise, On Sat, 10 Oct 2009, Blaise Gassend wrote: > 1) Where is the ERESTARTSYS being prevented from getting to user space? > > The only likely place I see for preventing ERESTARTSYS from escaping to > user space is in arch/*/kernel/signal*.c. However, I don't see how the > code there is being called if there no signal pending. Is that a path > for ERESTARTSYS to escape from the kernel? > > The following comment in kernel/futex.h in futex_wait makes me wonder if > two threads are getting marked as ERESTARTSYS. The first one to leave > the kernel processes the signal and restarts. The second one doesn't > have a signal to handle, so it returns to user space without getting > into signal*.c and wreaks havoc. > > (...) > /* > * We expect signal_pending(current), but another thread may > * have handled it for us already. > */ > if (!abs_time) > return -ERESTARTSYS; > (...) If the task is woken by a signal, then the task private flag TIF_SIGPENDING is set, but in case of a process wide signal the signal might have been handled by another thread of the same process before that thread reaches the signal handling code, but then ERESTARTSYS is handled gracefully. So you seem to trigger a code path which does not go through do_signal. > 2) Why would this be happening only with RT kernels? Slightly different timing and locking semantics. > 3) Any suggestions on the best place to patch/workaround this? > > My understanding is that if I was to treat ERESTARTSYS as an EAGAIN, > most applications would be perfectly happy. Would bad things happen if I > replaced the ERESTARTSYS in futex_wait with an EAGAIN? No workarounds please. We really want to know what's wrong. Two things to look at: 1) Does that happen with 2.6.31.2-rt13 as well ? 2) Add a check to the code path where ERESTARTSYS is returned: if (!signal_pending(current)) printk(KERN_ERR "....."); If you can see that message then we'll look further. I'll give your script a test ride on my systems as well. Thanks, tglx ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner @ 2009-10-10 19:08 ` Jeremy Leibs 2009-10-11 2:07 ` Jeremy Leibs 2009-10-11 5:48 ` Jeremy Leibs 1 sibling, 1 reply; 11+ messages in thread From: Jeremy Leibs @ 2009-10-10 19:08 UTC (permalink / raw) To: Thomas Gleixner; +Cc: Blaise Gassend, LKML, Darren Hart, Peter Zijlstra Thomas, thanks for the quick reply. On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > Blaise, > > On Sat, 10 Oct 2009, Blaise Gassend wrote: >> 1) Where is the ERESTARTSYS being prevented from getting to user space? >> >> The only likely place I see for preventing ERESTARTSYS from escaping to >> user space is in arch/*/kernel/signal*.c. However, I don't see how the >> code there is being called if there no signal pending. Is that a path >> for ERESTARTSYS to escape from the kernel? >> >> The following comment in kernel/futex.h in futex_wait makes me wonder if >> two threads are getting marked as ERESTARTSYS. The first one to leave >> the kernel processes the signal and restarts. The second one doesn't >> have a signal to handle, so it returns to user space without getting >> into signal*.c and wreaks havoc. >> >> (...) >> /* >> * We expect signal_pending(current), but another thread may >> * have handled it for us already. >> */ >> if (!abs_time) >> return -ERESTARTSYS; >> (...) > > If the task is woken by a signal, then the task private flag > TIF_SIGPENDING is set, but in case of a process wide signal the signal > might have been handled by another thread of the same process before > that thread reaches the signal handling code, but then ERESTARTSYS is > handled gracefully. So you seem to trigger a code path which does not > go through do_signal. > >> 2) Why would this be happening only with RT kernels? > > Slightly different timing and locking semantics. > >> 3) Any suggestions on the best place to patch/workaround this? >> >> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN, >> most applications would be perfectly happy. Would bad things happen if I >> replaced the ERESTARTSYS in futex_wait with an EAGAIN? > > No workarounds please. We really want to know what's wrong. > > Two things to look at: > > 1) Does that happen with 2.6.31.2-rt13 as well ? I am nearly certain we saw the problems with the newer kernel as well, although that was back with a much less concise test and I've since reinstalled over that machine in the process of trying a number of different 32/64 hardy/jaunty configurations on different hardware. I'll do a fresh install of that particular kernel with default configuration options on our hardware and let you know a little later today. > 2) Add a check to the code path where ERESTARTSYS is returned: > > if (!signal_pending(current)) > printk(KERN_ERR "....."); > > If you can see that message then we'll look further. I'll give your > script a test ride on my systems as well. > > Thanks, > > tglx > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-10 19:08 ` Jeremy Leibs @ 2009-10-11 2:07 ` Jeremy Leibs 0 siblings, 0 replies; 11+ messages in thread From: Jeremy Leibs @ 2009-10-11 2:07 UTC (permalink / raw) To: Thomas Gleixner; +Cc: Blaise Gassend, LKML, Darren Hart, Peter Zijlstra On Sat, Oct 10, 2009 at 12:08 PM, Jeremy Leibs <leibs@willowgarage.com> wrote: > Thomas, thanks for the quick reply. > > On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote: -----snip----- >> >> 1) Does that happen with 2.6.31.2-rt13 as well ? > > I am nearly certain we saw the problems with the newer kernel as well, > although that was back with a much less concise test and I've since > reinstalled over that machine in the process of trying a number of > different 32/64 hardy/jaunty configurations on different hardware. > I'll do a fresh install of that particular kernel with default > configuration options on our hardware and let you know a little later > today. Running on fresh install of 64 bit jaunty: leibs@c1:~$ uname -rv 2.6.31.2-rt13 #1 SMP PREEMPT RT Sat Oct 10 15:34:13 PDT 2009 leibs@c1:~$ python threadprocs8.py sem_wait: Unknown error 512 Fatal Python error: ceval: orphan tstate Aborted ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner 2009-10-10 19:08 ` Jeremy Leibs @ 2009-10-11 5:48 ` Jeremy Leibs 2009-10-12 14:16 ` Darren Hart 1 sibling, 1 reply; 11+ messages in thread From: Jeremy Leibs @ 2009-10-11 5:48 UTC (permalink / raw) To: Thomas Gleixner; +Cc: Blaise Gassend, LKML, Darren Hart, Peter Zijlstra On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > Blaise, > > On Sat, 10 Oct 2009, Blaise Gassend wrote: >> 1) Where is the ERESTARTSYS being prevented from getting to user space? >> >> The only likely place I see for preventing ERESTARTSYS from escaping to >> user space is in arch/*/kernel/signal*.c. However, I don't see how the >> code there is being called if there no signal pending. Is that a path >> for ERESTARTSYS to escape from the kernel? >> >> The following comment in kernel/futex.h in futex_wait makes me wonder if >> two threads are getting marked as ERESTARTSYS. The first one to leave >> the kernel processes the signal and restarts. The second one doesn't >> have a signal to handle, so it returns to user space without getting >> into signal*.c and wreaks havoc. >> >> (...) >> /* >> * We expect signal_pending(current), but another thread may >> * have handled it for us already. >> */ >> if (!abs_time) >> return -ERESTARTSYS; >> (...) > > If the task is woken by a signal, then the task private flag > TIF_SIGPENDING is set, but in case of a process wide signal the signal > might have been handled by another thread of the same process before > that thread reaches the signal handling code, but then ERESTARTSYS is > handled gracefully. So you seem to trigger a code path which does not > go through do_signal. > >> 2) Why would this be happening only with RT kernels? > > Slightly different timing and locking semantics. > >> 3) Any suggestions on the best place to patch/workaround this? >> >> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN, >> most applications would be perfectly happy. Would bad things happen if I >> replaced the ERESTARTSYS in futex_wait with an EAGAIN? > > No workarounds please. We really want to know what's wrong. > > Two things to look at: > > 1) Does that happen with 2.6.31.2-rt13 as well ? > > 2) Add a check to the code path where ERESTARTSYS is returned: > > if (!signal_pending(current)) > printk(KERN_ERR "....."); > Ok, in 2.6.31.2-rt13, I modified futex.c as: ----- /* * We expect signal_pending(current), but another thread may * have handled it for us already. */ ret = -ERESTARTSYS; if (!abs_time) { if (!signal_pending(current)) printk(KERN_ERR "....."); goto out_put_key; } ----- Then when I cause the crash: leibs@c1:~$ python threadprocs8.py sem_wait: Unknown error 512 Segmentation fault dmesg shows me the corresponding: [ 82.232999] ..... [ 82.233177] python[2834]: segfault at 48 ip 00000000004b0177 sp 00007f9429788ad8 error 4 in python2.6[400000+216000] --Jeremy ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-11 5:48 ` Jeremy Leibs @ 2009-10-12 14:16 ` Darren Hart [not found] ` <1255384010.10236.123.camel@lts.willowgarage.com> 0 siblings, 1 reply; 11+ messages in thread From: Darren Hart @ 2009-10-12 14:16 UTC (permalink / raw) To: Jeremy Leibs; +Cc: Thomas Gleixner, Blaise Gassend, LKML, Peter Zijlstra Jeremy Leibs wrote: > On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote: >> Blaise, >> >> On Sat, 10 Oct 2009, Blaise Gassend wrote: >>> 1) Where is the ERESTARTSYS being prevented from getting to user space? >>> >>> The only likely place I see for preventing ERESTARTSYS from escaping to >>> user space is in arch/*/kernel/signal*.c. However, I don't see how the >>> code there is being called if there no signal pending. Is that a path >>> for ERESTARTSYS to escape from the kernel? >>> >>> The following comment in kernel/futex.h in futex_wait makes me wonder if >>> two threads are getting marked as ERESTARTSYS. The first one to leave >>> the kernel processes the signal and restarts. The second one doesn't >>> have a signal to handle, so it returns to user space without getting >>> into signal*.c and wreaks havoc. >>> >>> (...) >>> /* >>> * We expect signal_pending(current), but another thread may >>> * have handled it for us already. >>> */ >>> if (!abs_time) >>> return -ERESTARTSYS; >>> (...) >> If the task is woken by a signal, then the task private flag >> TIF_SIGPENDING is set, but in case of a process wide signal the signal >> might have been handled by another thread of the same process before >> that thread reaches the signal handling code, but then ERESTARTSYS is >> handled gracefully. So you seem to trigger a code path which does not >> go through do_signal. >> >>> 2) Why would this be happening only with RT kernels? >> Slightly different timing and locking semantics. >> >>> 3) Any suggestions on the best place to patch/workaround this? >>> >>> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN, >>> most applications would be perfectly happy. Would bad things happen if I >>> replaced the ERESTARTSYS in futex_wait with an EAGAIN? >> No workarounds please. We really want to know what's wrong. >> >> Two things to look at: >> >> 1) Does that happen with 2.6.31.2-rt13 as well ? >> >> 2) Add a check to the code path where ERESTARTSYS is returned: >> >> if (!signal_pending(current)) >> printk(KERN_ERR "....."); >> > > Ok, in 2.6.31.2-rt13, I modified futex.c as: > ----- > /* > * We expect signal_pending(current), but another thread may > * have handled it for us already. > */ > ret = -ERESTARTSYS; > if (!abs_time) > { > if (!signal_pending(current)) > printk(KERN_ERR "....."); > goto out_put_key; > } > ----- > > Then when I cause the crash: > > leibs@c1:~$ python threadprocs8.py > sem_wait: Unknown error 512 > Segmentation fault > > dmesg shows me the corresponding: > [ 82.232999] ..... > [ 82.233177] python[2834]: segfault at 48 ip 00000000004b0177 sp > 00007f9429788ad8 error 4 in python2.6[400000+216000] OK, so I suspect one of two things. 1) Recent changes to futex.c have somehow created a wakeup race and unqueue_me() doesn't detect it was woken with FUTEX_WAKE, then falls out through the ERESTARTSYS path. 2) Recent changes have exposed an existing race in unqueue_me(). I'll do some runs on my 8-way systems and see if I can: o Identify the guilty patch o Identify the race in question Thanks for the test case! Now... why is sem_wait() being used in a timer call.... -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ^ permalink raw reply [flat|nested] 11+ messages in thread
[parent not found: <1255384010.10236.123.camel@lts.willowgarage.com>]
[parent not found: <4AD3BD57.6080703@us.ibm.com>]
[parent not found: <4AD3D6AE.2050609@us.ibm.com>]
[parent not found: <4AD3FFB0.5030405@us.ibm.com>]
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch [not found] ` <4AD3FFB0.5030405@us.ibm.com> @ 2009-10-13 4:54 ` Darren Hart 2009-10-13 8:56 ` Blaise Gassend 0 siblings, 1 reply; 11+ messages in thread From: Darren Hart @ 2009-10-13 4:54 UTC (permalink / raw) To: Blaise Gassend; +Cc: Jeremy Leibs, Thomas Gleixner, Peter Zijlstra, lkml, Darren Hart wrote: > Resending, hopefully with fixed whitespace mangling in the trace this > time... > > Darren Hart wrote: >> Darren Hart wrote: >>> Blaise Gassend wrote: >>>> A few more questions you may have answers to: >>>> >>>> Do you have any idea what this comment in futex.c could be referring >>>> to? >>>> >>>> /* * We expect signal_pending(current), but another thread may * >>>> have handled it for us already. */ >>>> As far as I have been able to understand, signals are thread-specific, >>>> and hence it doesn't make sense to me that another thread could have >>>> handled it. >>> >>> Signals are only thread specific when using something like >>> pthread_kill() to send the signal, otherwise they are process wide. >>> >>>> >>>>> OK, so I suspect one of two things. >>>>> >>>>> 1) Recent changes to futex.c have somehow created a wakeup race and >>>>> unqueue_me() doesn't detect it was woken with FUTEX_WAKE, then >>>>> falls >>>>> out through the ERESTARTSYS path. >>>>> >>>>> 2) Recent changes have exposed an existing race in unqueue_me(). >>>> >>>> Is it possible that there aren't many people using PREEMPT RT on 8 CPU >>>> machines, and hence this is a bug that just has't been observed yet? >>> >>> We actually do extensive testing on 8way systems with some large apps >>> that beat on futexes pretty badly. You've simply uncovered a nasty >>> little race in the wakeup path. >>> >>> I believe I have identified the patch where this became possible (I >>> don't say the cause of the bug, because it's possible this patch simply >>> exposed an existing race): >>> >>> 928686b77ab275fd7f828ff24bd510baca995425 futex: Wake up waiter outside >>> the hb->lock section >>> >>> I am currently instrumenting the futex code and trying to identify how >>> the race occurs. > > ... > >>> Full output here: > > ... > >> http://dvhart.com/darren/files/futex_wake_function.trace.gz >> >> It's a tad difficult to navigate, but I believe I am seeing >> wake_futex_list() try and wake python-3490 without previously adding >> it to the wake-list. If we are somehow not cleaning up our wake_list, >> this would explain why unqueue_me() sees the q->lock_ptr as non-null - >> wake_futex() wasn't called to clear it. > > OK, I believe I can confirm this with this subset of the trace. It follows > three futex_wait and wake-up cycles. The third wake-up occurs without the > python-3490 thread ever having been added to the wake_list (at least, there > is not record of it in the trace). Now to see why this might be the case... > > python-3490 [002] 259.041420: futex_wait <-do_futex > python-3490 [002] 259.041420: futex_wait_setup <-futex_wait > python-3490 [002] 259.043888: futex_wait_queue_me <-futex_wait > python-3490 [002] 259.043888: queue_me <-futex_wait_queue_me > python-3490 [002] 259.043920: schedule <-futex_wait_queue_me > python-3507 [004] 259.043929: wake_futex: adding python-3490 to > wake_list > python-3507 [004] 259.043957: wake_futex_list: wake_futex_list: > waking python-3490 > python-3490 [002] 259.043981: futex_wait: normal futex wake-up > detected for python-3490 > > python-3490 [002] 259.043987: futex_wait <-do_futex > python-3490 [002] 259.043987: futex_wait_setup <-futex_wait > python-3490 [002] 259.044323: futex_wait_queue_me <-futex_wait > python-3490 [002] 259.044323: queue_me <-futex_wait_queue_me > python-3495 [002] 259.044571: wake_futex: adding python-3490 to > wake_list Interesting, here we never see a wake_futex_list: waking python-3490. So the task wakes here and thinks it is a normal wakeup, when perhaps it is not. If a timeout or a signal were to occur here, we would not detect them as unqueue_me() would see the lock_ptr had been nulled by wake_futex(). The task returns to userspace ignoring the timeout or signal. > python-3490 [002] 259.044843: futex_wait: normal futex wake-up > detected for python-3490 > > python-3490 [002] 259.044848: futex_wait <-do_futex The app then puts it back to sleep here. > python-3490 [002] 259.044848: futex_wait_setup <-futex_wait > python-3490 [002] 259.046648: futex_wait_queue_me <-futex_wait > python-3490 [002] 259.046648: queue_me <-futex_wait_queue_me > python-3490 [002] 259.046664: schedule <-futex_wait_queue_me > ********* python-3490 was never added to the wake_list !!!!!!! > ********* > > python-3495 [002] 259.046680: wake_futex_list: wake_futex_list: > waking python-3490 When 3495 finally get's to run and complete it's futex_wake() call, the task still needs to be woken, so we wake it - but now it's enqueued with a different futex_q, which now has a valid lock_ptr, so upon wake-up we expect a signal! OK, I believe this establishes root cause. Now to come up with a fix... > python-3490 [002] 259.046816: futex_wait: returning 1, non-futex > wakeup for python-3490 > python-3490 [002] 259.046817: futex_wait: p->futex_wakeup: (null) > python-3490 [002] 259.046819: futex_wait: error in wake-up > detection, no signal pending for python-3490 Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-13 4:54 ` Darren Hart @ 2009-10-13 8:56 ` Blaise Gassend 2009-10-13 15:13 ` Darren Hart 0 siblings, 1 reply; 11+ messages in thread From: Blaise Gassend @ 2009-10-13 8:56 UTC (permalink / raw) To: Darren Hart; +Cc: Jeremy Leibs, Thomas Gleixner, Peter Zijlstra, lkml, > When 3495 finally get's to run and complete it's futex_wake() call, the task > still needs to be woken, so we wake it - but now it's enqueued with a different > futex_q, which now has a valid lock_ptr, so upon wake-up we expect a signal! > > OK, I believe this establishes root cause. Now to come up with a fix... Wow, good work Darren! You definitely have more experience than I do at tracking down these in-kernel races, and I'm glad we have you looking at this. I'm snarfing up useful techniques from your progress emails. So if I understand correctly, there is a race between wake_futex and a timeout (or a signal, but AFAIK when my python example is running steady-state there are no signals). The problem occurs when wake_futex gets preempted half-way through, after it has NULLed lock_ptr, but before it has woken up the waiter. If a timeout/signal wakes the waiter, and the waiter runs and starts waiting again before the waker resumes, then the waker will end up waking the waiter a second time, without the lock_ptr for the second wait being NULLified. This causes the waiter to mis-interpret what woke it and leads to the fault we observed. Now I am wondering if the workaround described here http://markmail.org/message/at2s337yz3wclaif for what seems like this same problem isn't actually a legitimate fix. It ends up looking something like this: (lines 3 and 4 are new) ret = -ERESTARTSYS; if (!abs_time) { if (!signal_pending(current)) set_tsk_thread_flag(current,TIF_SIGPENDING); goto out_put_key; } I think this works because the only bad thing that is happening in the scenario you worked out is that the waiter failed to detect that the second wake was spurious, and instead interpreted it as a signal. However, the waiter can easily detect that the wakeup is spurious: 1) a futex wakeup can be detected by unqueue_me returning 0 2) a timeout can be detected by looking at to->task 3) a signal can be detected by calling signal_pending futex_wait is already doing 1) and 2). If it additionally did 3), which I think it is an inexpensive check, then it could detect the spurious wakeup, and restart the wait, either by doing a set_tsk_thread_flag(current,TIF_SIGPENDING) and relying on the signal handling to restart the syscall as in the patch above, or by looping back to somewhere around the futex_wait_setup call. All the other fix ideas that come to mind end up slowing down the common case by introducing additional locking in futex_wakeup, and hence seem less desirable. At this point I'll admit that I'm out of my depth and see what you think... Blaise ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-13 8:56 ` Blaise Gassend @ 2009-10-13 15:13 ` Darren Hart 2009-10-13 18:45 ` Thomas Gleixner 0 siblings, 1 reply; 11+ messages in thread From: Darren Hart @ 2009-10-13 15:13 UTC (permalink / raw) To: Blaise Gassend; +Cc: Jeremy Leibs, Thomas Gleixner, Peter Zijlstra, lkml, Blaise Gassend wrote: >> When 3495 finally get's to run and complete it's futex_wake() call, the task >> still needs to be woken, so we wake it - but now it's enqueued with a different >> futex_q, which now has a valid lock_ptr, so upon wake-up we expect a signal! >> >> OK, I believe this establishes root cause. Now to come up with a fix... > > Wow, good work Darren! You definitely have more experience than I do at > tracking down these in-kernel races, and I'm glad we have you looking at > this. I'm snarfing up useful techniques from your progress emails. Great, I learn a lot from reading other people's status-type email as well. Glad I can be on the contributing end once and a while :) > > So if I understand correctly, there is a race between wake_futex and a > timeout (or a signal, but AFAIK when my python example is running > steady-state there are no signals). The problem occurs when wake_futex > gets preempted half-way through, after it has NULLed lock_ptr, but > before it has woken up the waiter. If a timeout/signal wakes the waiter, > and the waiter runs and starts waiting again before the waker resumes, > then the waker will end up waking the waiter a second time, without the > lock_ptr for the second wait being NULLified. This causes the waiter to > mis-interpret what woke it and leads to the fault we observed. > > Now I am wondering if the workaround described here > http://markmail.org/message/at2s337yz3wclaif > for what seems like this same problem isn't actually a legitimate fix. > It ends up looking something like this: (lines 3 and 4 are new) > > ret = -ERESTARTSYS; > if (!abs_time) { > if (!signal_pending(current)) > set_tsk_thread_flag(current,TIF_SIGPENDING); > goto out_put_key; > } The trouble with this is it is a bandaid to a fundamentally broken wake-up path. I tried flagging the waiters on the wake-list as already woken and then skipping them in the wake_futex_list(), but this got ugly really fast. Talking with Thomas a bit more we're not sure the patch that introduced this lockless waking actually does any good, as the normal wakeup path doesn't take the hb->lock anyway, it's more likely the contention was due to an app like this that wakes a task and almost immediately puts it back to sleep on a futex before the waker has a chance to drop the hb->lock. The futex wake-up path is complicated enough as it is, in my personal opinion, we are better off dropping the "lockless wake-up" patch and removing the race and simplifying the wake-up path at the same time. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch 2009-10-13 15:13 ` Darren Hart @ 2009-10-13 18:45 ` Thomas Gleixner 0 siblings, 0 replies; 11+ messages in thread From: Thomas Gleixner @ 2009-10-13 18:45 UTC (permalink / raw) To: Darren Hart; +Cc: Blaise Gassend, Jeremy Leibs, Peter Zijlstra, lkml, On Tue, 13 Oct 2009, Darren Hart wrote: > Talking with Thomas a bit more we're not sure the patch that introduced this > lockless waking actually does any good, as the normal wakeup path doesn't take > the hb->lock anyway, it's more likely the contention was due to an app like > this that wakes a task and almost immediately puts it back to sleep on a futex > before the waker has a chance to drop the hb->lock. The patch was done before we restructured the wakeup code in mainline and it's not necessary anymore. > The futex wake-up path is complicated enough as it is, in my personal opinion, > we are better off dropping the "lockless wake-up" patch and removing the race > and simplifying the wake-up path at the same time. Nevertheless we need graceful handling of spurious wakeups in the waiter code. I'm working on a fix for mainline, which will be in the next -rt release as well (after I reverted the no longer valid wakeup optimization) Thanks folks for debugging that. tglx ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2009-10-13 18:46 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-10 9:09 ERESTARTSYS escaping from sem_wait with RTLinux patch Blaise Gassend
2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend
2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
2009-10-10 19:08 ` Jeremy Leibs
2009-10-11 2:07 ` Jeremy Leibs
2009-10-11 5:48 ` Jeremy Leibs
2009-10-12 14:16 ` Darren Hart
[not found] ` <1255384010.10236.123.camel@lts.willowgarage.com>
[not found] ` <4AD3BD57.6080703@us.ibm.com>
[not found] ` <4AD3D6AE.2050609@us.ibm.com>
[not found] ` <4AD3FFB0.5030405@us.ibm.com>
2009-10-13 4:54 ` Darren Hart
2009-10-13 8:56 ` Blaise Gassend
2009-10-13 15:13 ` Darren Hart
2009-10-13 18:45 ` Thomas Gleixner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox