ERESTARTSYS escaping from sem_wait with RTLinux patch

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* ERESTARTSYS escaping from sem_wait with RTLinux patch
@ 2009-10-10  9:09 Blaise Gassend
  2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend
  2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
  0 siblings, 2 replies; 11+ messages in thread
From: Blaise Gassend @ 2009-10-10  9:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jeremy Leibs

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

The attached python program, in which 500 threads spin with microsecond
sleeps, crashes with a "sem_wait: Unknown error 512" (conditions
described below). This appears to be due to an ERESTARTSYS generated
from futex_wait escaping to user space (libc). My understanding is that
this should never happen and I am trying to track down what is going on.

Questions that would help me make progress:
-------------------------------------------

1) Where is the ERESTARTSYS being prevented from getting to user space? 

The only likely place I see for preventing ERESTARTSYS from escaping to
user space is in arch/*/kernel/signal*.c. However, I don't see how the
code there is being called if there no signal pending. Is that a path
for ERESTARTSYS to escape from the kernel?

The following comment in kernel/futex.h in futex_wait makes me wonder if
two threads are getting marked as ERESTARTSYS. The first one to leave
the kernel processes the signal and restarts. The second one doesn't
have a signal to handle, so it returns to user space without getting
into signal*.c and wreaks havoc.

    (...)
        /*
         * We expect signal_pending(current), but another thread may
         * have handled it for us already.
         */
        if (!abs_time)
                return -ERESTARTSYS;
    (...)

2) Why would this be happening only with RT kernels?

3) Any suggestions on the best place to patch/workaround this? 

My understanding is that if I was to treat ERESTARTSYS as an EAGAIN,
most applications would be perfectly happy. Would bad things happen if I
replaced the ERESTARTSYS in futex_wait with an EAGAIN?

Crash conditions:
-----------------

- RTLinux only.
- More cores seems to make things worse. Lots of crashes on a dual-quad
core machine. None observed yet on dual core. At least one crash on a
dual-quad core when run with "taskset -c 1"
- Various versions, including 2.6.29.6-rt23, and whatever the latest was
earlier today.
- Seen on both ia64 and x86
- Ubuntu hardy and jaunty
- Sometimes hapens within 2 seconds on a dual quad-core machine, other
times will go for up to 30 minutes to an hour without crashing. I
suspect a dependence on system activity, but haven't noticed an obvious
pattern.
- Time to crash appears to drop fast with more CPU cores.

[-- Attachment #2: threadprocs8.py --]
[-- Type: text/x-python, Size: 222 bytes --]

import threading
import time

exiting = False

def spin():
    while not exiting:
        time.sleep(0.000001)

for i in range(0,500):
    threading.Thread(target=spin).start()

try:
    spin()
finally:
    exiting = True

^ permalink raw reply	[flat|nested] 11+ messages in thread

* ERESTARTSYS escaping from sem_wait with Preempt-RT
  2009-10-10  9:09 ERESTARTSYS escaping from sem_wait with RTLinux patch Blaise Gassend
@ 2009-10-10 16:40 ` Blaise Gassend
  2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
  1 sibling, 0 replies; 11+ messages in thread
From: Blaise Gassend @ 2009-10-10 16:40 UTC (permalink / raw)
  To: linux-kernel

Apologies, the message below is for Preempt-RT, not RTLinux. The
original message with full details can be found here:
http://lkml.org/lkml/2009/10/10/36

On Sat, 2009-10-10 at 02:09 -0700, Blaise Gassend wrote:
> The attached python program, in which 500 threads spin with microsecond
> sleeps, crashes with a "sem_wait: Unknown error 512" (conditions
> described below). This appears to be due to an ERESTARTSYS generated
> from futex_wait escaping to user space (libc). My understanding is that
> this should never happen and I am trying to track down what is going on.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-10  9:09 ERESTARTSYS escaping from sem_wait with RTLinux patch Blaise Gassend
  2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend
@ 2009-10-10 17:59 ` Thomas Gleixner
  2009-10-10 19:08   ` Jeremy Leibs
  2009-10-11  5:48   ` Jeremy Leibs
  1 sibling, 2 replies; 11+ messages in thread
From: Thomas Gleixner @ 2009-10-10 17:59 UTC (permalink / raw)
  To: Blaise Gassend; +Cc: LKML, Jeremy Leibs, Darren Hart, Peter Zijlstra

Blaise,

On Sat, 10 Oct 2009, Blaise Gassend wrote:
> 1) Where is the ERESTARTSYS being prevented from getting to user space? 
> 
> The only likely place I see for preventing ERESTARTSYS from escaping to
> user space is in arch/*/kernel/signal*.c. However, I don't see how the
> code there is being called if there no signal pending. Is that a path
> for ERESTARTSYS to escape from the kernel?
> 
> The following comment in kernel/futex.h in futex_wait makes me wonder if
> two threads are getting marked as ERESTARTSYS. The first one to leave
> the kernel processes the signal and restarts. The second one doesn't
> have a signal to handle, so it returns to user space without getting
> into signal*.c and wreaks havoc.
> 
>     (...)
>         /*
>          * We expect signal_pending(current), but another thread may
>          * have handled it for us already.
>          */
>         if (!abs_time)
>                 return -ERESTARTSYS;
>     (...)

If the task is woken by a signal, then the task private flag
TIF_SIGPENDING is set, but in case of a process wide signal the signal
might have been handled by another thread of the same process before
that thread reaches the signal handling code, but then ERESTARTSYS is
handled gracefully. So you seem to trigger a code path which does not
go through do_signal.

> 2) Why would this be happening only with RT kernels?

Slightly different timing and locking semantics.

> 3) Any suggestions on the best place to patch/workaround this? 
> 
> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN,
> most applications would be perfectly happy. Would bad things happen if I
> replaced the ERESTARTSYS in futex_wait with an EAGAIN?

No workarounds please. We really want to know what's wrong.

Two things to look at:

1) Does that happen with 2.6.31.2-rt13 as well ?

2) Add a check to the code path where ERESTARTSYS is returned:

   if (!signal_pending(current))
      printk(KERN_ERR "....."); 

If you can see that message then we'll look further. I'll give your
script a test ride on my systems as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
@ 2009-10-10 19:08   ` Jeremy Leibs
  2009-10-11  2:07     ` Jeremy Leibs
  2009-10-11  5:48   ` Jeremy Leibs
  1 sibling, 1 reply; 11+ messages in thread
From: Jeremy Leibs @ 2009-10-10 19:08 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Blaise Gassend, LKML, Darren Hart, Peter Zijlstra

Thomas, thanks for the quick reply.

On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Blaise,
>
> On Sat, 10 Oct 2009, Blaise Gassend wrote:
>> 1) Where is the ERESTARTSYS being prevented from getting to user space?
>>
>> The only likely place I see for preventing ERESTARTSYS from escaping to
>> user space is in arch/*/kernel/signal*.c. However, I don't see how the
>> code there is being called if there no signal pending. Is that a path
>> for ERESTARTSYS to escape from the kernel?
>>
>> The following comment in kernel/futex.h in futex_wait makes me wonder if
>> two threads are getting marked as ERESTARTSYS. The first one to leave
>> the kernel processes the signal and restarts. The second one doesn't
>> have a signal to handle, so it returns to user space without getting
>> into signal*.c and wreaks havoc.
>>
>>     (...)
>>         /*
>>          * We expect signal_pending(current), but another thread may
>>          * have handled it for us already.
>>          */
>>         if (!abs_time)
>>                 return -ERESTARTSYS;
>>     (...)
>
> If the task is woken by a signal, then the task private flag
> TIF_SIGPENDING is set, but in case of a process wide signal the signal
> might have been handled by another thread of the same process before
> that thread reaches the signal handling code, but then ERESTARTSYS is
> handled gracefully. So you seem to trigger a code path which does not
> go through do_signal.
>
>> 2) Why would this be happening only with RT kernels?
>
> Slightly different timing and locking semantics.
>
>> 3) Any suggestions on the best place to patch/workaround this?
>>
>> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN,
>> most applications would be perfectly happy. Would bad things happen if I
>> replaced the ERESTARTSYS in futex_wait with an EAGAIN?
>
> No workarounds please. We really want to know what's wrong.
>
> Two things to look at:
>
> 1) Does that happen with 2.6.31.2-rt13 as well ?

I am nearly certain we saw the problems with the newer kernel as well,
although that was back with a much less concise test and I've since
reinstalled over that machine in the process of trying a number of
different 32/64 hardy/jaunty configurations on different hardware.
I'll do a fresh install of that particular kernel with default
configuration options on our hardware and let you know a little later
today.

> 2) Add a check to the code path where ERESTARTSYS is returned:
>
>   if (!signal_pending(current))
>      printk(KERN_ERR ".....");
>
> If you can see that message then we'll look further. I'll give your
> script a test ride on my systems as well.
>
> Thanks,
>
>        tglx
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-10 19:08   ` Jeremy Leibs
@ 2009-10-11  2:07     ` Jeremy Leibs
  0 siblings, 0 replies; 11+ messages in thread
From: Jeremy Leibs @ 2009-10-11  2:07 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Blaise Gassend, LKML, Darren Hart, Peter Zijlstra

On Sat, Oct 10, 2009 at 12:08 PM, Jeremy Leibs <leibs@willowgarage.com> wrote:
> Thomas, thanks for the quick reply.
>
> On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
-----snip-----
>>
>> 1) Does that happen with 2.6.31.2-rt13 as well ?
>
> I am nearly certain we saw the problems with the newer kernel as well,
> although that was back with a much less concise test and I've since
> reinstalled over that machine in the process of trying a number of
> different 32/64 hardy/jaunty configurations on different hardware.
> I'll do a fresh install of that particular kernel with default
> configuration options on our hardware and let you know a little later
> today.

Running on fresh install of 64 bit jaunty:

leibs@c1:~$ uname -rv
2.6.31.2-rt13 #1 SMP PREEMPT RT Sat Oct 10 15:34:13 PDT 2009

leibs@c1:~$ python threadprocs8.py
sem_wait: Unknown error 512
Fatal Python error: ceval: orphan tstate
Aborted

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
  2009-10-10 19:08   ` Jeremy Leibs
@ 2009-10-11  5:48   ` Jeremy Leibs
  2009-10-12 14:16     ` Darren Hart
  1 sibling, 1 reply; 11+ messages in thread
From: Jeremy Leibs @ 2009-10-11  5:48 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Blaise Gassend, LKML, Darren Hart, Peter Zijlstra

On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Blaise,
>
> On Sat, 10 Oct 2009, Blaise Gassend wrote:
>> 1) Where is the ERESTARTSYS being prevented from getting to user space?
>>
>> The only likely place I see for preventing ERESTARTSYS from escaping to
>> user space is in arch/*/kernel/signal*.c. However, I don't see how the
>> code there is being called if there no signal pending. Is that a path
>> for ERESTARTSYS to escape from the kernel?
>>
>> The following comment in kernel/futex.h in futex_wait makes me wonder if
>> two threads are getting marked as ERESTARTSYS. The first one to leave
>> the kernel processes the signal and restarts. The second one doesn't
>> have a signal to handle, so it returns to user space without getting
>> into signal*.c and wreaks havoc.
>>
>>     (...)
>>         /*
>>          * We expect signal_pending(current), but another thread may
>>          * have handled it for us already.
>>          */
>>         if (!abs_time)
>>                 return -ERESTARTSYS;
>>     (...)
>
> If the task is woken by a signal, then the task private flag
> TIF_SIGPENDING is set, but in case of a process wide signal the signal
> might have been handled by another thread of the same process before
> that thread reaches the signal handling code, but then ERESTARTSYS is
> handled gracefully. So you seem to trigger a code path which does not
> go through do_signal.
>
>> 2) Why would this be happening only with RT kernels?
>
> Slightly different timing and locking semantics.
>
>> 3) Any suggestions on the best place to patch/workaround this?
>>
>> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN,
>> most applications would be perfectly happy. Would bad things happen if I
>> replaced the ERESTARTSYS in futex_wait with an EAGAIN?
>
> No workarounds please. We really want to know what's wrong.
>
> Two things to look at:
>
> 1) Does that happen with 2.6.31.2-rt13 as well ?
>
> 2) Add a check to the code path where ERESTARTSYS is returned:
>
>   if (!signal_pending(current))
>      printk(KERN_ERR ".....");
>

Ok, in 2.6.31.2-rt13, I modified futex.c as:
-----
        /*
         * We expect signal_pending(current), but another thread may
         * have handled it for us already.
         */
        ret = -ERESTARTSYS;
        if (!abs_time)
          {
            if (!signal_pending(current))
              printk(KERN_ERR ".....");
            goto out_put_key;
          }
-----

Then when I cause the crash:

leibs@c1:~$ python threadprocs8.py
sem_wait: Unknown error 512
Segmentation fault

dmesg shows me the corresponding:
[   82.232999] .....
[   82.233177] python[2834]: segfault at 48 ip 00000000004b0177 sp
00007f9429788ad8 error 4 in python2.6[400000+216000]


--Jeremy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-11  5:48   ` Jeremy Leibs
@ 2009-10-12 14:16     ` Darren Hart
       [not found]       ` <1255384010.10236.123.camel@lts.willowgarage.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Darren Hart @ 2009-10-12 14:16 UTC (permalink / raw)
  To: Jeremy Leibs; +Cc: Thomas Gleixner, Blaise Gassend, LKML, Peter Zijlstra

Jeremy Leibs wrote:
> On Sat, Oct 10, 2009 at 10:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Blaise,
>>
>> On Sat, 10 Oct 2009, Blaise Gassend wrote:
>>> 1) Where is the ERESTARTSYS being prevented from getting to user space?
>>>
>>> The only likely place I see for preventing ERESTARTSYS from escaping to
>>> user space is in arch/*/kernel/signal*.c. However, I don't see how the
>>> code there is being called if there no signal pending. Is that a path
>>> for ERESTARTSYS to escape from the kernel?
>>>
>>> The following comment in kernel/futex.h in futex_wait makes me wonder if
>>> two threads are getting marked as ERESTARTSYS. The first one to leave
>>> the kernel processes the signal and restarts. The second one doesn't
>>> have a signal to handle, so it returns to user space without getting
>>> into signal*.c and wreaks havoc.
>>>
>>>     (...)
>>>         /*
>>>          * We expect signal_pending(current), but another thread may
>>>          * have handled it for us already.
>>>          */
>>>         if (!abs_time)
>>>                 return -ERESTARTSYS;
>>>     (...)
>> If the task is woken by a signal, then the task private flag
>> TIF_SIGPENDING is set, but in case of a process wide signal the signal
>> might have been handled by another thread of the same process before
>> that thread reaches the signal handling code, but then ERESTARTSYS is
>> handled gracefully. So you seem to trigger a code path which does not
>> go through do_signal.
>>
>>> 2) Why would this be happening only with RT kernels?
>> Slightly different timing and locking semantics.
>>
>>> 3) Any suggestions on the best place to patch/workaround this?
>>>
>>> My understanding is that if I was to treat ERESTARTSYS as an EAGAIN,
>>> most applications would be perfectly happy. Would bad things happen if I
>>> replaced the ERESTARTSYS in futex_wait with an EAGAIN?
>> No workarounds please. We really want to know what's wrong.
>>
>> Two things to look at:
>>
>> 1) Does that happen with 2.6.31.2-rt13 as well ?
>>
>> 2) Add a check to the code path where ERESTARTSYS is returned:
>>
>>   if (!signal_pending(current))
>>      printk(KERN_ERR ".....");
>>
> 
> Ok, in 2.6.31.2-rt13, I modified futex.c as:
> -----
>         /*
>          * We expect signal_pending(current), but another thread may
>          * have handled it for us already.
>          */
>         ret = -ERESTARTSYS;
>         if (!abs_time)
>           {
>             if (!signal_pending(current))
>               printk(KERN_ERR ".....");
>             goto out_put_key;
>           }
> -----
> 
> Then when I cause the crash:
> 
> leibs@c1:~$ python threadprocs8.py
> sem_wait: Unknown error 512
> Segmentation fault
> 
> dmesg shows me the corresponding:
> [   82.232999] .....
> [   82.233177] python[2834]: segfault at 48 ip 00000000004b0177 sp
> 00007f9429788ad8 error 4 in python2.6[400000+216000]


OK, so I suspect one of two things.

1) Recent changes to futex.c have somehow created a wakeup race and
    unqueue_me() doesn't detect it was woken with FUTEX_WAKE, then falls
    out through the ERESTARTSYS path.

2) Recent changes have exposed an existing race in unqueue_me().

I'll do some runs on my 8-way systems and see if I can:
o Identify the guilty patch
o Identify the race in question

Thanks for the test case! Now... why is sem_wait() being used in a timer 
call....

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 11+ messages in thread

[parent not found: <1255384010.10236.123.camel@lts.willowgarage.com>]

[parent not found: <4AD3BD57.6080703@us.ibm.com>]

[parent not found: <4AD3D6AE.2050609@us.ibm.com>]

[parent not found: <4AD3FFB0.5030405@us.ibm.com>]

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
       [not found]             ` <4AD3FFB0.5030405@us.ibm.com>
@ 2009-10-13  4:54               ` Darren Hart
  2009-10-13  8:56                 ` Blaise Gassend
  0 siblings, 1 reply; 11+ messages in thread
From: Darren Hart @ 2009-10-13  4:54 UTC (permalink / raw)
  To: Blaise Gassend; +Cc: Jeremy Leibs, Thomas Gleixner, Peter Zijlstra, lkml, 

Darren Hart wrote:
> Resending, hopefully with fixed whitespace mangling in the trace this
> time...
> 
> Darren Hart wrote:
>> Darren Hart wrote:
>>> Blaise Gassend wrote:
>>>> A few more questions you may have answers to:
>>>>
>>>> Do you have any idea what this comment in futex.c could be referring 
>>>> to?
>>>>
>>>> /*  * We expect signal_pending(current), but another thread may  * 
>>>> have handled it for us already.  */
>>>> As far as I have been able to understand, signals are thread-specific,
>>>> and hence it doesn't make sense to me that another thread could have
>>>> handled it.
>>>
>>> Signals are only thread specific when using something like
>>> pthread_kill() to send the signal, otherwise they are process wide.
>>>
>>>>
>>>>> OK, so I suspect one of two things.
>>>>>
>>>>> 1) Recent changes to futex.c have somehow created a wakeup race and
>>>>>     unqueue_me() doesn't detect it was woken with FUTEX_WAKE, then 
>>>>> falls
>>>>>     out through the ERESTARTSYS path.
>>>>>
>>>>> 2) Recent changes have exposed an existing race in unqueue_me().
>>>>
>>>> Is it possible that there aren't many people using PREEMPT RT on 8 CPU
>>>> machines, and hence this is a bug that just has't been observed yet?
>>>
>>> We actually do extensive testing on 8way systems with some large apps
>>> that beat on futexes pretty badly.  You've simply uncovered a nasty
>>> little race in the wakeup path.
>>>
>>> I believe I have identified the patch where this became possible (I
>>> don't say the cause of the bug, because it's possible this patch simply
>>> exposed an existing race):
>>>
>>> 928686b77ab275fd7f828ff24bd510baca995425 futex: Wake up waiter outside
>>> the hb->lock section
>>>
>>> I am currently instrumenting the futex code and trying to identify how
>>> the race occurs.
> 
> ...
> 
>>> Full output here:
> 
> ...
> 
>> http://dvhart.com/darren/files/futex_wake_function.trace.gz
>>
>> It's a tad difficult to navigate, but I believe I am seeing 
>> wake_futex_list() try and wake python-3490 without previously adding 
>> it to the wake-list.  If we are somehow not cleaning up our wake_list, 
>> this would explain why unqueue_me() sees the q->lock_ptr as non-null - 
>> wake_futex() wasn't called to clear it.
> 
> OK, I believe I can confirm this with this subset of the trace. It follows
> three futex_wait and wake-up cycles. The third wake-up occurs without the
> python-3490 thread ever having been added to the wake_list (at least, there
> is not record of it in the trace). Now to see why this might be the case...
> 
>         python-3490  [002]   259.041420: futex_wait <-do_futex
>         python-3490  [002]   259.041420: futex_wait_setup <-futex_wait
>         python-3490  [002]   259.043888: futex_wait_queue_me <-futex_wait
>         python-3490  [002]   259.043888: queue_me <-futex_wait_queue_me
>         python-3490  [002]   259.043920: schedule <-futex_wait_queue_me
>      python-3507  [004]   259.043929: wake_futex: adding python-3490 to 
> wake_list
>      python-3507  [004]   259.043957: wake_futex_list: wake_futex_list: 
> waking python-3490
>      python-3490  [002]   259.043981: futex_wait: normal futex wake-up 
> detected for python-3490
> 
>         python-3490  [002]   259.043987: futex_wait <-do_futex
>      python-3490  [002]   259.043987: futex_wait_setup <-futex_wait
>      python-3490  [002]   259.044323: futex_wait_queue_me <-futex_wait
>      python-3490  [002]   259.044323: queue_me <-futex_wait_queue_me
>      python-3495  [002]   259.044571: wake_futex: adding python-3490 to 
> wake_list

Interesting, here we never see a wake_futex_list: waking python-3490.
So the task wakes here and thinks it is a normal wakeup, when perhaps it is
not. If a timeout or a signal were to occur here, we would not detect them
as unqueue_me() would see the lock_ptr had been nulled by wake_futex(). The
task returns to userspace ignoring the timeout or signal.

>      python-3490  [002]   259.044843: futex_wait: normal futex wake-up 
> detected for python-3490
> 
>      python-3490  [002]   259.044848: futex_wait <-do_futex

The app then puts it back to sleep here.


>      python-3490  [002]   259.044848: futex_wait_setup <-futex_wait
>      python-3490  [002]   259.046648: futex_wait_queue_me <-futex_wait
>      python-3490  [002]   259.046648: queue_me <-futex_wait_queue_me
>      python-3490  [002]   259.046664: schedule <-futex_wait_queue_me
>           ********* python-3490 was never added to the wake_list !!!!!!! 
> *********
> 
>      python-3495  [002]   259.046680: wake_futex_list: wake_futex_list: 
> waking python-3490

When 3495 finally get's to run and complete it's futex_wake() call, the task
still needs to be woken, so we wake it - but now it's enqueued with a different
futex_q, which now has a valid lock_ptr, so upon wake-up we expect a signal!

OK, I believe this establishes root cause.  Now to come up with a fix...

>      python-3490  [002]   259.046816: futex_wait: returning 1, non-futex 
> wakeup for python-3490
>      python-3490  [002]   259.046817: futex_wait: p->futex_wakeup: (null)
>      python-3490  [002]   259.046819: futex_wait: error in wake-up 
> detection, no signal pending for python-3490

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-13  4:54               ` Darren Hart
@ 2009-10-13  8:56                 ` Blaise Gassend
  2009-10-13 15:13                   ` Darren Hart
  0 siblings, 1 reply; 11+ messages in thread
From: Blaise Gassend @ 2009-10-13  8:56 UTC (permalink / raw)
  To: Darren Hart; +Cc: Jeremy Leibs, Thomas Gleixner, Peter Zijlstra, lkml,

> When 3495 finally get's to run and complete it's futex_wake() call, the task
> still needs to be woken, so we wake it - but now it's enqueued with a different
> futex_q, which now has a valid lock_ptr, so upon wake-up we expect a signal!
> 
> OK, I believe this establishes root cause.  Now to come up with a fix...

Wow, good work Darren! You definitely have more experience than I do at
tracking down these in-kernel races, and I'm glad we have you looking at
this. I'm snarfing up useful techniques from your progress emails.

So if I understand correctly, there is a race between wake_futex and a
timeout (or a signal, but AFAIK when my python example is running
steady-state there are no signals). The problem occurs when wake_futex
gets preempted half-way through, after it has NULLed lock_ptr, but
before it has woken up the waiter. If a timeout/signal wakes the waiter,
and the waiter runs and starts waiting again before the waker resumes,
then the waker will end up waking the waiter a second time, without the
lock_ptr for the second wait being NULLified. This causes the waiter to
mis-interpret what woke it and leads to the fault we observed.

Now I am wondering if the workaround described here
http://markmail.org/message/at2s337yz3wclaif
for what seems like this same problem isn't actually a legitimate fix.
It ends up looking something like this: (lines 3 and 4 are new)

  ret = -ERESTARTSYS;
  if (!abs_time) {
    if (!signal_pending(current))
      set_tsk_thread_flag(current,TIF_SIGPENDING);
    goto out_put_key;
  }

I think this works because the only bad thing that is happening in the
scenario you worked out is that the waiter failed to detect that the
second wake was spurious, and instead interpreted it as a signal.
However, the waiter can easily detect that the wakeup is spurious:
1) a futex wakeup can be detected by unqueue_me returning 0
2) a timeout can be detected by looking at to->task
3) a signal can be detected by calling signal_pending

futex_wait is already doing 1) and 2). If it additionally did 3), which
I think it is an inexpensive check, then it could detect the spurious
wakeup, and restart the wait, either by doing a
set_tsk_thread_flag(current,TIF_SIGPENDING) and relying on the signal
handling to restart the syscall as in the patch above, or by looping
back to somewhere around the futex_wait_setup call.

All the other fix ideas that come to mind end up slowing down the common
case by introducing additional locking in futex_wakeup, and hence seem
less desirable.

At this point I'll admit that I'm out of my depth and see what you
think...

Blaise

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-13  8:56                 ` Blaise Gassend
@ 2009-10-13 15:13                   ` Darren Hart
  2009-10-13 18:45                     ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Darren Hart @ 2009-10-13 15:13 UTC (permalink / raw)
  To: Blaise Gassend; +Cc: Jeremy Leibs, Thomas Gleixner, Peter Zijlstra, lkml,

Blaise Gassend wrote:
>> When 3495 finally get's to run and complete it's futex_wake() call, the task
>> still needs to be woken, so we wake it - but now it's enqueued with a different
>> futex_q, which now has a valid lock_ptr, so upon wake-up we expect a signal!
>>
>> OK, I believe this establishes root cause.  Now to come up with a fix...
> 
> Wow, good work Darren! You definitely have more experience than I do at
> tracking down these in-kernel races, and I'm glad we have you looking at
> this. I'm snarfing up useful techniques from your progress emails.

Great, I learn a lot from reading other people's status-type email as 
well.  Glad I can be on the contributing end once and a while :)

> 
> So if I understand correctly, there is a race between wake_futex and a
> timeout (or a signal, but AFAIK when my python example is running
> steady-state there are no signals). The problem occurs when wake_futex
> gets preempted half-way through, after it has NULLed lock_ptr, but
> before it has woken up the waiter. If a timeout/signal wakes the waiter,
> and the waiter runs and starts waiting again before the waker resumes,
> then the waker will end up waking the waiter a second time, without the
> lock_ptr for the second wait being NULLified. This causes the waiter to
> mis-interpret what woke it and leads to the fault we observed.
> 
> Now I am wondering if the workaround described here
> http://markmail.org/message/at2s337yz3wclaif
> for what seems like this same problem isn't actually a legitimate fix.
> It ends up looking something like this: (lines 3 and 4 are new)
> 
>   ret = -ERESTARTSYS;
>   if (!abs_time) {
>     if (!signal_pending(current))
>       set_tsk_thread_flag(current,TIF_SIGPENDING);
>     goto out_put_key;
>   }

The trouble with this is it is a bandaid to a fundamentally broken 
wake-up path. I tried flagging the waiters on the wake-list as already 
woken and then skipping them in the wake_futex_list(), but this got ugly 
really fast.

Talking with Thomas a bit more we're not sure the patch that introduced 
this lockless waking actually does any good, as the normal wakeup path 
doesn't take the hb->lock anyway, it's more likely the contention was 
due to an app like this that wakes a task and almost immediately puts it 
back to sleep on a futex before the waker has a chance to drop the hb->lock.

The futex wake-up path is complicated enough as it is, in my personal 
opinion, we are better off dropping the "lockless wake-up" patch and 
removing the race and simplifying the wake-up path at the same time.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: ERESTARTSYS escaping from sem_wait with RTLinux patch
  2009-10-13 15:13                   ` Darren Hart
@ 2009-10-13 18:45                     ` Thomas Gleixner
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2009-10-13 18:45 UTC (permalink / raw)
  To: Darren Hart; +Cc: Blaise Gassend, Jeremy Leibs, Peter Zijlstra, lkml,

On Tue, 13 Oct 2009, Darren Hart wrote:
> Talking with Thomas a bit more we're not sure the patch that introduced this
> lockless waking actually does any good, as the normal wakeup path doesn't take
> the hb->lock anyway, it's more likely the contention was due to an app like
> this that wakes a task and almost immediately puts it back to sleep on a futex
> before the waker has a chance to drop the hb->lock.

The patch was done before we restructured the wakeup code in mainline
and it's not necessary anymore.
 
> The futex wake-up path is complicated enough as it is, in my personal opinion,
> we are better off dropping the "lockless wake-up" patch and removing the race
> and simplifying the wake-up path at the same time.

Nevertheless we need graceful handling of spurious wakeups in the
waiter code. I'm working on a fix for mainline, which will be in the
next -rt release as well (after I reverted the no longer valid wakeup
optimization)

Thanks folks for debugging that.

       tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-10-13 18:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-10  9:09 ERESTARTSYS escaping from sem_wait with RTLinux patch Blaise Gassend
2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend
2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
2009-10-10 19:08   ` Jeremy Leibs
2009-10-11  2:07     ` Jeremy Leibs
2009-10-11  5:48   ` Jeremy Leibs
2009-10-12 14:16     ` Darren Hart
     [not found]       ` <1255384010.10236.123.camel@lts.willowgarage.com>
     [not found]         ` <4AD3BD57.6080703@us.ibm.com>
     [not found]           ` <4AD3D6AE.2050609@us.ibm.com>
     [not found]             ` <4AD3FFB0.5030405@us.ibm.com>
2009-10-13  4:54               ` Darren Hart
2009-10-13  8:56                 ` Blaise Gassend
2009-10-13 15:13                   ` Darren Hart
2009-10-13 18:45                     ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox