All of lore.kernel.org
 help / color / mirror / Atom feed
From: Philippe Gerum <rpm@xenomai.org>
To: Kim De Mey <kim.demey@gmail.com>
Cc: xenomai@xenomai.org
Subject: Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
Date: Fri, 10 Jan 2014 14:53:32 +0100	[thread overview]
Message-ID: <52CFFB5C.4030705@xenomai.org> (raw)
In-Reply-To: <CAKc2HpaQiKi2AGC0HoMGwN=oODSnFg8U6=P-u2R4f+-u7KiMog@mail.gmail.com>

On 01/10/2014 12:35 PM, Kim De Mey wrote:
> 2014/1/10 Philippe Gerum <rpm@xenomai.org>:
>> On 01/10/2014 10:25 AM, Philippe Gerum wrote:
>>>
>>> On 01/09/2014 11:29 AM, Kim De Mey wrote:
>>>>
>>>> 2014/1/9 Philippe Gerum <rpm@xenomai.org>:
>>>>>
>>>>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>>>>>
>>>>>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>
>>>>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>>>>
>>>>>>
>>>>>> Here are the backtraces:
>>>>>>
>>>>>> main thread:
>>>>>> (gdb) bt
>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>> #1  0x00b45e10 in threadobj_sleep () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #2  0x00b2b618 in tm_wkafter () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at
>>>>>> suspend_delete_easy.c:42
>>>>>>
>>>>>> first psos task:
>>>>>> (gdb) bt
>>>>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>>>>> #1  0x00b45c84 in threadobj_cancel () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #2  0x00b2ac44 in t_delete () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
>>>>>> #4  0x00b2a658 in task_trampoline () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>> pthread_create.c:299
>>>>>> #6  0x00c7df4c in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>
>>>>>> second psos task:
>>>>>> (gdb) bt
>>>>>> #0  0x00b6cbbc in read () from
>>>>>>
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>>>>>
>>>>>> #1  0x00b47bd8 in notifier_wait () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #2  0x00b460b0 in notifier_callback () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #3  0x00b47d30 in notifier_sighandler () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #4  <signal handler called>
>>>>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at
>>>>>> init.c:136
>>>>>> #6  <signal handler called>
>>>>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>> #8  0x00b45e10 in threadobj_sleep () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #9  0x00b2b618 in tm_wkafter () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at
>>>>>> suspend_delete_easy.c:9
>>>>>> #11 0x00b2a658 in task_trampoline () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>> pthread_create.c:299
>>>>>> #13 0x00c7df4c in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>
>>>>>>
>>>>>> Invocation command line was
>>>>>>
>>>>>>      $ ./configure --target=mips64-buildroot-linux-gnu
>>>>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>>>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>>>>> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
>>>>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>>>>> --enable-shared --with-core=mercury
>>>>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>>>>> --enable-lores-clock
>>>>>>
>>>>>>
>>>>>> Is this what you were asking?
>>>>>
>>>>>
>>>>> Yes, thanks.
>>>>>
>>>>>>
>>>>>>
>>>>>> By the way, I forgot the mention the following test that I did:
>>>>>>
>>>>>> When the application is blocked I've done an "echo 1 >
>>>>>> /proc/pid/fd/xx",
>>>>>> where xx is the file descriptor that is being read from in the
>>>>>> notifier_wait().
>>>>>> This kinda emulates a t_resume.
>>>>>>
>>>>>> After doing this the t_delete gets finalized and the application
>>>>>> unblocks.
>>>>>>
>>>>>
>>>>> This should fix the race, until the notifier is reworked.
>>>>>
>>>>> diff --git a/include/copperplate/threadobj.h
>>>>> b/include/copperplate/threadobj.h
>>>>> index 184c711..218c274 100644
>>>>> --- a/include/copperplate/threadobj.h
>>>>> +++ b/include/copperplate/threadobj.h
>>>>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct
>>>>> threadobj_corespec *corespec,
>>>>>    #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user
>>>>> code. */
>>>>>    #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via
>>>>> threadobj_suspend(). */
>>>>>    #define __THREAD_S_SAFE                (1 << 7)        /* TCB
>>>>> release deferred. */
>>>>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process
>>>>> ongoing. */
>>>>>    #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode
>>>>> enabled. */
>>>>>    /*
>>>>>     * threadobj->run_state, locklessly updated by "current", merged
>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>> index a31479c..c54cdf8 100644
>>>>> --- a/lib/copperplate/threadobj.c
>>>>> +++ b/lib/copperplate/threadobj.c
>>>>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct
>>>>> notifier *nf)
>>>>>            * threadobj_suspend().
>>>>>            */
>>>>>           threadobj_lock(current);
>>>>> -       current->status |= __THREAD_S_SUSPENDED;
>>>>> -       threadobj_unlock(current);
>>>>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>> -       threadobj_lock(current);
>>>>> -       current->status &= ~__THREAD_S_SUSPENDED;
>>>>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>>>>> +               current->status |= __THREAD_S_SUSPENDED;
>>>>> +               threadobj_unlock(current);
>>>>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>> +               threadobj_lock(current);
>>>>> +               current->status &= ~__THREAD_S_SUSPENDED;
>>>>> +       }
>>>>>           threadobj_unlock(current);
>>>>>    }
>>>>>
>>>>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct
>>>>> threadobj *thobj)
>>>>>    {
>>>>>    }
>>>>>
>>>>> -static inline void threadobj_cancel_corespec(struct threadobj *thobj)
>>>>> +static inline void threadobj_cancel_corespec(struct threadobj
>>>>> *thobj) /* thobj->lock held */
>>>>>    {
>>>>>    }
>>>>>
>>>>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>>>>
>>>>>    static void destroy_thread(struct threadobj *thobj)
>>>>>    {
>>>>> +       threadobj_cleanup_corespec(thobj);
>>>>>           __RT(pthread_cond_destroy(&thobj->barrier));
>>>>>           __RT(pthread_mutex_destroy(&thobj->lock));
>>>>> -       threadobj_cleanup_corespec(thobj);
>>>>>    }
>>>>>
>>>>>    void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
>>>>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj
>>>>> *thobj) /* thobj->lock held */
>>>>>           int oldstate, ret = 0;
>>>>>           sem_t *sem;
>>>>>
>>>>> -       __threadobj_check_locked(thobj);
>>>>> -
>>>>>           /*
>>>>>            * We have to allocate the cancel sync sema4 in the main heap
>>>>>            * dynamically, so that it always live in valid memory when we
>>>>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj
>>>>> *thobj) /* thobj->lock held */
>>>>>                   __STD(sem_init(sem, sem_scope_attribute, 0));
>>>>>
>>>>>           thobj->cancel_sem = sem;
>>>>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>>>>
>>>>>           /*
>>>>>            * If the thread to delete is warming up, wait until it
>>>>>
>>>>> --
>>>>> Philippe.
>>>>
>>>>
>>>>
>>>> I've tested the patch and it indeed fixes the problem. Thanks!
>>>>
>>>> For learning purpose, do you have an explanation as to why this happens?
>>>> Shouldn't the cancelling stop the blocking read? Or not in case the
>>>> read comes after the cancel?
>>>>
>>>
>>> Over mercury, since we don't have thread-directed suspend/resume support
>>> from the kernel, we force a particular thread into a suspended state by
>>> sending it a notification signal via the async file I/O mechanism
>>> (O_ASYNC). This way we can emulate a call like t_suspend() which
>>> requires task-directed, immediate and unconditional action.
>>>
>>> The signal handler then waits on a read() call until it receives a
>>> release message on the notifying I/O channel. But when that handler
>>> runs, SIGCANCEL is blocked, so the cancellation point is temporarily
>>> disabled. Conversely, when SIGCANCEL is first, the thread unwinds
>>> properly.
>>>
>>> Looking at the idle_task's status when the issue happens:
>>>
>>> {rpm@cobalt} grep ^Sig /proc/6029/status
>>> SigQ:    4/63608
>>> SigPnd:    0000000000000000
>>> ShdPnd:    0000000000000000
>>> SigBlk:    0000020080000000
>>>                   ^
>>> __SIGRTMIN, aka nptl's SIGCANCEL
>>>
>>
>> I'm unsure the spacing was right, so just in case:
>> __SIGRTMIN is 32, so bit #31 is relevant (i.e. 0x80000000).
>>
>
> I checked the Blocked Signals and bit #31 is blocked indeed, very
> useful to check this SigBlk.
> Thanks for explaining!
>

For the sake of completeness, this issue happens when SIGCANCEL
and the notification signal pile up, which explains why SIGCANCEL ends 
up being masked when read() is called by the notification handler.
This is confirmed by the backtrace you sent:

#0  0x00b6cbbc in read () from 
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
#1  0x00b47bd8 in notifier_wait () from 
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x00b460b0 in notifier_callback () from 
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x00b47d30 in notifier_sighandler () from 
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#4  <signal handler called>
#5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
#6  <signal handler called>
#7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
req=<optimized out>, rem=<optimized out>)
      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51

Which gives:

1. idle_task receives SIGCANCEL, unblocked from tm_wkafter(), starts 
running sigcancel_handler() from the nptl code (frames #7 -> #5). The 
kernel has blocked SIGCANCEL on entry to the handler to prevent 
recursion from the same signal.

2. before sigcancel_handler() unwinds, it is preempted by the 
notification signal (NOTIFYSIG), which eventually blocks in read() 
(frame #4 -> #0). This happens before the thread finalizer is called.

So we end up sleeping on a read(), which impedes the releasing of the 
sema4 cancel_sync() waits on.

Randomness was introduced by the timing of events: we have to see both 
SIGCANCEL and NOTIFYSIG pending at the same time for the target thread. 
In that case, the kernel first delivers SIGCANCEL, then NOTIFYSIG, by 
order of priority (__SIGRTMIN < SIGRTMIN + 8), creating a pile up 
situation. Raising the priority of the t_suspend()+t_delete() caller 
compared to idle_task makes the issue more likely, since the later will 
wake up last for handling both signals. Nice bug.

-- 
Philippe.


  reply	other threads:[~2014-01-10 13:53 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-08  9:25 [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete Kim De Mey
2014-01-08 10:04 ` Philippe Gerum
2014-01-08 12:23   ` Kim De Mey
2014-01-09  9:46     ` Philippe Gerum
2014-01-09 10:29       ` Kim De Mey
2014-01-10  9:25         ` Philippe Gerum
2014-01-10  9:30           ` Philippe Gerum
2014-01-10 11:35             ` Kim De Mey
2014-01-10 13:53               ` Philippe Gerum [this message]
2014-01-11 10:35                 ` Kim De Mey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52CFFB5C.4030705@xenomai.org \
    --to=rpm@xenomai.org \
    --cc=kim.demey@gmail.com \
    --cc=xenomai@xenomai.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.