All of lore.kernel.org
 help / color / mirror / Atom feed
* [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
@ 2014-01-08  9:25 Kim De Mey
  2014-01-08 10:04 ` Philippe Gerum
  0 siblings, 1 reply; 10+ messages in thread
From: Kim De Mey @ 2014-01-08  9:25 UTC (permalink / raw)
  To: xenomai

Hi,

I have an issue with a pSOS application that hangs after doing t_delete.
We are using Xenomai-forge with Mercury core.
The issue still occurs with the latest update.

I can reproduce the issue with a very simple test application.
The test application has a pSOS task that creates and starts another task.
Then it suspends and deletes this task.
The t_suspend() and t_delete() functions are called right after each other.
This is the crucial part for the issue to happen.

Example code:

static void idle_task(u_long a,u_long b,u_long c,u_long d)
{
  while (1) tm_wkafter(100);
}

static void test(u_long a,u_long b,u_long c,u_long d)
{
  u_long tid,args[4] = {0,0,0,0};

  t_create("IDLE",10,0,0,0,&tid);
  t_start(tid,0,idle_task, args);

  tm_wkafter(1000);

  t_suspend(tid);
  t_delete(tid);
  printf("After t_delete of suspended task\n");

  while (1) tm_wkafter(1000);
}

What I notice is that the cancel_sync() call in threadobj.c remains stuck
at the sem_wait() call.
The sem_post in finalize_thread() does not happen.
It even looks like the finalize_thread() is never called.
So it seems that for some reason the thread does not get cancelled (?).

The issue does not always occur in my simple test.
If I run it on more than 1 core it does not happen.
So it looks like something racy.
I notice the following difference in call order:
The issue occurs when notifier_callback(), the suspend callback,
happens after the sem_wait() (and thus after pthread_cancel()).
The issue does not occur when the callback is started before that the
sem_wait() call (or pthread_cancel()) occurs.
I am not sure if that has anything to do with it.

Can somebody have a look at this or give me some pointers because
I had a look at the code and do not understand what could be causing this.

Regards,
Kim


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-08  9:25 [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete Kim De Mey
@ 2014-01-08 10:04 ` Philippe Gerum
  2014-01-08 12:23   ` Kim De Mey
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2014-01-08 10:04 UTC (permalink / raw)
  To: Kim De Mey, xenomai

On 01/08/2014 10:25 AM, Kim De Mey wrote:
> Hi,
>
> I have an issue with a pSOS application that hangs after doing t_delete.
> We are using Xenomai-forge with Mercury core.
> The issue still occurs with the latest update.
>
> I can reproduce the issue with a very simple test application.
> The test application has a pSOS task that creates and starts another task.
> Then it suspends and deletes this task.
> The t_suspend() and t_delete() functions are called right after each other.
> This is the crucial part for the issue to happen.
>
> Example code:
>
> static void idle_task(u_long a,u_long b,u_long c,u_long d)
> {
>    while (1) tm_wkafter(100);
> }
>
> static void test(u_long a,u_long b,u_long c,u_long d)
> {
>    u_long tid,args[4] = {0,0,0,0};
>
>    t_create("IDLE",10,0,0,0,&tid);
>    t_start(tid,0,idle_task, args);
>
>    tm_wkafter(1000);
>
>    t_suspend(tid);
>    t_delete(tid);
>    printf("After t_delete of suspended task\n");
>
>    while (1) tm_wkafter(1000);
> }
>
> What I notice is that the cancel_sync() call in threadobj.c remains stuck
> at the sem_wait() call.
> The sem_post in finalize_thread() does not happen.
> It even looks like the finalize_thread() is never called.
> So it seems that for some reason the thread does not get cancelled (?).
>
> The issue does not always occur in my simple test.
> If I run it on more than 1 core it does not happen.
> So it looks like something racy.
> I notice the following difference in call order:
> The issue occurs when notifier_callback(), the suspend callback,
> happens after the sem_wait() (and thus after pthread_cancel()).
> The issue does not occur when the callback is started before that the
> sem_wait() call (or pthread_cancel()) occurs.
> I am not sure if that has anything to do with it.
>
> Can somebody have a look at this or give me some pointers because
> I had a look at the code and do not understand what could be causing this.
>

When your test hangs, could you attach gdb to the process, then dump the 
backtraces for all threads which are still present?

Also, the exact invocation command line of the build configuration 
script as available from config.log, would help.

TIA,

-- 
Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-08 10:04 ` Philippe Gerum
@ 2014-01-08 12:23   ` Kim De Mey
  2014-01-09  9:46     ` Philippe Gerum
  0 siblings, 1 reply; 10+ messages in thread
From: Kim De Mey @ 2014-01-08 12:23 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

2014/1/8 Philippe Gerum <rpm@xenomai.org>:
> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>
>> Hi,
>>
>> I have an issue with a pSOS application that hangs after doing t_delete.
>> We are using Xenomai-forge with Mercury core.
>> The issue still occurs with the latest update.
>>
>> I can reproduce the issue with a very simple test application.
>> The test application has a pSOS task that creates and starts another task.
>> Then it suspends and deletes this task.
>> The t_suspend() and t_delete() functions are called right after each
>> other.
>> This is the crucial part for the issue to happen.
>>
>> Example code:
>>
>> static void idle_task(u_long a,u_long b,u_long c,u_long d)
>> {
>>    while (1) tm_wkafter(100);
>> }
>>
>> static void test(u_long a,u_long b,u_long c,u_long d)
>> {
>>    u_long tid,args[4] = {0,0,0,0};
>>
>>    t_create("IDLE",10,0,0,0,&tid);
>>    t_start(tid,0,idle_task, args);
>>
>>    tm_wkafter(1000);
>>
>>    t_suspend(tid);
>>    t_delete(tid);
>>    printf("After t_delete of suspended task\n");
>>
>>    while (1) tm_wkafter(1000);
>> }
>>
>> What I notice is that the cancel_sync() call in threadobj.c remains stuck
>> at the sem_wait() call.
>> The sem_post in finalize_thread() does not happen.
>> It even looks like the finalize_thread() is never called.
>> So it seems that for some reason the thread does not get cancelled (?).
>>
>> The issue does not always occur in my simple test.
>> If I run it on more than 1 core it does not happen.
>> So it looks like something racy.
>> I notice the following difference in call order:
>> The issue occurs when notifier_callback(), the suspend callback,
>> happens after the sem_wait() (and thus after pthread_cancel()).
>> The issue does not occur when the callback is started before that the
>> sem_wait() call (or pthread_cancel()) occurs.
>> I am not sure if that has anything to do with it.
>>
>> Can somebody have a look at this or give me some pointers because
>> I had a look at the code and do not understand what could be causing this.
>>
>
> When your test hangs, could you attach gdb to the process, then dump the
> backtraces for all threads which are still present?
>
> Also, the exact invocation command line of the build configuration script as
> available from config.log, would help.
>
> TIA,
>
> --
> Philippe.


Here are the backtraces:

main thread:
(gdb) bt
#0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
#1  0x00b45e10 in threadobj_sleep () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x00b2b618 in tm_wkafter () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#3  0x10000b3c in main (argc=1, argv=0x104120e0) at suspend_delete_easy.c:42

first psos task:
(gdb) bt
#0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
#1  0x00b45c84 in threadobj_cancel () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x00b2ac44 in t_delete () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
#4  0x00b2a658 in task_trampoline () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#5  0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
#6  0x00c7df4c in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6

second psos task:
(gdb) bt
#0  0x00b6cbbc in read () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
#1  0x00b47bd8 in notifier_wait () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x00b460b0 in notifier_callback () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x00b47d30 in notifier_sighandler () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#4  <signal handler called>
#5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
#6  <signal handler called>
#7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
req=<optimized out>, rem=<optimized out>)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
#8  0x00b45e10 in threadobj_sleep () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#9  0x00b2b618 in tm_wkafter () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:9
#11 0x00b2a658 in task_trampoline () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
#12 0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
#13 0x00c7df4c in __thread_start () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6


Invocation command line was

  $ ./configure --target=mips64-buildroot-linux-gnu
--host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
--prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
--disable-gtk-doc --disable-doc --disable-docs --disable-documentation
--with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
--enable-shared --with-core=mercury
--includedir=/usr/include/xenomai-forge --disable-doc-install
--enable-lores-clock


Is this what you were asking?


By the way, I forgot the mention the following test that I did:

When the application is blocked I've done an "echo 1 > /proc/pid/fd/xx",
where xx is the file descriptor that is being read from in the notifier_wait().
This kinda emulates a t_resume.

After doing this the t_delete gets finalized and the application unblocks.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-08 12:23   ` Kim De Mey
@ 2014-01-09  9:46     ` Philippe Gerum
  2014-01-09 10:29       ` Kim De Mey
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2014-01-09  9:46 UTC (permalink / raw)
  To: Kim De Mey; +Cc: xenomai

On 01/08/2014 01:23 PM, Kim De Mey wrote:
> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
> 
> Here are the backtraces:
> 
> main thread:
> (gdb) bt
> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> req=<optimized out>, rem=<optimized out>)
>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
> #1  0x00b45e10 in threadobj_sleep () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x00b2b618 in tm_wkafter () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at suspend_delete_easy.c:42
> 
> first psos task:
> (gdb) bt
> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
> #1  0x00b45c84 in threadobj_cancel () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x00b2ac44 in t_delete () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
> #4  0x00b2a658 in task_trampoline () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #5  0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
> #6  0x00c7df4c in __thread_start () from
> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> 
> second psos task:
> (gdb) bt
> #0  0x00b6cbbc in read () from
> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
> #1  0x00b47bd8 in notifier_wait () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x00b460b0 in notifier_callback () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #3  0x00b47d30 in notifier_sighandler () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #4  <signal handler called>
> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
> #6  <signal handler called>
> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> req=<optimized out>, rem=<optimized out>)
>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
> #8  0x00b45e10 in threadobj_sleep () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #9  0x00b2b618 in tm_wkafter () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:9
> #11 0x00b2a658 in task_trampoline () from
> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
> #12 0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
> #13 0x00c7df4c in __thread_start () from
> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
> 
> 
> Invocation command line was
> 
>    $ ./configure --target=mips64-buildroot-linux-gnu
> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
> --enable-shared --with-core=mercury
> --includedir=/usr/include/xenomai-forge --disable-doc-install
> --enable-lores-clock
> 
> 
> Is this what you were asking?

Yes, thanks.

> 
> 
> By the way, I forgot the mention the following test that I did:
> 
> When the application is blocked I've done an "echo 1 > /proc/pid/fd/xx",
> where xx is the file descriptor that is being read from in the notifier_wait().
> This kinda emulates a t_resume.
> 
> After doing this the t_delete gets finalized and the application unblocks.
> 

This should fix the race, until the notifier is reworked.

diff --git a/include/copperplate/threadobj.h b/include/copperplate/threadobj.h
index 184c711..218c274 100644
--- a/include/copperplate/threadobj.h
+++ b/include/copperplate/threadobj.h
@@ -124,6 +124,7 @@ void threadobj_save_timeout(struct threadobj_corespec *corespec,
 #define __THREAD_S_ACTIVE	(1 << 5)	/* Running user code. */
 #define __THREAD_S_SUSPENDED	(1 << 6)	/* Suspended via threadobj_suspend(). */
 #define __THREAD_S_SAFE		(1 << 7)	/* TCB release deferred. */
+#define __THREAD_S_ZOMBIE	(1 << 8)	/* Deletion process ongoing. */
 #define __THREAD_S_DEBUG	(1 << 31)	/* Debug mode enabled. */
 /*
  * threadobj->run_state, locklessly updated by "current", merged
diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
index a31479c..c54cdf8 100644
--- a/lib/copperplate/threadobj.c
+++ b/lib/copperplate/threadobj.c
@@ -420,11 +420,13 @@ static void notifier_callback(const struct notifier *nf)
 	 * threadobj_suspend().
 	 */
 	threadobj_lock(current);
-	current->status |= __THREAD_S_SUSPENDED;
-	threadobj_unlock(current);
-	notifier_wait(nf); /* Wait for threadobj_resume(). */
-	threadobj_lock(current);
-	current->status &= ~__THREAD_S_SUSPENDED;
+	if ((current->status & __THREAD_S_ZOMBIE) == 0) {
+		current->status |= __THREAD_S_SUSPENDED;
+		threadobj_unlock(current);
+		notifier_wait(nf); /* Wait for threadobj_resume(). */
+		threadobj_lock(current);
+		current->status &= ~__THREAD_S_SUSPENDED;
+	}
 	threadobj_unlock(current);
 }
 
@@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct threadobj *thobj)
 {
 }
 
-static inline void threadobj_cancel_corespec(struct threadobj *thobj)
+static inline void threadobj_cancel_corespec(struct threadobj *thobj) /* thobj->lock held */
 {
 }
 
@@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
 
 static void destroy_thread(struct threadobj *thobj)
 {
+	threadobj_cleanup_corespec(thobj);
 	__RT(pthread_cond_destroy(&thobj->barrier));
 	__RT(pthread_mutex_destroy(&thobj->lock));
-	threadobj_cleanup_corespec(thobj);
 }
 
 void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
@@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
 	int oldstate, ret = 0;
 	sem_t *sem;
 
-	__threadobj_check_locked(thobj);
-
 	/*
 	 * We have to allocate the cancel sync sema4 in the main heap
 	 * dynamically, so that it always live in valid memory when we
@@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
 		__STD(sem_init(sem, sem_scope_attribute, 0));
 
 	thobj->cancel_sem = sem;
+	thobj->status |= __THREAD_S_ZOMBIE;
 
 	/*
 	 * If the thread to delete is warming up, wait until it

-- 
Philippe.


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-09  9:46     ` Philippe Gerum
@ 2014-01-09 10:29       ` Kim De Mey
  2014-01-10  9:25         ` Philippe Gerum
  0 siblings, 1 reply; 10+ messages in thread
From: Kim De Mey @ 2014-01-09 10:29 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

2014/1/9 Philippe Gerum <rpm@xenomai.org>:
> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>
>> Here are the backtraces:
>>
>> main thread:
>> (gdb) bt
>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>> req=<optimized out>, rem=<optimized out>)
>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>> #1  0x00b45e10 in threadobj_sleep () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #2  0x00b2b618 in tm_wkafter () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at suspend_delete_easy.c:42
>>
>> first psos task:
>> (gdb) bt
>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>> #1  0x00b45c84 in threadobj_cancel () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #2  0x00b2ac44 in t_delete () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
>> #4  0x00b2a658 in task_trampoline () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #5  0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
>> #6  0x00c7df4c in __thread_start () from
>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>
>> second psos task:
>> (gdb) bt
>> #0  0x00b6cbbc in read () from
>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>> #1  0x00b47bd8 in notifier_wait () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #2  0x00b460b0 in notifier_callback () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #3  0x00b47d30 in notifier_sighandler () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #4  <signal handler called>
>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
>> #6  <signal handler called>
>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>> req=<optimized out>, rem=<optimized out>)
>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>> #8  0x00b45e10 in threadobj_sleep () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>> #9  0x00b2b618 in tm_wkafter () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:9
>> #11 0x00b2a658 in task_trampoline () from
>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>> #12 0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
>> #13 0x00c7df4c in __thread_start () from
>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>
>>
>> Invocation command line was
>>
>>    $ ./configure --target=mips64-buildroot-linux-gnu
>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>> --enable-shared --with-core=mercury
>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>> --enable-lores-clock
>>
>>
>> Is this what you were asking?
>
> Yes, thanks.
>
>>
>>
>> By the way, I forgot the mention the following test that I did:
>>
>> When the application is blocked I've done an "echo 1 > /proc/pid/fd/xx",
>> where xx is the file descriptor that is being read from in the notifier_wait().
>> This kinda emulates a t_resume.
>>
>> After doing this the t_delete gets finalized and the application unblocks.
>>
>
> This should fix the race, until the notifier is reworked.
>
> diff --git a/include/copperplate/threadobj.h b/include/copperplate/threadobj.h
> index 184c711..218c274 100644
> --- a/include/copperplate/threadobj.h
> +++ b/include/copperplate/threadobj.h
> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct threadobj_corespec *corespec,
>  #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user code. */
>  #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via threadobj_suspend(). */
>  #define __THREAD_S_SAFE                (1 << 7)        /* TCB release deferred. */
> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process ongoing. */
>  #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode enabled. */
>  /*
>   * threadobj->run_state, locklessly updated by "current", merged
> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
> index a31479c..c54cdf8 100644
> --- a/lib/copperplate/threadobj.c
> +++ b/lib/copperplate/threadobj.c
> @@ -420,11 +420,13 @@ static void notifier_callback(const struct notifier *nf)
>          * threadobj_suspend().
>          */
>         threadobj_lock(current);
> -       current->status |= __THREAD_S_SUSPENDED;
> -       threadobj_unlock(current);
> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
> -       threadobj_lock(current);
> -       current->status &= ~__THREAD_S_SUSPENDED;
> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
> +               current->status |= __THREAD_S_SUSPENDED;
> +               threadobj_unlock(current);
> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
> +               threadobj_lock(current);
> +               current->status &= ~__THREAD_S_SUSPENDED;
> +       }
>         threadobj_unlock(current);
>  }
>
> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct threadobj *thobj)
>  {
>  }
>
> -static inline void threadobj_cancel_corespec(struct threadobj *thobj)
> +static inline void threadobj_cancel_corespec(struct threadobj *thobj) /* thobj->lock held */
>  {
>  }
>
> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>
>  static void destroy_thread(struct threadobj *thobj)
>  {
> +       threadobj_cleanup_corespec(thobj);
>         __RT(pthread_cond_destroy(&thobj->barrier));
>         __RT(pthread_mutex_destroy(&thobj->lock));
> -       threadobj_cleanup_corespec(thobj);
>  }
>
>  void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>         int oldstate, ret = 0;
>         sem_t *sem;
>
> -       __threadobj_check_locked(thobj);
> -
>         /*
>          * We have to allocate the cancel sync sema4 in the main heap
>          * dynamically, so that it always live in valid memory when we
> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>                 __STD(sem_init(sem, sem_scope_attribute, 0));
>
>         thobj->cancel_sem = sem;
> +       thobj->status |= __THREAD_S_ZOMBIE;
>
>         /*
>          * If the thread to delete is warming up, wait until it
>
> --
> Philippe.


I've tested the patch and it indeed fixes the problem. Thanks!

For learning purpose, do you have an explanation as to why this happens?
Shouldn't the cancelling stop the blocking read? Or not in case the
read comes after the cancel?

Regards,
Kim


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-09 10:29       ` Kim De Mey
@ 2014-01-10  9:25         ` Philippe Gerum
  2014-01-10  9:30           ` Philippe Gerum
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2014-01-10  9:25 UTC (permalink / raw)
  To: Kim De Mey; +Cc: xenomai

On 01/09/2014 11:29 AM, Kim De Mey wrote:
> 2014/1/9 Philippe Gerum <rpm@xenomai.org>:
>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>
>>> Here are the backtraces:
>>>
>>> main thread:
>>> (gdb) bt
>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>> req=<optimized out>, rem=<optimized out>)
>>>       at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>> #1  0x00b45e10 in threadobj_sleep () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>> #2  0x00b2b618 in tm_wkafter () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at suspend_delete_easy.c:42
>>>
>>> first psos task:
>>> (gdb) bt
>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>> #1  0x00b45c84 in threadobj_cancel () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>> #2  0x00b2ac44 in t_delete () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
>>> #4  0x00b2a658 in task_trampoline () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
>>> #6  0x00c7df4c in __thread_start () from
>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>
>>> second psos task:
>>> (gdb) bt
>>> #0  0x00b6cbbc in read () from
>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>> #1  0x00b47bd8 in notifier_wait () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>> #2  0x00b460b0 in notifier_callback () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>> #3  0x00b47d30 in notifier_sighandler () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>> #4  <signal handler called>
>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
>>> #6  <signal handler called>
>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>> req=<optimized out>, rem=<optimized out>)
>>>       at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>> #8  0x00b45e10 in threadobj_sleep () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>> #9  0x00b2b618 in tm_wkafter () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:9
>>> #11 0x00b2a658 in task_trampoline () from
>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at pthread_create.c:299
>>> #13 0x00c7df4c in __thread_start () from
>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>
>>>
>>> Invocation command line was
>>>
>>>     $ ./configure --target=mips64-buildroot-linux-gnu
>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>> --enable-shared --with-core=mercury
>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>> --enable-lores-clock
>>>
>>>
>>> Is this what you were asking?
>>
>> Yes, thanks.
>>
>>>
>>>
>>> By the way, I forgot the mention the following test that I did:
>>>
>>> When the application is blocked I've done an "echo 1 > /proc/pid/fd/xx",
>>> where xx is the file descriptor that is being read from in the notifier_wait().
>>> This kinda emulates a t_resume.
>>>
>>> After doing this the t_delete gets finalized and the application unblocks.
>>>
>>
>> This should fix the race, until the notifier is reworked.
>>
>> diff --git a/include/copperplate/threadobj.h b/include/copperplate/threadobj.h
>> index 184c711..218c274 100644
>> --- a/include/copperplate/threadobj.h
>> +++ b/include/copperplate/threadobj.h
>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct threadobj_corespec *corespec,
>>   #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user code. */
>>   #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via threadobj_suspend(). */
>>   #define __THREAD_S_SAFE                (1 << 7)        /* TCB release deferred. */
>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process ongoing. */
>>   #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode enabled. */
>>   /*
>>    * threadobj->run_state, locklessly updated by "current", merged
>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>> index a31479c..c54cdf8 100644
>> --- a/lib/copperplate/threadobj.c
>> +++ b/lib/copperplate/threadobj.c
>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct notifier *nf)
>>           * threadobj_suspend().
>>           */
>>          threadobj_lock(current);
>> -       current->status |= __THREAD_S_SUSPENDED;
>> -       threadobj_unlock(current);
>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>> -       threadobj_lock(current);
>> -       current->status &= ~__THREAD_S_SUSPENDED;
>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>> +               current->status |= __THREAD_S_SUSPENDED;
>> +               threadobj_unlock(current);
>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>> +               threadobj_lock(current);
>> +               current->status &= ~__THREAD_S_SUSPENDED;
>> +       }
>>          threadobj_unlock(current);
>>   }
>>
>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct threadobj *thobj)
>>   {
>>   }
>>
>> -static inline void threadobj_cancel_corespec(struct threadobj *thobj)
>> +static inline void threadobj_cancel_corespec(struct threadobj *thobj) /* thobj->lock held */
>>   {
>>   }
>>
>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>
>>   static void destroy_thread(struct threadobj *thobj)
>>   {
>> +       threadobj_cleanup_corespec(thobj);
>>          __RT(pthread_cond_destroy(&thobj->barrier));
>>          __RT(pthread_mutex_destroy(&thobj->lock));
>> -       threadobj_cleanup_corespec(thobj);
>>   }
>>
>>   void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>          int oldstate, ret = 0;
>>          sem_t *sem;
>>
>> -       __threadobj_check_locked(thobj);
>> -
>>          /*
>>           * We have to allocate the cancel sync sema4 in the main heap
>>           * dynamically, so that it always live in valid memory when we
>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>                  __STD(sem_init(sem, sem_scope_attribute, 0));
>>
>>          thobj->cancel_sem = sem;
>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>
>>          /*
>>           * If the thread to delete is warming up, wait until it
>>
>> --
>> Philippe.
>
>
> I've tested the patch and it indeed fixes the problem. Thanks!
>
> For learning purpose, do you have an explanation as to why this happens?
> Shouldn't the cancelling stop the blocking read? Or not in case the
> read comes after the cancel?
>

Over mercury, since we don't have thread-directed suspend/resume support 
from the kernel, we force a particular thread into a suspended state by 
sending it a notification signal via the async file I/O mechanism 
(O_ASYNC). This way we can emulate a call like t_suspend() which 
requires task-directed, immediate and unconditional action.

The signal handler then waits on a read() call until it receives a 
release message on the notifying I/O channel. But when that handler 
runs, SIGCANCEL is blocked, so the cancellation point is temporarily 
disabled. Conversely, when SIGCANCEL is first, the thread unwinds properly.

Looking at the idle_task's status when the issue happens:

{rpm@cobalt} grep ^Sig /proc/6029/status
SigQ:	4/63608
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000020080000000
                 ^
__SIGRTMIN, aka nptl's SIGCANCEL

SigIgn:	0000000004000000
SigCgt:	0000060182000000

-- 
Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-10  9:25         ` Philippe Gerum
@ 2014-01-10  9:30           ` Philippe Gerum
  2014-01-10 11:35             ` Kim De Mey
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2014-01-10  9:30 UTC (permalink / raw)
  To: Kim De Mey; +Cc: xenomai

On 01/10/2014 10:25 AM, Philippe Gerum wrote:
> On 01/09/2014 11:29 AM, Kim De Mey wrote:
>> 2014/1/9 Philippe Gerum <rpm@xenomai.org>:
>>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>>
>>>> Here are the backtraces:
>>>>
>>>> main thread:
>>>> (gdb) bt
>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>> req=<optimized out>, rem=<optimized out>)
>>>>       at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>> #1  0x00b45e10 in threadobj_sleep () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>
>>>> #2  0x00b2b618 in tm_wkafter () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>
>>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at
>>>> suspend_delete_easy.c:42
>>>>
>>>> first psos task:
>>>> (gdb) bt
>>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>>> #1  0x00b45c84 in threadobj_cancel () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>
>>>> #2  0x00b2ac44 in t_delete () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>
>>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
>>>> #4  0x00b2a658 in task_trampoline () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>
>>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at
>>>> pthread_create.c:299
>>>> #6  0x00c7df4c in __thread_start () from
>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>
>>>> second psos task:
>>>> (gdb) bt
>>>> #0  0x00b6cbbc in read () from
>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>>>
>>>> #1  0x00b47bd8 in notifier_wait () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>
>>>> #2  0x00b460b0 in notifier_callback () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>
>>>> #3  0x00b47d30 in notifier_sighandler () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>
>>>> #4  <signal handler called>
>>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at
>>>> init.c:136
>>>> #6  <signal handler called>
>>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>> req=<optimized out>, rem=<optimized out>)
>>>>       at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>> #8  0x00b45e10 in threadobj_sleep () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>
>>>> #9  0x00b2b618 in tm_wkafter () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>
>>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at
>>>> suspend_delete_easy.c:9
>>>> #11 0x00b2a658 in task_trampoline () from
>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>
>>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at
>>>> pthread_create.c:299
>>>> #13 0x00c7df4c in __thread_start () from
>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>
>>>>
>>>> Invocation command line was
>>>>
>>>>     $ ./configure --target=mips64-buildroot-linux-gnu
>>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>>> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
>>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>>> --enable-shared --with-core=mercury
>>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>>> --enable-lores-clock
>>>>
>>>>
>>>> Is this what you were asking?
>>>
>>> Yes, thanks.
>>>
>>>>
>>>>
>>>> By the way, I forgot the mention the following test that I did:
>>>>
>>>> When the application is blocked I've done an "echo 1 >
>>>> /proc/pid/fd/xx",
>>>> where xx is the file descriptor that is being read from in the
>>>> notifier_wait().
>>>> This kinda emulates a t_resume.
>>>>
>>>> After doing this the t_delete gets finalized and the application
>>>> unblocks.
>>>>
>>>
>>> This should fix the race, until the notifier is reworked.
>>>
>>> diff --git a/include/copperplate/threadobj.h
>>> b/include/copperplate/threadobj.h
>>> index 184c711..218c274 100644
>>> --- a/include/copperplate/threadobj.h
>>> +++ b/include/copperplate/threadobj.h
>>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct
>>> threadobj_corespec *corespec,
>>>   #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user
>>> code. */
>>>   #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via
>>> threadobj_suspend(). */
>>>   #define __THREAD_S_SAFE                (1 << 7)        /* TCB
>>> release deferred. */
>>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process
>>> ongoing. */
>>>   #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode
>>> enabled. */
>>>   /*
>>>    * threadobj->run_state, locklessly updated by "current", merged
>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>> index a31479c..c54cdf8 100644
>>> --- a/lib/copperplate/threadobj.c
>>> +++ b/lib/copperplate/threadobj.c
>>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct
>>> notifier *nf)
>>>           * threadobj_suspend().
>>>           */
>>>          threadobj_lock(current);
>>> -       current->status |= __THREAD_S_SUSPENDED;
>>> -       threadobj_unlock(current);
>>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>>> -       threadobj_lock(current);
>>> -       current->status &= ~__THREAD_S_SUSPENDED;
>>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>>> +               current->status |= __THREAD_S_SUSPENDED;
>>> +               threadobj_unlock(current);
>>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>>> +               threadobj_lock(current);
>>> +               current->status &= ~__THREAD_S_SUSPENDED;
>>> +       }
>>>          threadobj_unlock(current);
>>>   }
>>>
>>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct
>>> threadobj *thobj)
>>>   {
>>>   }
>>>
>>> -static inline void threadobj_cancel_corespec(struct threadobj *thobj)
>>> +static inline void threadobj_cancel_corespec(struct threadobj
>>> *thobj) /* thobj->lock held */
>>>   {
>>>   }
>>>
>>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>>
>>>   static void destroy_thread(struct threadobj *thobj)
>>>   {
>>> +       threadobj_cleanup_corespec(thobj);
>>>          __RT(pthread_cond_destroy(&thobj->barrier));
>>>          __RT(pthread_mutex_destroy(&thobj->lock));
>>> -       threadobj_cleanup_corespec(thobj);
>>>   }
>>>
>>>   void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
>>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj
>>> *thobj) /* thobj->lock held */
>>>          int oldstate, ret = 0;
>>>          sem_t *sem;
>>>
>>> -       __threadobj_check_locked(thobj);
>>> -
>>>          /*
>>>           * We have to allocate the cancel sync sema4 in the main heap
>>>           * dynamically, so that it always live in valid memory when we
>>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj
>>> *thobj) /* thobj->lock held */
>>>                  __STD(sem_init(sem, sem_scope_attribute, 0));
>>>
>>>          thobj->cancel_sem = sem;
>>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>>
>>>          /*
>>>           * If the thread to delete is warming up, wait until it
>>>
>>> --
>>> Philippe.
>>
>>
>> I've tested the patch and it indeed fixes the problem. Thanks!
>>
>> For learning purpose, do you have an explanation as to why this happens?
>> Shouldn't the cancelling stop the blocking read? Or not in case the
>> read comes after the cancel?
>>
>
> Over mercury, since we don't have thread-directed suspend/resume support
> from the kernel, we force a particular thread into a suspended state by
> sending it a notification signal via the async file I/O mechanism
> (O_ASYNC). This way we can emulate a call like t_suspend() which
> requires task-directed, immediate and unconditional action.
>
> The signal handler then waits on a read() call until it receives a
> release message on the notifying I/O channel. But when that handler
> runs, SIGCANCEL is blocked, so the cancellation point is temporarily
> disabled. Conversely, when SIGCANCEL is first, the thread unwinds properly.
>
> Looking at the idle_task's status when the issue happens:
>
> {rpm@cobalt} grep ^Sig /proc/6029/status
> SigQ:    4/63608
> SigPnd:    0000000000000000
> ShdPnd:    0000000000000000
> SigBlk:    0000020080000000
>                  ^
> __SIGRTMIN, aka nptl's SIGCANCEL
>

I'm unsure the spacing was right, so just in case:
__SIGRTMIN is 32, so bit #31 is relevant (i.e. 0x80000000).

-- 
Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-10  9:30           ` Philippe Gerum
@ 2014-01-10 11:35             ` Kim De Mey
  2014-01-10 13:53               ` Philippe Gerum
  0 siblings, 1 reply; 10+ messages in thread
From: Kim De Mey @ 2014-01-10 11:35 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

2014/1/10 Philippe Gerum <rpm@xenomai.org>:
> On 01/10/2014 10:25 AM, Philippe Gerum wrote:
>>
>> On 01/09/2014 11:29 AM, Kim De Mey wrote:
>>>
>>> 2014/1/9 Philippe Gerum <rpm@xenomai.org>:
>>>>
>>>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>>>>
>>>>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>>>>>
>>>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>>>
>>>>>
>>>>> Here are the backtraces:
>>>>>
>>>>> main thread:
>>>>> (gdb) bt
>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>       at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>> #1  0x00b45e10 in threadobj_sleep () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>
>>>>> #2  0x00b2b618 in tm_wkafter () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>
>>>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at
>>>>> suspend_delete_easy.c:42
>>>>>
>>>>> first psos task:
>>>>> (gdb) bt
>>>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>>>> #1  0x00b45c84 in threadobj_cancel () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>
>>>>> #2  0x00b2ac44 in t_delete () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>
>>>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
>>>>> #4  0x00b2a658 in task_trampoline () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>
>>>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at
>>>>> pthread_create.c:299
>>>>> #6  0x00c7df4c in __thread_start () from
>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>
>>>>> second psos task:
>>>>> (gdb) bt
>>>>> #0  0x00b6cbbc in read () from
>>>>>
>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>>>>
>>>>> #1  0x00b47bd8 in notifier_wait () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>
>>>>> #2  0x00b460b0 in notifier_callback () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>
>>>>> #3  0x00b47d30 in notifier_sighandler () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>
>>>>> #4  <signal handler called>
>>>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at
>>>>> init.c:136
>>>>> #6  <signal handler called>
>>>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>       at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>> #8  0x00b45e10 in threadobj_sleep () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>
>>>>> #9  0x00b2b618 in tm_wkafter () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>
>>>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at
>>>>> suspend_delete_easy.c:9
>>>>> #11 0x00b2a658 in task_trampoline () from
>>>>>
>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>
>>>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at
>>>>> pthread_create.c:299
>>>>> #13 0x00c7df4c in __thread_start () from
>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>
>>>>>
>>>>> Invocation command line was
>>>>>
>>>>>     $ ./configure --target=mips64-buildroot-linux-gnu
>>>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>>>> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
>>>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>>>> --enable-shared --with-core=mercury
>>>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>>>> --enable-lores-clock
>>>>>
>>>>>
>>>>> Is this what you were asking?
>>>>
>>>>
>>>> Yes, thanks.
>>>>
>>>>>
>>>>>
>>>>> By the way, I forgot the mention the following test that I did:
>>>>>
>>>>> When the application is blocked I've done an "echo 1 >
>>>>> /proc/pid/fd/xx",
>>>>> where xx is the file descriptor that is being read from in the
>>>>> notifier_wait().
>>>>> This kinda emulates a t_resume.
>>>>>
>>>>> After doing this the t_delete gets finalized and the application
>>>>> unblocks.
>>>>>
>>>>
>>>> This should fix the race, until the notifier is reworked.
>>>>
>>>> diff --git a/include/copperplate/threadobj.h
>>>> b/include/copperplate/threadobj.h
>>>> index 184c711..218c274 100644
>>>> --- a/include/copperplate/threadobj.h
>>>> +++ b/include/copperplate/threadobj.h
>>>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct
>>>> threadobj_corespec *corespec,
>>>>   #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user
>>>> code. */
>>>>   #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via
>>>> threadobj_suspend(). */
>>>>   #define __THREAD_S_SAFE                (1 << 7)        /* TCB
>>>> release deferred. */
>>>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process
>>>> ongoing. */
>>>>   #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode
>>>> enabled. */
>>>>   /*
>>>>    * threadobj->run_state, locklessly updated by "current", merged
>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>> index a31479c..c54cdf8 100644
>>>> --- a/lib/copperplate/threadobj.c
>>>> +++ b/lib/copperplate/threadobj.c
>>>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct
>>>> notifier *nf)
>>>>           * threadobj_suspend().
>>>>           */
>>>>          threadobj_lock(current);
>>>> -       current->status |= __THREAD_S_SUSPENDED;
>>>> -       threadobj_unlock(current);
>>>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>> -       threadobj_lock(current);
>>>> -       current->status &= ~__THREAD_S_SUSPENDED;
>>>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>>>> +               current->status |= __THREAD_S_SUSPENDED;
>>>> +               threadobj_unlock(current);
>>>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>> +               threadobj_lock(current);
>>>> +               current->status &= ~__THREAD_S_SUSPENDED;
>>>> +       }
>>>>          threadobj_unlock(current);
>>>>   }
>>>>
>>>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct
>>>> threadobj *thobj)
>>>>   {
>>>>   }
>>>>
>>>> -static inline void threadobj_cancel_corespec(struct threadobj *thobj)
>>>> +static inline void threadobj_cancel_corespec(struct threadobj
>>>> *thobj) /* thobj->lock held */
>>>>   {
>>>>   }
>>>>
>>>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>>>
>>>>   static void destroy_thread(struct threadobj *thobj)
>>>>   {
>>>> +       threadobj_cleanup_corespec(thobj);
>>>>          __RT(pthread_cond_destroy(&thobj->barrier));
>>>>          __RT(pthread_mutex_destroy(&thobj->lock));
>>>> -       threadobj_cleanup_corespec(thobj);
>>>>   }
>>>>
>>>>   void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
>>>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj
>>>> *thobj) /* thobj->lock held */
>>>>          int oldstate, ret = 0;
>>>>          sem_t *sem;
>>>>
>>>> -       __threadobj_check_locked(thobj);
>>>> -
>>>>          /*
>>>>           * We have to allocate the cancel sync sema4 in the main heap
>>>>           * dynamically, so that it always live in valid memory when we
>>>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj
>>>> *thobj) /* thobj->lock held */
>>>>                  __STD(sem_init(sem, sem_scope_attribute, 0));
>>>>
>>>>          thobj->cancel_sem = sem;
>>>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>>>
>>>>          /*
>>>>           * If the thread to delete is warming up, wait until it
>>>>
>>>> --
>>>> Philippe.
>>>
>>>
>>>
>>> I've tested the patch and it indeed fixes the problem. Thanks!
>>>
>>> For learning purpose, do you have an explanation as to why this happens?
>>> Shouldn't the cancelling stop the blocking read? Or not in case the
>>> read comes after the cancel?
>>>
>>
>> Over mercury, since we don't have thread-directed suspend/resume support
>> from the kernel, we force a particular thread into a suspended state by
>> sending it a notification signal via the async file I/O mechanism
>> (O_ASYNC). This way we can emulate a call like t_suspend() which
>> requires task-directed, immediate and unconditional action.
>>
>> The signal handler then waits on a read() call until it receives a
>> release message on the notifying I/O channel. But when that handler
>> runs, SIGCANCEL is blocked, so the cancellation point is temporarily
>> disabled. Conversely, when SIGCANCEL is first, the thread unwinds
>> properly.
>>
>> Looking at the idle_task's status when the issue happens:
>>
>> {rpm@cobalt} grep ^Sig /proc/6029/status
>> SigQ:    4/63608
>> SigPnd:    0000000000000000
>> ShdPnd:    0000000000000000
>> SigBlk:    0000020080000000
>>                  ^
>> __SIGRTMIN, aka nptl's SIGCANCEL
>>
>
> I'm unsure the spacing was right, so just in case:
> __SIGRTMIN is 32, so bit #31 is relevant (i.e. 0x80000000).
>

I checked the Blocked Signals and bit #31 is blocked indeed, very
useful to check this SigBlk.
Thanks for explaining!

> --
> Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-10 11:35             ` Kim De Mey
@ 2014-01-10 13:53               ` Philippe Gerum
  2014-01-11 10:35                 ` Kim De Mey
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2014-01-10 13:53 UTC (permalink / raw)
  To: Kim De Mey; +Cc: xenomai

On 01/10/2014 12:35 PM, Kim De Mey wrote:
> 2014/1/10 Philippe Gerum <rpm@xenomai.org>:
>> On 01/10/2014 10:25 AM, Philippe Gerum wrote:
>>>
>>> On 01/09/2014 11:29 AM, Kim De Mey wrote:
>>>>
>>>> 2014/1/9 Philippe Gerum <rpm@xenomai.org>:
>>>>>
>>>>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>>>>>
>>>>>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>
>>>>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>>>>
>>>>>>
>>>>>> Here are the backtraces:
>>>>>>
>>>>>> main thread:
>>>>>> (gdb) bt
>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>> #1  0x00b45e10 in threadobj_sleep () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #2  0x00b2b618 in tm_wkafter () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at
>>>>>> suspend_delete_easy.c:42
>>>>>>
>>>>>> first psos task:
>>>>>> (gdb) bt
>>>>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>>>>> #1  0x00b45c84 in threadobj_cancel () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #2  0x00b2ac44 in t_delete () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at suspend_delete_easy.c:22
>>>>>> #4  0x00b2a658 in task_trampoline () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>> pthread_create.c:299
>>>>>> #6  0x00c7df4c in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>
>>>>>> second psos task:
>>>>>> (gdb) bt
>>>>>> #0  0x00b6cbbc in read () from
>>>>>>
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>>>>>
>>>>>> #1  0x00b47bd8 in notifier_wait () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #2  0x00b460b0 in notifier_callback () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #3  0x00b47d30 in notifier_sighandler () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #4  <signal handler called>
>>>>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at
>>>>>> init.c:136
>>>>>> #6  <signal handler called>
>>>>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>> #8  0x00b45e10 in threadobj_sleep () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>
>>>>>> #9  0x00b2b618 in tm_wkafter () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at
>>>>>> suspend_delete_easy.c:9
>>>>>> #11 0x00b2a658 in task_trampoline () from
>>>>>>
>>>>>> /repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>
>>>>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>> pthread_create.c:299
>>>>>> #13 0x00c7df4c in __thread_start () from
>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>
>>>>>>
>>>>>> Invocation command line was
>>>>>>
>>>>>>      $ ./configure --target=mips64-buildroot-linux-gnu
>>>>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>>>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>>>>> --disable-gtk-doc --disable-doc --disable-docs --disable-documentation
>>>>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>>>>> --enable-shared --with-core=mercury
>>>>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>>>>> --enable-lores-clock
>>>>>>
>>>>>>
>>>>>> Is this what you were asking?
>>>>>
>>>>>
>>>>> Yes, thanks.
>>>>>
>>>>>>
>>>>>>
>>>>>> By the way, I forgot the mention the following test that I did:
>>>>>>
>>>>>> When the application is blocked I've done an "echo 1 >
>>>>>> /proc/pid/fd/xx",
>>>>>> where xx is the file descriptor that is being read from in the
>>>>>> notifier_wait().
>>>>>> This kinda emulates a t_resume.
>>>>>>
>>>>>> After doing this the t_delete gets finalized and the application
>>>>>> unblocks.
>>>>>>
>>>>>
>>>>> This should fix the race, until the notifier is reworked.
>>>>>
>>>>> diff --git a/include/copperplate/threadobj.h
>>>>> b/include/copperplate/threadobj.h
>>>>> index 184c711..218c274 100644
>>>>> --- a/include/copperplate/threadobj.h
>>>>> +++ b/include/copperplate/threadobj.h
>>>>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct
>>>>> threadobj_corespec *corespec,
>>>>>    #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user
>>>>> code. */
>>>>>    #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via
>>>>> threadobj_suspend(). */
>>>>>    #define __THREAD_S_SAFE                (1 << 7)        /* TCB
>>>>> release deferred. */
>>>>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process
>>>>> ongoing. */
>>>>>    #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode
>>>>> enabled. */
>>>>>    /*
>>>>>     * threadobj->run_state, locklessly updated by "current", merged
>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>> index a31479c..c54cdf8 100644
>>>>> --- a/lib/copperplate/threadobj.c
>>>>> +++ b/lib/copperplate/threadobj.c
>>>>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct
>>>>> notifier *nf)
>>>>>            * threadobj_suspend().
>>>>>            */
>>>>>           threadobj_lock(current);
>>>>> -       current->status |= __THREAD_S_SUSPENDED;
>>>>> -       threadobj_unlock(current);
>>>>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>> -       threadobj_lock(current);
>>>>> -       current->status &= ~__THREAD_S_SUSPENDED;
>>>>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>>>>> +               current->status |= __THREAD_S_SUSPENDED;
>>>>> +               threadobj_unlock(current);
>>>>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>> +               threadobj_lock(current);
>>>>> +               current->status &= ~__THREAD_S_SUSPENDED;
>>>>> +       }
>>>>>           threadobj_unlock(current);
>>>>>    }
>>>>>
>>>>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct
>>>>> threadobj *thobj)
>>>>>    {
>>>>>    }
>>>>>
>>>>> -static inline void threadobj_cancel_corespec(struct threadobj *thobj)
>>>>> +static inline void threadobj_cancel_corespec(struct threadobj
>>>>> *thobj) /* thobj->lock held */
>>>>>    {
>>>>>    }
>>>>>
>>>>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>>>>
>>>>>    static void destroy_thread(struct threadobj *thobj)
>>>>>    {
>>>>> +       threadobj_cleanup_corespec(thobj);
>>>>>           __RT(pthread_cond_destroy(&thobj->barrier));
>>>>>           __RT(pthread_mutex_destroy(&thobj->lock));
>>>>> -       threadobj_cleanup_corespec(thobj);
>>>>>    }
>>>>>
>>>>>    void threadobj_destroy(struct threadobj *thobj) /* thobj->lock free */
>>>>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj
>>>>> *thobj) /* thobj->lock held */
>>>>>           int oldstate, ret = 0;
>>>>>           sem_t *sem;
>>>>>
>>>>> -       __threadobj_check_locked(thobj);
>>>>> -
>>>>>           /*
>>>>>            * We have to allocate the cancel sync sema4 in the main heap
>>>>>            * dynamically, so that it always live in valid memory when we
>>>>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj
>>>>> *thobj) /* thobj->lock held */
>>>>>                   __STD(sem_init(sem, sem_scope_attribute, 0));
>>>>>
>>>>>           thobj->cancel_sem = sem;
>>>>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>>>>
>>>>>           /*
>>>>>            * If the thread to delete is warming up, wait until it
>>>>>
>>>>> --
>>>>> Philippe.
>>>>
>>>>
>>>>
>>>> I've tested the patch and it indeed fixes the problem. Thanks!
>>>>
>>>> For learning purpose, do you have an explanation as to why this happens?
>>>> Shouldn't the cancelling stop the blocking read? Or not in case the
>>>> read comes after the cancel?
>>>>
>>>
>>> Over mercury, since we don't have thread-directed suspend/resume support
>>> from the kernel, we force a particular thread into a suspended state by
>>> sending it a notification signal via the async file I/O mechanism
>>> (O_ASYNC). This way we can emulate a call like t_suspend() which
>>> requires task-directed, immediate and unconditional action.
>>>
>>> The signal handler then waits on a read() call until it receives a
>>> release message on the notifying I/O channel. But when that handler
>>> runs, SIGCANCEL is blocked, so the cancellation point is temporarily
>>> disabled. Conversely, when SIGCANCEL is first, the thread unwinds
>>> properly.
>>>
>>> Looking at the idle_task's status when the issue happens:
>>>
>>> {rpm@cobalt} grep ^Sig /proc/6029/status
>>> SigQ:    4/63608
>>> SigPnd:    0000000000000000
>>> ShdPnd:    0000000000000000
>>> SigBlk:    0000020080000000
>>>                   ^
>>> __SIGRTMIN, aka nptl's SIGCANCEL
>>>
>>
>> I'm unsure the spacing was right, so just in case:
>> __SIGRTMIN is 32, so bit #31 is relevant (i.e. 0x80000000).
>>
>
> I checked the Blocked Signals and bit #31 is blocked indeed, very
> useful to check this SigBlk.
> Thanks for explaining!
>

For the sake of completeness, this issue happens when SIGCANCEL
and the notification signal pile up, which explains why SIGCANCEL ends 
up being masked when read() is called by the notification handler.
This is confirmed by the backtrace you sent:

#0  0x00b6cbbc in read () from 
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
#1  0x00b47bd8 in notifier_wait () from 
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#2  0x00b460b0 in notifier_callback () from 
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#3  0x00b47d30 in notifier_sighandler () from 
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
#4  <signal handler called>
#5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
#6  <signal handler called>
#7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
req=<optimized out>, rem=<optimized out>)
      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51

Which gives:

1. idle_task receives SIGCANCEL, unblocked from tm_wkafter(), starts 
running sigcancel_handler() from the nptl code (frames #7 -> #5). The 
kernel has blocked SIGCANCEL on entry to the handler to prevent 
recursion from the same signal.

2. before sigcancel_handler() unwinds, it is preempted by the 
notification signal (NOTIFYSIG), which eventually blocks in read() 
(frame #4 -> #0). This happens before the thread finalizer is called.

So we end up sleeping on a read(), which impedes the releasing of the 
sema4 cancel_sync() waits on.

Randomness was introduced by the timing of events: we have to see both 
SIGCANCEL and NOTIFYSIG pending at the same time for the target thread. 
In that case, the kernel first delivers SIGCANCEL, then NOTIFYSIG, by 
order of priority (__SIGRTMIN < SIGRTMIN + 8), creating a pile up 
situation. Raising the priority of the t_suspend()+t_delete() caller 
compared to idle_task makes the issue more likely, since the later will 
wake up last for handling both signals. Nice bug.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete
  2014-01-10 13:53               ` Philippe Gerum
@ 2014-01-11 10:35                 ` Kim De Mey
  0 siblings, 0 replies; 10+ messages in thread
From: Kim De Mey @ 2014-01-11 10:35 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Jan 10, 2014 2:52 PM, "Philippe Gerum" <rpm@xenomai.org> wrote:
>
> On 01/10/2014 12:35 PM, Kim De Mey wrote:
>>
>> 2014/1/10 Philippe Gerum <rpm@xenomai.org>:
>>>
>>> On 01/10/2014 10:25 AM, Philippe Gerum wrote:
>>>>
>>>>
>>>> On 01/09/2014 11:29 AM, Kim De Mey wrote:
>>>>>
>>>>>
>>>>> 2014/1/9 Philippe Gerum <rpm@xenomai.org>:
>>>>>>
>>>>>>
>>>>>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>>>>>>
>>>>>>>
>>>>>>> 2014/1/8 Philippe Gerum <rpm@xenomai.org>:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here are the backtraces:
>>>>>>>
>>>>>>> main thread:
>>>>>>> (gdb) bt
>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized
out>,
>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>>> #1  0x00b45e10 in threadobj_sleep () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #2  0x00b2b618 in tm_wkafter () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at
>>>>>>> suspend_delete_easy.c:42
>>>>>>>
>>>>>>> first psos task:
>>>>>>> (gdb) bt
>>>>>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>>>>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>>>>>> #1  0x00b45c84 in threadobj_cancel () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #2  0x00b2ac44 in t_delete () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at
suspend_delete_easy.c:22
>>>>>>> #4  0x00b2a658 in task_trampoline () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>>> pthread_create.c:299
>>>>>>> #6  0x00c7df4c in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>
>>>>>>> second psos task:
>>>>>>> (gdb) bt
>>>>>>> #0  0x00b6cbbc in read () from
>>>>>>>
>>>>>>>
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>>>>>>
>>>>>>> #1  0x00b47bd8 in notifier_wait () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #2  0x00b460b0 in notifier_callback () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #3  0x00b47d30 in notifier_sighandler () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #4  <signal handler called>
>>>>>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at
>>>>>>> init.c:136
>>>>>>> #6  <signal handler called>
>>>>>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized
out>,
>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>>> #8  0x00b45e10 in threadobj_sleep () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #9  0x00b2b618 in tm_wkafter () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at
>>>>>>> suspend_delete_easy.c:9
>>>>>>> #11 0x00b2a658 in task_trampoline () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>>> pthread_create.c:299
>>>>>>> #13 0x00c7df4c in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>
>>>>>>>
>>>>>>> Invocation command line was
>>>>>>>
>>>>>>>      $ ./configure --target=mips64-buildroot-linux-gnu
>>>>>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>>>>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>>>>>> --disable-gtk-doc --disable-doc --disable-docs
--disable-documentation
>>>>>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>>>>>> --enable-shared --with-core=mercury
>>>>>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>>>>>> --enable-lores-clock
>>>>>>>
>>>>>>>
>>>>>>> Is this what you were asking?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, thanks.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> By the way, I forgot the mention the following test that I did:
>>>>>>>
>>>>>>> When the application is blocked I've done an "echo 1 >
>>>>>>> /proc/pid/fd/xx",
>>>>>>> where xx is the file descriptor that is being read from in the
>>>>>>> notifier_wait().
>>>>>>> This kinda emulates a t_resume.
>>>>>>>
>>>>>>> After doing this the t_delete gets finalized and the application
>>>>>>> unblocks.
>>>>>>>
>>>>>>
>>>>>> This should fix the race, until the notifier is reworked.
>>>>>>
>>>>>> diff --git a/include/copperplate/threadobj.h
>>>>>> b/include/copperplate/threadobj.h
>>>>>> index 184c711..218c274 100644
>>>>>> --- a/include/copperplate/threadobj.h
>>>>>> +++ b/include/copperplate/threadobj.h
>>>>>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct
>>>>>> threadobj_corespec *corespec,
>>>>>>    #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user
>>>>>> code. */
>>>>>>    #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via
>>>>>> threadobj_suspend(). */
>>>>>>    #define __THREAD_S_SAFE                (1 << 7)        /* TCB
>>>>>> release deferred. */
>>>>>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process
>>>>>> ongoing. */
>>>>>>    #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode
>>>>>> enabled. */
>>>>>>    /*
>>>>>>     * threadobj->run_state, locklessly updated by "current", merged
>>>>>> diff --git a/lib/copperplate/threadobj.c
b/lib/copperplate/threadobj.c
>>>>>> index a31479c..c54cdf8 100644
>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct
>>>>>> notifier *nf)
>>>>>>            * threadobj_suspend().
>>>>>>            */
>>>>>>           threadobj_lock(current);
>>>>>> -       current->status |= __THREAD_S_SUSPENDED;
>>>>>> -       threadobj_unlock(current);
>>>>>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>>> -       threadobj_lock(current);
>>>>>> -       current->status &= ~__THREAD_S_SUSPENDED;
>>>>>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>>>>>> +               current->status |= __THREAD_S_SUSPENDED;
>>>>>> +               threadobj_unlock(current);
>>>>>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>>> +               threadobj_lock(current);
>>>>>> +               current->status &= ~__THREAD_S_SUSPENDED;
>>>>>> +       }
>>>>>>           threadobj_unlock(current);
>>>>>>    }
>>>>>>
>>>>>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct
>>>>>> threadobj *thobj)
>>>>>>    {
>>>>>>    }
>>>>>>
>>>>>> -static inline void threadobj_cancel_corespec(struct threadobj
*thobj)
>>>>>> +static inline void threadobj_cancel_corespec(struct threadobj
>>>>>> *thobj) /* thobj->lock held */
>>>>>>    {
>>>>>>    }
>>>>>>
>>>>>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>>>>>
>>>>>>    static void destroy_thread(struct threadobj *thobj)
>>>>>>    {
>>>>>> +       threadobj_cleanup_corespec(thobj);
>>>>>>           __RT(pthread_cond_destroy(&thobj->barrier));
>>>>>>           __RT(pthread_mutex_destroy(&thobj->lock));
>>>>>> -       threadobj_cleanup_corespec(thobj);
>>>>>>    }
>>>>>>
>>>>>>    void threadobj_destroy(struct threadobj *thobj) /* thobj->lock
free */
>>>>>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj
>>>>>> *thobj) /* thobj->lock held */
>>>>>>           int oldstate, ret = 0;
>>>>>>           sem_t *sem;
>>>>>>
>>>>>> -       __threadobj_check_locked(thobj);
>>>>>> -
>>>>>>           /*
>>>>>>            * We have to allocate the cancel sync sema4 in the main
heap
>>>>>>            * dynamically, so that it always live in valid memory
when we
>>>>>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj
>>>>>> *thobj) /* thobj->lock held */
>>>>>>                   __STD(sem_init(sem, sem_scope_attribute, 0));
>>>>>>
>>>>>>           thobj->cancel_sem = sem;
>>>>>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>>>>>
>>>>>>           /*
>>>>>>            * If the thread to delete is warming up, wait until it
>>>>>>
>>>>>> --
>>>>>> Philippe.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I've tested the patch and it indeed fixes the problem. Thanks!
>>>>>
>>>>> For learning purpose, do you have an explanation as to why this
happens?
>>>>> Shouldn't the cancelling stop the blocking read? Or not in case the
>>>>> read comes after the cancel?
>>>>>
>>>>
>>>> Over mercury, since we don't have thread-directed suspend/resume
support
>>>> from the kernel, we force a particular thread into a suspended state by
>>>> sending it a notification signal via the async file I/O mechanism
>>>> (O_ASYNC). This way we can emulate a call like t_suspend() which
>>>> requires task-directed, immediate and unconditional action.
>>>>
>>>> The signal handler then waits on a read() call until it receives a
>>>> release message on the notifying I/O channel. But when that handler
>>>> runs, SIGCANCEL is blocked, so the cancellation point is temporarily
>>>> disabled. Conversely, when SIGCANCEL is first, the thread unwinds
>>>> properly.
>>>>
>>>> Looking at the idle_task's status when the issue happens:
>>>>
>>>> {rpm@cobalt} grep ^Sig /proc/6029/status
>>>> SigQ:    4/63608
>>>> SigPnd:    0000000000000000
>>>> ShdPnd:    0000000000000000
>>>> SigBlk:    0000020080000000
>>>>                   ^
>>>> __SIGRTMIN, aka nptl's SIGCANCEL
>>>>
>>>
>>> I'm unsure the spacing was right, so just in case:
>>> __SIGRTMIN is 32, so bit #31 is relevant (i.e. 0x80000000).
>>>
>>
>> I checked the Blocked Signals and bit #31 is blocked indeed, very
>> useful to check this SigBlk.
>> Thanks for explaining!
>>
>
> For the sake of completeness, this issue happens when SIGCANCEL
> and the notification signal pile up, which explains why SIGCANCEL ends up
being masked when read() is called by the notification handler.
> This is confirmed by the backtrace you sent:
>
>
> #0  0x00b6cbbc in read () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
> #1  0x00b47bd8 in notifier_wait () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x00b460b0 in notifier_callback () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #3  0x00b47d30 in notifier_sighandler () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #4  <signal handler called>
> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
> #6  <signal handler called>
> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> req=<optimized out>, rem=<optimized out>)
>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>
> Which gives:
>
> 1. idle_task receives SIGCANCEL, unblocked from tm_wkafter(), starts
running sigcancel_handler() from the nptl code (frames #7 -> #5). The
kernel has blocked SIGCANCEL on entry to the handler to prevent recursion
from the same signal.
>
> 2. before sigcancel_handler() unwinds, it is preempted by the
notification signal (NOTIFYSIG), which eventually blocks in read() (frame
#4 -> #0). This happens before the thread finalizer is called.
>
> So we end up sleeping on a read(), which impedes the releasing of the
sema4 cancel_sync() waits on.
>
> Randomness was introduced by the timing of events: we have to see both
SIGCANCEL and NOTIFYSIG pending at the same time for the target thread. In
that case, the kernel first delivers SIGCANCEL, then NOTIFYSIG, by order of
priority (__SIGRTMIN < SIGRTMIN + 8), creating a pile up situation. Raising
the priority of the t_suspend()+t_delete() caller compared to idle_task
makes the issue more likely, since the later will wake up last for handling
both signals. Nice bug.
>
> --
> Philippe.

That explains nicely why it did not always occur, which I was still
wondering. Interesting!
A nice bug it was.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-01-11 10:35 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-08  9:25 [Xenomai] Xenomai-forge pSOS app hangs on t_suspend + t_delete Kim De Mey
2014-01-08 10:04 ` Philippe Gerum
2014-01-08 12:23   ` Kim De Mey
2014-01-09  9:46     ` Philippe Gerum
2014-01-09 10:29       ` Kim De Mey
2014-01-10  9:25         ` Philippe Gerum
2014-01-10  9:30           ` Philippe Gerum
2014-01-10 11:35             ` Kim De Mey
2014-01-10 13:53               ` Philippe Gerum
2014-01-11 10:35                 ` Kim De Mey

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.