* [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core @ 2016-02-10 18:41 Charles Kiorpes 2016-02-10 20:21 ` Philippe Gerum 0 siblings, 1 reply; 10+ messages in thread From: Charles Kiorpes @ 2016-02-10 18:41 UTC (permalink / raw) To: xenomai Hello, I am having some difficulty getting two processes to signal each other on an AM572x EVM. I have two small sample applications: - app1 : creates an event "MYTEST" using rt_event_create(), clears the event, and then rt_event_wait()s for TM_INFINITE. When the event is signaled, app1 rt_event_clear()s the event and then loops back to the rt_event_wait() - app2 : binds to the "MYTEST" event, sleeps for one second, rt_event_signal()s the event, and then loops back to the one second sleep I am using the Mercury core, built natively, configured with: xenomai-3.0.1/configure \ --with-core=mercury \ --enable-debug \ --enable-pshared \ --disable-registry \ --enable-smp \ --enable-condvar-workaround I launch both applications in different terminals in this order: # sudo ./app1 --session="Test" # sudo ./app2 --session="Test" The following behavior results: - app1 claims to have successfully created the event - app2 claims to have successfully bound to the event - app2 loops correctly, signalling the event once per second - app1 never wakes up from the first rt_event_wait() I am running the 4.1.15-ti-rt kernel with full preemption. On an x86 virtual machine with a vanilla 4.2.0-27-generic kernel, the applications perform as expected (app1 wakes up when app2 signals the event). Any advice as to where to look for further debugging would be greatly appreciated. Thanks, Charles ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-10 18:41 [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core Charles Kiorpes @ 2016-02-10 20:21 ` Philippe Gerum 2016-02-11 12:57 ` Charles Kiorpes 0 siblings, 1 reply; 10+ messages in thread From: Philippe Gerum @ 2016-02-10 20:21 UTC (permalink / raw) To: Charles Kiorpes, xenomai On 02/10/2016 07:41 PM, Charles Kiorpes wrote: > Hello, > I am having some difficulty getting two processes to signal each other on > an AM572x EVM. > I have two small sample applications: > > - app1 : creates an event "MYTEST" using rt_event_create(), clears the > event, and then rt_event_wait()s for TM_INFINITE. When the event is > signaled, app1 rt_event_clear()s the event and then loops back to the > rt_event_wait() > > - app2 : binds to the "MYTEST" event, sleeps for one second, > rt_event_signal()s the event, and then loops back to the one second sleep > > I am using the Mercury core, built natively, configured with: > xenomai-3.0.1/configure \ > --with-core=mercury \ > --enable-debug \ > --enable-pshared \ > --disable-registry \ > --enable-smp \ > --enable-condvar-workaround > > I launch both applications in different terminals in this order: > # sudo ./app1 --session="Test" > # sudo ./app2 --session="Test" > > The following behavior results: > - app1 claims to have successfully created the event > - app2 claims to have successfully bound to the event > - app2 loops correctly, signalling the event once per second > - app1 never wakes up from the first rt_event_wait() > > I am running the 4.1.15-ti-rt kernel with full preemption. > > On an x86 virtual machine with a vanilla 4.2.0-27-generic kernel, the > applications perform as expected (app1 wakes up when app2 signals the > event). > > Any advice as to where to look for further debugging would be greatly > appreciated. > Over Mercury, the event synchronization is based on a common mutex+condvar pair obtained from the underlying *libc basically. The rest of the implementation for this mechanism is shared with the Cobalt side, except the workaround for the PI breakage the glibc suffers, which is Mercury-only. - does omitting --enable-condvar-workaround change the behavior? - does this issue also happen when both threads belong to the same process (i.e. is it a session-based processing issue?) - does the code check for each and every return code, particularly the return status of rt_event_signal()? Attaching your self-contained test case may help finding out where the problem stands. - after enabling the registry (--enable-registry), is the reported event value consistent when looking at /var/run/xenomai/<user>/Test/alchemy/events/<event_name>? Also, you have a test suite available for Alchemy, which can run on either Mercury/Cobalt cores unmodified. There is a test exercising events there. e.g.: $ cd lib/alchemy/testsuite $ make XENO_CONFIG=/your/xenomai/install/bin/xeno-config test NOTE: The "test" rule in the Makefile assumes that the current shell user is sudoer. -- Philippe. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-10 20:21 ` Philippe Gerum @ 2016-02-11 12:57 ` Charles Kiorpes 2016-02-12 10:43 ` Philippe Gerum 2016-02-12 10:55 ` Philippe Gerum 0 siblings, 2 replies; 10+ messages in thread From: Charles Kiorpes @ 2016-02-11 12:57 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On Wed, Feb 10, 2016 at 3:21 PM, Philippe Gerum <rpm@xenomai.org> wrote: > On 02/10/2016 07:41 PM, Charles Kiorpes wrote: > > Hello, > > I am having some difficulty getting two processes to signal each other on > > an AM572x EVM. > > I have two small sample applications: > > > > - app1 : creates an event "MYTEST" using rt_event_create(), clears the > > event, and then rt_event_wait()s for TM_INFINITE. When the event is > > signaled, app1 rt_event_clear()s the event and then loops back to the > > rt_event_wait() > > > > - app2 : binds to the "MYTEST" event, sleeps for one second, > > rt_event_signal()s the event, and then loops back to the one second sleep > > > > I am using the Mercury core, built natively, configured with: > > xenomai-3.0.1/configure \ > > --with-core=mercury \ > > --enable-debug \ > > --enable-pshared \ > > --disable-registry \ > > --enable-smp \ > > --enable-condvar-workaround > > > > I launch both applications in different terminals in this order: > > # sudo ./app1 --session="Test" > > # sudo ./app2 --session="Test" > > > > The following behavior results: > > - app1 claims to have successfully created the event > > - app2 claims to have successfully bound to the event > > - app2 loops correctly, signalling the event once per second > > - app1 never wakes up from the first rt_event_wait() > > > > I am running the 4.1.15-ti-rt kernel with full preemption. > > > > On an x86 virtual machine with a vanilla 4.2.0-27-generic kernel, the > > applications perform as expected (app1 wakes up when app2 signals the > > event). > > > > Any advice as to where to look for further debugging would be greatly > > appreciated. > > > > Over Mercury, the event synchronization is based on a common > mutex+condvar pair obtained from the underlying *libc basically. The > rest of the implementation for this mechanism is shared with the Cobalt > side, except the workaround for the PI breakage the glibc suffers, which > is Mercury-only. > > - does omitting --enable-condvar-workaround change the behavior? > Omitting --enable-condvar-workaround has no effect on the behavior of this test app. > - does this issue also happen when both threads belong to the same > process (i.e. is it a session-based processing issue?) > I am unsure, I will construct a test case for this and reply with more info. > - does the code check for each and every return code, particularly the > return status of rt_event_signal()? Attaching your self-contained test > case may help finding out where the problem stands. > The code checks the return value of all alchemy API calls and printfs+returns if an error is encountered. I have attached app1.c and app2.c for review. > - after enabling the registry (--enable-registry), is the reported event > value consistent when looking at > /var/run/xenomai/<user>/Test/alchemy/events/<event_name>? > > The event shows the correct signaled value under /var/run/xenomai/root/Test/<app1_pid>/alchemy/events/MYTEST. There are no extra events present, for example under app2's pid. > Also, you have a test suite available for Alchemy, which can run on > either Mercury/Cobalt cores unmodified. There is a test exercising > events there. > > e.g.: > > $ cd lib/alchemy/testsuite > $ make XENO_CONFIG=/your/xenomai/install/bin/xeno-config test > > NOTE: The "test" rule in the Makefile assumes that the current shell > user is sudoer. > > I attempted to run several tests: 'task-1', 'event-1', and 'mutex-1'. Each of these hung indefinitely. A gdb trace indicated that they were hanging on __libc_do_syscall() within __pthread_cond_wait() within threadobj_cond_wait(). I have attached the full backtrace from mutex-1 as mutex-1_bt.txt I will be trying combinations of kernels and configuration options throughout the course of the day. > -- > Philippe. Thanks, -Charles -------------- next part -------------- (gdb) bt full #0 __libc_do_syscall () at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:43 No locals. #1 0xb6f67204 in __pthread_cond_wait (cond=cond@entry=0xb6cb0194, mutex=mutex@entry=0xb6cb009c) at pthread_cond_wait.c:178 _a2tmp = 11 _a2 = 11 _nametmp = 240 _a3tmp = 3 _a3 = 3 _a1 = -1228209768 _v1tmp = -1228210020 _a4tmp = 0 _a4 = 0 _v1 = -1228210020 _name = 240 __op = 11 futex_val = 3 buffer = {__routine = 0xb6f66f1d <__condvar_cleanup>, __arg = 0xbefff9e8, __canceltype = -1090520544, __prev = 0x0} cbuffer = {oldtype = 1, cond = 0xb6cb0194, mutex = 0xb6cb009c, bc_seq = 0} err = <optimized out> pshared = 128 pi_flag = 1 val = <optimized out> seq = 1 #2 0xb6fa0f8c in threadobj_cond_wait (cond=cond@entry=0xb6cb0194, lock=lock@entry=0xb6cb009c) at ../../../xenomai-3.0.1/lib/copperplate/threadobj.c:959 ret = <optimized out> #3 0xb6fa0fe4 in wait_on_barrier (thobj=thobj@entry=0xb6cb0094, mask=mask@entry=16) at ../../../xenomai-3.0.1/lib/copperplate/threadobj.c:1210 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1228232340, -1228210028, 69993, -1090520152, 0, 0, -1224740864, 0, 688021297, 554153222, 0 <repeats 16 times>, -1225393808, -1224796768, 5, 0, 0, -1224757760, 0, -1224758200, -1, -1225356317, -1225151504, -1224798208, -1090520212, -1224800400, -1224806400, -1225019392, 110988, 0, 1, -1224740864, 0, -1224861195, -1224800400, 0, -1224806400, -1225019392, 111420, 0, 1, 1, 0, 0, 0, 0, -1225030060, -1228232340, -1228210028, 69993}, __mask_was_saved = 0}}, __pad = {0xbefffbe8, 0x0, 0xb6cb0094, 0x11169 <task_b>}} __not_first_call = <optimized out> status = <optimized out> #4 0xb6fa15d6 in threadobj_start (thobj=thobj@entry=0xb6cb0094) at ../../../xenomai-3.0.1/lib/copperplate/threadobj.c:1247 ret = 0 oldstate = 1 #5 0xb6fc2ee8 in rt_task_start (task=<optimized out>, entry=0x11169 <task_b>, arg=0x0) at ../../../xenomai-3.0.1/lib/alchemy/task.c:634 tcb = 0xb6caff6c ret = 69993 #6 0x00011562 in main (argc=1, argv=0x22538) at mutex-1.c:130 ret = 0 -------------- next part -------------- A non-text attachment was scrubbed... Name: app1.c Type: text/x-csrc Size: 1076 bytes Desc: not available URL: <http://xenomai.org/pipermail/xenomai/attachments/20160211/5df88846/attachment.c> -------------- next part -------------- A non-text attachment was scrubbed... Name: app2.c Type: text/x-csrc Size: 759 bytes Desc: not available URL: <http://xenomai.org/pipermail/xenomai/attachments/20160211/5df88846/attachment-0001.c> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-11 12:57 ` Charles Kiorpes @ 2016-02-12 10:43 ` Philippe Gerum 2016-02-12 14:08 ` Charles Kiorpes 2016-02-12 10:55 ` Philippe Gerum 1 sibling, 1 reply; 10+ messages in thread From: Philippe Gerum @ 2016-02-12 10:43 UTC (permalink / raw) To: Charles Kiorpes; +Cc: xenomai On 02/11/2016 01:57 PM, Charles Kiorpes wrote: > > I attempted to run several tests: 'task-1', 'event-1', and 'mutex-1'. > Each of these hung indefinitely. A gdb trace indicated that they were > hanging on __libc_do_syscall() within __pthread_cond_wait() within > threadobj_cond_wait(). > > I have attached the full backtrace from mutex-1 as mutex-1_bt.txt > Ok, if the test suite does not pass, something is badly wrong, so we should investigate that hang issue before anything else. The backtrace reveals that copperplate cannot handshake with a newly spawned task, this is the purpose of the wait_on_barrier() call over the context of rt_task_start(). That barrier should be signaled by a call to threadobj_notify_entry() from the internal trampoline code of the emerging thread (task_entry() in alchemy/task.c). - maybe task_prologue_2() (alchemy/task.c) which is called earlier hangs indefinitely, and therefore prevents threadobj_notify_entry() from running? - maybe the new thread does not even start for some reason, are we sure task_entry() is reached (e.g. do we hit a breakpoint there?) Could you inspect the current thread list under gdb when the program hangs? Also, I would recommend to enable full debugging for now (--enable-debug=full) to get accurate line information, assuming the issue should still show up with a non-optimized code. Hopefully. -- Philippe. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-12 10:43 ` Philippe Gerum @ 2016-02-12 14:08 ` Charles Kiorpes 2016-02-12 15:25 ` Philippe Gerum 0 siblings, 1 reply; 10+ messages in thread From: Charles Kiorpes @ 2016-02-12 14:08 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On Fri, Feb 12, 2016 at 5:43 AM, Philippe Gerum <rpm@xenomai.org> wrote: > On 02/11/2016 01:57 PM, Charles Kiorpes wrote: > > > > I attempted to run several tests: 'task-1', 'event-1', and 'mutex-1'. > > Each of these hung indefinitely. A gdb trace indicated that they were > > hanging on __libc_do_syscall() within __pthread_cond_wait() within > > threadobj_cond_wait(). > > > > I have attached the full backtrace from mutex-1 as mutex-1_bt.txt > > > > Ok, if the test suite does not pass, something is badly wrong, so we > should investigate that hang issue before anything else. > > The backtrace reveals that copperplate cannot handshake with a newly > spawned task, this is the purpose of the wait_on_barrier() call over the > context of rt_task_start(). That barrier should be signaled by a call to > threadobj_notify_entry() from the internal trampoline code of the > emerging thread (task_entry() in alchemy/task.c). > > - maybe task_prologue_2() (alchemy/task.c) which is called earlier hangs > indefinitely, and therefore prevents threadobj_notify_entry() from running? > > - maybe the new thread does not even start for some reason, are we sure > task_entry() is reached (e.g. do we hit a breakpoint there?) > > Could you inspect the current thread list under gdb when the program hangs? > > Also, I would recommend to enable full debugging for now > (--enable-debug=full) to get accurate line information, assuming the > issue should still show up with a non-optimized code. Hopefully. > > -- > Philippe. > I ran the task-1 test under gdb with this Xenomai configuration: --with-core=mercury \ --enable-debug=full \ --enable-registry \ --enable-smp \ --enable-pshared \ --enable-condvar-workaround It appears that the new thread is being launched, and getting stuck in threadobj_wait_start() within task_prologue_2(), as you indicated might be the case. I have attached the thread list and a full backtrace for each thread (in separate files by thread id). As per your other message, my kernel configs all include CONFIG_FUTEX. I have tried glibc 2.19 and 2.21, as well as RT patched and vanilla kernels. Interestingly, when I removed --enable-pshared from my configuration, the task-1 test passed. - Charles -------------- next part -------------- (gdb) thread 1 [Switching to thread 1 (Thread 0xb6ff0000 (LWP 14056))] #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 46 in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S (gdb) bt full #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 No locals. #1 0xb6f5f6f2 in __pthread_cond_wait (cond=0xb6caa194, mutex=0xb6caa09c) at pthread_cond_wait.c:177 _a2tmp = 11 _a2 = 11 _nametmp = 240 _a3tmp = 3 _a3 = 3 _a1 = -1228234344 _v1tmp = -1228234596 _a4tmp = 0 _a1tmp = -1228234344 _a4 = 0 _v1 = -1228234596 _name = 240 __ret = <optimized out> futex_val = 3 buffer = {__routine = 0xb6f5f3e1 <__condvar_cleanup>, __arg = 0xbefffa00, __canceltype = -1225039472, __prev = 0x0} cbuffer = {oldtype = 1, cond = 0xb6caa194, mutex = 0xb6caa09c, bc_seq = 0} err = <optimized out> pshared = <optimized out> pi_flag = 1 val = <optimized out> seq = <optimized out> #2 0xb6f9cd9a in threadobj_cond_wait (cond=0xb6caa194, lock=0xb6caa09c) at ../../../xenomai-3/lib/copperplate/threadobj.c:959 ret = -1228234604 #3 0xb6f9d348 in wait_on_barrier (thobj=0xb6caa094, mask=16) at ../../../xenomai-3/lib/copperplate/threadobj.c:1210 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1446288694, -1580381599, -1090520024, 0, 0, -1090520496, 0, 0, -1224740864, 0 <repeats 17 times>, 2, 0, 5, 0, 1, -1224758200, -1659757782, -1224874387, 0, -1224738124, -1224757760, -1090520312, -1224757760, 0, -1, -1224796312, -1225176080, -1224794112, -1225021852, -1224802304, -1224801088, 1, 0, -1224859353, -1224796312, 1, 5, 0, 0, 1, -1225039472, 0, -1090520024, 0, -1090520232, -1090520024, 0, 0}, __mask_was_saved = 0}}, __pad = {0xbefffc28, 0x0, 0x0, 0xb6ca496c}} __cancel_routine = 0xb6f5e225 <__GI___pthread_mutex_unlock> __cancel_arg = 0xb6caa09c __not_first_call = 0 oldstate = 0 status = 73 #4 0xb6f9d3fa in threadobj_start (thobj=0xb6caa094) at ../../../xenomai-3/lib/copperplate/threadobj.c:1247 current = 0xb6ca496c ret = 0 oldstate = 1 #5 0xb6fc2cc8 in rt_task_start (task=0x20ed0 <t_main>, entry=0x10985 <main_task>, arg=0xdeadbeef) at ../../../xenomai-3/lib/alchemy/task.c:634 tcb = 0xb6ca9f6c svc = {cancel_type = -559038737} ret = 67973 #6 0x00010a62 in main (argc=1, argv=0x21538) at task-1.c:26 ret = 0 -------------- next part -------------- (gdb) thread 2 [Switching to thread 2 (Thread 0xb6ca1460 (LWP 14063))] #0 0xb6f61cbe in read () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) bt full #0 0xb6f61cbe in read () at ../sysdeps/unix/syscall-template.S:81 No locals. #1 0xb6f1936e in ?? () from /lib/arm-linux-gnueabihf/libfuse.so.2 No symbol table info available. Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) -------------- next part -------------- (gdb) thread 3 [Switching to thread 3 (Thread 0xb6c6f460 (LWP 14065))] #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 46 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory. (gdb) bt full #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 No locals. #1 0xb6f5f6f2 in __pthread_cond_wait (cond=0xb6caa194, mutex=0xb6caa09c) at pthread_cond_wait.c:177 _a2tmp = 11 _a2 = 11 _nametmp = 240 _a3tmp = 1 _a3 = 1 _a1 = -1228234344 _v1tmp = -1228234596 _a4tmp = 0 _a1tmp = -1228234344 _a4 = 0 _v1 = -1228234596 _name = 240 __ret = <optimized out> futex_val = 1 buffer = {__routine = 0xb6f5f3e1 <__condvar_cleanup>, __arg = 0xb6c6ec18, __canceltype = -1225039472, __prev = 0x0} cbuffer = {oldtype = 1, cond = 0xb6caa194, mutex = 0xb6caa09c, bc_seq = 0} err = <optimized out> pshared = <optimized out> pi_flag = 1 val = <optimized out> seq = <optimized out> #2 0xb6f9cd9a in threadobj_cond_wait (cond=0xb6caa194, lock=0xb6caa09c) at ../../../xenomai-3/lib/copperplate/threadobj.c:959 ret = -1228234596 #3 0xb6f9d348 in wait_on_barrier (thobj=0xb6caa094, mask=5) at ../../../xenomai-3/lib/copperplate/threadobj.c:1210 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-1577945870, -1580381599, -1228476860, -1090520124, 0, -1228477336, -1228476512, -1224802304, 0, -1228474732, 0 <repeats 16 times>, 1, -1224758200, -1228477152, -1228477004, 0, -1228477120, -1224757760, -1228477168, -1224757760, -1224738124, -1, 0, -1225177696, -1224794112, 1, -1224794112, 0, 0, 0, 0, -1224793672, -1228477112, -1224793672, 0, -1, 0, -1225027584, 119632, -1225021852, 0, -1228234596, -1090520124, 0, -1228477040, -1228476512, -1224802304, 0, -1225404053}, __mask_was_saved = 0}}, __pad = {0xb6c6ee80, 0x0, 0x0, 0xb6c6ed90}} __cancel_routine = 0xb6f5e225 <__GI___pthread_mutex_unlock> __cancel_arg = 0xb6caa09c __not_first_call = 0 oldstate = 0 status = 8 #4 0xb6f9d45e in threadobj_wait_start () at ../../../xenomai-3/lib/copperplate/threadobj.c:1276 current = 0xb6caa094 status = -1228234900 #5 0xb6fc2728 in task_prologue_2 (tcb=0xb6ca9f6c) at ../../../xenomai-3/lib/alchemy/task.c:211 ret = 0 #6 0xb6fc276e in task_entry (arg=0xb6ca9f6c) at ../../../xenomai-3/lib/alchemy/task.c:227 __ret = -1228476896 tcb = 0xb6ca9f6c svc = {cancel_type = 1} ret = -1228475296 __FUNCTION__ = "task_entry" #7 0xb6f9a288 in thread_trampoline (arg=0xbefffb94) at ../../../xenomai-3/lib/copperplate/internal.c:251 cta = 0xbefffb94 _cta = {stacksize = 0, detachstate = 1, policy = 1, param_ex = {__sched_priority = 99, sched_u = {rr = {__sched_rr_quantum = {tv_sec = -1225255424, tv_nsec = 0}}}}, prologue = 0xb6fc26d1 <task_prologue_1>, run = 0xb6fc275d <task_entry>, arg = 0xb6ca9f6c, __reserved = {status = -38, warm = {__size = "\001\000\000\000\000\000\000\000\060m\376\266\001\000\000", __align = 1}, released = 0x10cac}} released = {__size = '\000' <repeats 15 times>, __align = 0} ret = 0 __FUNCTION__ = "thread_trampoline" #8 0xb6f5b424 in start_thread (arg=0x0) at pthread_create.c:335 pd = 0x0 unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-1577945366, -1581175653, -1228475296, -1090520280, 0, -1228476816, -1228476512, -1224802304, 0, -1228474732, 0 <repeats 54 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #9 0xb6eb243c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6 No locals. Backtrace stopped: previous frame identical to this frame (corrupt stack?) -------------- next part -------------- (gdb) info threads Id Target Id Frame 3 Thread 0xb6c6f460 (LWP 14065) "main_task" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 2 Thread 0xb6ca1460 (LWP 14063) "task-1" 0xb6f61cbe in read () at ../sysdeps/unix/syscall-template.S:81 * 1 Thread 0xb6ff0000 (LWP 14056) "task-1" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-12 14:08 ` Charles Kiorpes @ 2016-02-12 15:25 ` Philippe Gerum 2016-02-12 19:07 ` Charles Kiorpes 0 siblings, 1 reply; 10+ messages in thread From: Philippe Gerum @ 2016-02-12 15:25 UTC (permalink / raw) To: Charles Kiorpes; +Cc: xenomai On 02/12/2016 03:08 PM, Charles Kiorpes wrote: > > > On Fri, Feb 12, 2016 at 5:43 AM, Philippe Gerum <rpm@xenomai.org > <mailto:rpm@xenomai.org>> wrote: > > On 02/11/2016 01:57 PM, Charles Kiorpes wrote: > > > > I attempted to run several tests: 'task-1', 'event-1', and 'mutex-1'. > > Each of these hung indefinitely. A gdb trace indicated that they were > > hanging on __libc_do_syscall() within __pthread_cond_wait() within > > threadobj_cond_wait(). > > > > I have attached the full backtrace from mutex-1 as mutex-1_bt.txt > > > > Ok, if the test suite does not pass, something is badly wrong, so we > should investigate that hang issue before anything else. > > The backtrace reveals that copperplate cannot handshake with a newly > spawned task, this is the purpose of the wait_on_barrier() call over the > context of rt_task_start(). That barrier should be signaled by a call to > threadobj_notify_entry() from the internal trampoline code of the > emerging thread (task_entry() in alchemy/task.c). > > - maybe task_prologue_2() (alchemy/task.c) which is called earlier hangs > indefinitely, and therefore prevents threadobj_notify_entry() from > running? > > - maybe the new thread does not even start for some reason, are we sure > task_entry() is reached (e.g. do we hit a breakpoint there?) > > Could you inspect the current thread list under gdb when the program > hangs? > > Also, I would recommend to enable full debugging for now > (--enable-debug=full) to get accurate line information, assuming the > issue should still show up with a non-optimized code. Hopefully. > > -- > Philippe. > > > I ran the task-1 test under gdb with this Xenomai configuration: > --with-core=mercury \ > --enable-debug=full \ > --enable-registry \ > --enable-smp \ > --enable-pshared \ > --enable-condvar-workaround > > It appears that the new thread is being launched, and getting stuck in > threadobj_wait_start() within task_prologue_2(), as you indicated might > be the case. > I have attached the thread list and a full backtrace for each thread (in > separate files by thread id). > > As per your other message, my kernel configs all include CONFIG_FUTEX. > > I have tried glibc 2.19 and 2.21, as well as RT patched and vanilla kernels. > > Interestingly, when I removed --enable-pshared from my configuration, > the task-1 test passed. > Here is the sync pattern the code normally achieves, once the parent has successfully spawned a child thread, which has to wait for a start signal before it may run application code: 1. parent calls threadobj_start(child) 1.1 child->status |= __THREAD_S_STARTED 1.2 wait for child->status & __THREAD_S_ACTIVE 2. child calls threadobj_wait_start(self) 2.1 wait for self->status & __THREAD_S_STARTED 2.2 raise self->status |= __THREAD_S_ACTIVE All accesses to the status bits are serialized by a per-thread mutex, operated by the threadobj_lock/unlock accessors, which also covers the condvar signaling/waiting as one would expect. When running in pshared mode, thread descriptors (holding ->status, mutex and barrier sync) are obtained from /dev/shm. If --disable-pshared, we are using 100% process-private memory. Case 1: a race when manipulating the thread status due to inconsistent locking. I could not find any so far. Case 2: a cache coherence issue in SMP, also caused by improper locking. Otherwise, the locking should enforce memory barriers as expected. Case 3: anything not mentioned in other cases... - Could you paste/copy the disassembly (objdump -dl rather than gdb's disass) of the wait_on_barrier() function? - Does running both programs with --cpu-affinity=0/1 change the outcome? - Without specifying any affinity this time, could you run the current test with the debug patch below applied (this is clearly not a fix)? The patch forces the code to read the value of the ->status field before waiting on the barrier. With that code in and a backtrace showing locals, we should be able to check the status word when threadobj_wait_start() is entered. diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c index cc64caa..ed85a12 100644 --- a/lib/copperplate/threadobj.c +++ b/lib/copperplate/threadobj.c @@ -1273,7 +1273,9 @@ void threadobj_wait_start(void) /* current->lock free. */ int status; threadobj_lock(current); - status = wait_on_barrier(current, __THREAD_S_STARTED|__THREAD_S_ABORTED); + status = current->status; + if (!(status & __THREAD_S_STARTED)) + status = wait_on_barrier(current, __THREAD_S_STARTED|__THREAD_S_ABORTED); threadobj_unlock(current); /* -- Philippe. ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-12 15:25 ` Philippe Gerum @ 2016-02-12 19:07 ` Charles Kiorpes 2016-02-16 15:40 ` Philippe Gerum 0 siblings, 1 reply; 10+ messages in thread From: Charles Kiorpes @ 2016-02-12 19:07 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai > Here is the sync pattern the code normally achieves, once the parent has > successfully spawned a child thread, which has to wait for a start signal > before it may run application code: > > 1. parent calls threadobj_start(child) > 1.1 child->status |= __THREAD_S_STARTED > 1.2 wait for child->status & __THREAD_S_ACTIVE > > 2. child calls threadobj_wait_start(self) > 2.1 wait for self->status & __THREAD_S_STARTED > 2.2 raise self->status |= __THREAD_S_ACTIVE > > All accesses to the status bits are serialized by a per-thread mutex, > operated by the threadobj_lock/unlock accessors, which also covers the > condvar signaling/waiting as one would expect. > > When running in pshared mode, thread descriptors (holding ->status, mutex > and barrier sync) are obtained from /dev/shm. If --disable-pshared, we are > using 100% process-private memory. > > Case 1: a race when manipulating the thread status due to inconsistent > locking. I could not find any so far. > > Case 2: a cache coherence issue in SMP, also caused by improper locking. > Otherwise, the locking should enforce memory barriers as expected. > > Case 3: anything not mentioned in other cases... > > - Could you paste/copy the disassembly (objdump -dl rather than gdb's > disass) of the wait_on_barrier() function? > > I have attached the disassembly as wait_on_barrier_disas.txt > - Does running both programs with --cpu-affinity=0/1 change the outcome? > > There is no change in behavior when trying any combination of cpu affinities, with either the "task-1" alchemy test or my event test apps. > - Without specifying any affinity this time, could you run the current > test with the debug patch below applied (this is clearly not a fix)? The > patch forces the code to read the value of the ->status field before > waiting on the barrier. With that code in and a backtrace showing locals, > we should be able to check the status word when threadobj_wait_start() is > entered. > diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c > index cc64caa..ed85a12 100644 > --- a/lib/copperplate/threadobj.c > +++ b/lib/copperplate/threadobj.c > @@ -1273,7 +1273,9 @@ void threadobj_wait_start(void) /* current->lock > free. */ > int status; > > threadobj_lock(current); > - status = wait_on_barrier(current, > __THREAD_S_STARTED|__THREAD_S_ABORTED); > + status = current->status; > + if (!(status & __THREAD_S_STARTED)) > + status = wait_on_barrier(current, > __THREAD_S_STARTED|__THREAD_S_ABORTED); > threadobj_unlock(current); > > /* > > -- > Philippe. > I patched in the debug and I have attached the full backtraces of threads 1 and 3 of the "task-1" alchemy test. At the time of the hang: - parent sees status = 73 (matches the flags set during threadobj_start()) - child sees status = 8 (locked?) - Charles -------------- next part -------------- 00001868 <wait_on_barrier>: wait_on_barrier(): /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1200 1868: b580 push {r7, lr} 186a: b0ce sub sp, #312 ; 0x138 186c: af00 add r7, sp, #0 186e: 1d3b adds r3, r7, #4 1870: 6018 str r0, [r3, #0] 1872: 463b mov r3, r7 1874: 6019 str r1, [r3, #0] /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1204 1876: 1d3b adds r3, r7, #4 1878: 681b ldr r3, [r3, #0] 187a: 6a9b ldr r3, [r3, #40] ; 0x28 187c: f8c7 3134 str.w r3, [r7, #308] ; 0x134 /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1205 1880: 463b mov r3, r7 1882: f8d7 2134 ldr.w r2, [r7, #308] ; 0x134 1886: 681b ldr r3, [r3, #0] 1888: 4013 ands r3, r2 188a: 2b00 cmp r3, #0 188c: d148 bne.n 1920 <wait_on_barrier+0xb8> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1207 188e: 1d3b adds r3, r7, #4 1890: 681b ldr r3, [r3, #0] 1892: 6a5b ldr r3, [r3, #36] ; 0x24 1894: f8c7 3130 str.w r3, [r7, #304] ; 0x130 /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1208 1898: f240 0300 movw r3, #0 189c: f2c0 0300 movt r3, #0 18a0: f8c7 312c str.w r3, [r7, #300] ; 0x12c 18a4: 1d3b adds r3, r7, #4 18a6: 681b ldr r3, [r3, #0] 18a8: 3308 adds r3, #8 18aa: f8c7 3128 str.w r3, [r7, #296] ; 0x128 18ae: f107 0308 add.w r3, r7, #8 18b2: 2100 movs r1, #0 18b4: 4618 mov r0, r3 18b6: f7ff fffe bl 0 <__sigsetjmp> 18ba: f8c7 0124 str.w r0, [r7, #292] ; 0x124 18be: f8d7 3124 ldr.w r3, [r7, #292] ; 0x124 18c2: 2b00 cmp r3, #0 18c4: d009 beq.n 18da <wait_on_barrier+0x72> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1208 (discriminator 2) 18c6: f8d7 312c ldr.w r3, [r7, #300] ; 0x12c 18ca: f8d7 0128 ldr.w r0, [r7, #296] ; 0x128 18ce: 4798 blx r3 18d0: f107 0308 add.w r3, r7, #8 18d4: 4618 mov r0, r3 18d6: f7ff fffe bl 0 <__pthread_unwind_next> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1208 (discriminator 3) 18da: f107 0308 add.w r3, r7, #8 18de: 4618 mov r0, r3 18e0: f7ff fffe bl 0 <__pthread_register_cancel> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1209 (discriminator 3) 18e4: 1d3b adds r3, r7, #4 18e6: 6818 ldr r0, [r3, #0] 18e8: f7fe fd80 bl 3ec <__threadobj_tag_unlocked> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1210 (discriminator 3) 18ec: 1d3b adds r3, r7, #4 18ee: 681b ldr r3, [r3, #0] 18f0: f503 7280 add.w r2, r3, #256 ; 0x100 18f4: 1d3b adds r3, r7, #4 18f6: 681b ldr r3, [r3, #0] 18f8: 3308 adds r3, #8 18fa: 4619 mov r1, r3 18fc: 4610 mov r0, r2 18fe: f7ff fffe bl 1364 <threadobj_cond_wait> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1211 (discriminator 3) 1902: 1d3b adds r3, r7, #4 1904: 6818 ldr r0, [r3, #0] 1906: f7fe fd61 bl 3cc <__threadobj_tag_locked> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1212 (discriminator 3) 190a: f107 0308 add.w r3, r7, #8 190e: 4618 mov r0, r3 1910: f7ff fffe bl 0 <__pthread_unregister_cancel> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1213 (discriminator 3) 1914: 1d3b adds r3, r7, #4 1916: 681b ldr r3, [r3, #0] 1918: f8d7 2130 ldr.w r2, [r7, #304] ; 0x130 191c: 625a str r2, [r3, #36] ; 0x24 /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1214 (discriminator 3) 191e: e7aa b.n 1876 <wait_on_barrier+0xe> /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1206 1920: bf00 nop /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1216 1922: f8d7 3134 ldr.w r3, [r7, #308] ; 0x134 /home/debian/Development/build/lib/copperplate/../../../xenomai-3/lib/copperplate/threadobj.c:1217 1926: 4618 mov r0, r3 1928: f507 779c add.w r7, r7, #312 ; 0x138 192c: 46bd mov sp, r7 192e: bd80 pop {r7, pc} -------------- next part -------------- (gdb) thread 1 [Switching to thread 1 (Thread 0xb6ff0000 (LWP 4007))] #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 46 in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S (gdb) bt full #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 No locals. #1 0xb6f5f6f2 in __pthread_cond_wait (cond=0xb6caa194, mutex=0xb6caa09c) at pthread_cond_wait.c:177 _a2tmp = 11 _a2 = 11 _nametmp = 240 _a3tmp = 3 _a3 = 3 _a1 = -1228234344 _v1tmp = -1228234596 _a4tmp = 0 _a1tmp = -1228234344 _a4 = 0 _v1 = -1228234596 _name = 240 __ret = <optimized out> futex_val = 3 buffer = {__routine = 0xb6f5f3e1 <__condvar_cleanup>, __arg = 0xbefffa00, __canceltype = -1225038976, __prev = 0x0} cbuffer = {oldtype = 1, cond = 0xb6caa194, mutex = 0xb6caa09c, bc_seq = 0} err = <optimized out> pshared = <optimized out> pi_flag = 1 val = <optimized out> seq = <optimized out> #2 0xb6f9cfa6 in threadobj_cond_wait (cond=0xb6caa194, lock=0xb6caa09c) at ../../../xenomai-3/lib/copperplate/threadobj.c:980 ret = -1228234604 #3 0xb6f9d554 in wait_on_barrier (thobj=0xb6caa094, mask=16) at ../../../xenomai-3/lib/copperplate/threadobj.c:1231 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-619902981, -754243412, -1090520024, 0, 0, -1090520496, 0, 0, -1224740864, 0 <repeats 17 times>, 2, 0, 5, 0, 1, -1224758200, -1659757782, -1224874387, 0, -1224738124, -1224757760, -1090520312, -1224757760, 0, -1, -1224796312, -1225176048, -1224794112, -1225021852, -1224802304, -1224801088, 1, 0, -1224859353, -1224796312, 1, 5, 0, 0, 1, -1225038976, 0, -1090520024, 0, -1090520232, -1090520024, 0, 0}, __mask_was_saved = 0}}, __pad = {0xbefffc28, 0x0, 0x0, 0xb6ca496c}} __cancel_routine = 0xb6f5e225 <__GI___pthread_mutex_unlock> __cancel_arg = 0xb6caa09c __not_first_call = 0 oldstate = 0 status = 73 #4 0xb6f9d606 in threadobj_start (thobj=0xb6caa094) at ../../../xenomai-3/lib/copperplate/threadobj.c:1268 current = 0xb6ca496c ret = 0 oldstate = 1 #5 0xb6fc2d10 in rt_task_start (task=0x20ed0 <t_main>, entry=0x10985 <main_task>, arg=0xdeadbeef) at ../../../xenomai-3/lib/alchemy/task.c:634 tcb = 0xb6ca9f6c svc = {cancel_type = -559038737} ret = 67973 #6 0x00010a62 in main (argc=1, argv=0x21548) at task-1.c:26 ret = 0 (gdb) -------------- next part -------------- (gdb) thread 3 [Switching to thread 3 (Thread 0xb6c6f460 (LWP 4016))] #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 46 in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S (gdb) bt full #0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46 No locals. #1 0xb6f5f6f2 in __pthread_cond_wait (cond=0xb6caa194, mutex=0xb6caa09c) at pthread_cond_wait.c:177 _a2tmp = 11 _a2 = 11 _nametmp = 240 _a3tmp = 1 _a3 = 1 _a1 = -1228234344 _v1tmp = -1228234596 _a4tmp = 0 _a1tmp = -1228234344 _a4 = 0 _v1 = -1228234596 _name = 240 __ret = <optimized out> futex_val = 1 buffer = {__routine = 0xb6f5f3e1 <__condvar_cleanup>, __arg = 0xb6c6ec18, __canceltype = -1225038976, __prev = 0x0} cbuffer = {oldtype = 1, cond = 0xb6caa194, mutex = 0xb6caa09c, bc_seq = 0} err = <optimized out> pshared = <optimized out> pi_flag = 1 val = <optimized out> seq = <optimized out> #2 0xb6f9cfa6 in threadobj_cond_wait (cond=0xb6caa194, lock=0xb6caa09c) at ../../../xenomai-3/lib/copperplate/threadobj.c:980 ret = -1228234596 #3 0xb6f9d554 in wait_on_barrier (thobj=0xb6caa094, mask=5) at ../../../xenomai-3/lib/copperplate/threadobj.c:1231 __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {-751562301, -754243412, -1228476860, -1090520124, 0, -1228477336, -1228476512, -1224802304, 0, -1228474732, 0 <repeats 16 times>, 1, -1224758200, -1228477152, -1228477004, 0, -1228477120, -1224757760, -1228477168, -1224757760, -1224738124, -1, 0, -1225177664, -1224794112, 1, -1224794112, 0, 0, 0, 0, -1224793672, -1228477112, -1224793672, 0, -1, 0, -1225027584, 119632, -1225021852, 0, -1228234596, -1090520124, 0, -1228477040, -1228476512, -1224802304, 0, -1225404053}, __mask_was_saved = 0}}, __pad = {0xb6c6ee80, 0x0, 0x0, 0xb6c6ed90}} __cancel_routine = 0xb6f5e225 <__GI___pthread_mutex_unlock> __cancel_arg = 0xb6caa09c __not_first_call = 0 oldstate = 0 status = 8 #4 0xb6f9d67a in threadobj_wait_start () at ../../../xenomai-3/lib/copperplate/threadobj.c:1299 current = 0xb6caa094 status = 8 #5 0xb6fc2770 in task_prologue_2 (tcb=0xb6ca9f6c) at ../../../xenomai-3/lib/alchemy/task.c:211 ret = 0 #6 0xb6fc27b6 in task_entry (arg=0xb6ca9f6c) at ../../../xenomai-3/lib/alchemy/task.c:227 __ret = -1228476896 tcb = 0xb6ca9f6c svc = {cancel_type = 1} ret = -1228475296 __FUNCTION__ = "task_entry" #7 0xb6f9a2d8 in thread_trampoline (arg=0xbefffb94) at ../../../xenomai-3/lib/copperplate/internal.c:251 cta = 0xbefffb94 _cta = {stacksize = 0, detachstate = 1, policy = 1, param_ex = {__sched_priority = 99, sched_u = {rr = {__sched_rr_quantum = {tv_sec = -1225254836, tv_nsec = 0}}}}, prologue = 0xb6fc2719 <task_prologue_1>, run = 0xb6fc27a5 <task_entry>, arg = 0xb6ca9f6c, __reserved = {status = -38, warm = { __size = "\001\000\000\000\000\000\000\000\060m\376\266\001\000\000", __align = 1}, released = 0x10cac}} released = {__size = '\000' <repeats 15 times>, __align = 0} ret = 0 __FUNCTION__ = "thread_trampoline" #8 0xb6f5b424 in start_thread (arg=0x0) at pthread_create.c:335 pd = 0x0 unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-751561765, -754497110, -1228475296, -1090520280, 0, -1228476816, -1228476512, -1224802304, 0, -1228474732, 0 <repeats 54 times>}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #9 0xb6eb243c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6 No locals. Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-12 19:07 ` Charles Kiorpes @ 2016-02-16 15:40 ` Philippe Gerum 2016-02-17 21:34 ` Charles Kiorpes 0 siblings, 1 reply; 10+ messages in thread From: Philippe Gerum @ 2016-02-16 15:40 UTC (permalink / raw) To: Charles Kiorpes; +Cc: xenomai On 02/12/2016 08:07 PM, Charles Kiorpes wrote: > > Here is the sync pattern the code normally achieves, once the parent > has successfully spawned a child thread, which has to wait for a > start signal before it may run application code: > > 1. parent calls threadobj_start(child) > 1.1 child->status |= __THREAD_S_STARTED > 1.2 wait for child->status & __THREAD_S_ACTIVE > > 2. child calls threadobj_wait_start(self) > 2.1 wait for self->status & __THREAD_S_STARTED > 2.2 raise self->status |= __THREAD_S_ACTIVE > > All accesses to the status bits are serialized by a per-thread > mutex, operated by the threadobj_lock/unlock accessors, which also > covers the condvar signaling/waiting as one would expect. > > When running in pshared mode, thread descriptors (holding ->status, > mutex and barrier sync) are obtained from /dev/shm. If > --disable-pshared, we are using 100% process-private memory. > > Case 1: a race when manipulating the thread status due to > inconsistent locking. I could not find any so far. > > Case 2: a cache coherence issue in SMP, also caused by improper > locking. Otherwise, the locking should enforce memory barriers as > expected. > > Case 3: anything not mentioned in other cases... > > - Could you paste/copy the disassembly (objdump -dl rather than > gdb's disass) of the wait_on_barrier() function? > > > I have attached the disassembly as wait_on_barrier_disas.txt > > > - Does running both programs with --cpu-affinity=0/1 change the outcome? > > > There is no change in behavior when trying any combination of cpu > affinities, with either the "task-1" alchemy test or my event test apps. > > > - Without specifying any affinity this time, could you run the > current test with the debug patch below applied (this is clearly not > a fix)? The patch forces the code to read the value of the ->status > field before waiting on the barrier. With that code in and a > backtrace showing locals, we should be able to check the status word > when threadobj_wait_start() is entered. > > > diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c > index cc64caa..ed85a12 100644 > --- a/lib/copperplate/threadobj.c > +++ b/lib/copperplate/threadobj.c > @@ -1273,7 +1273,9 @@ void threadobj_wait_start(void) /* > current->lock free. */ > int status; > > threadobj_lock(current); > - status = wait_on_barrier(current, > __THREAD_S_STARTED|__THREAD_S_ABORTED); > + status = current->status; > + if (!(status & __THREAD_S_STARTED)) > + status = wait_on_barrier(current, > __THREAD_S_STARTED|__THREAD_S_ABORTED); > threadobj_unlock(current); > > /* > > -- > Philippe. > > > I patched in the debug and I have attached the full backtraces of > threads 1 and 3 of the "task-1" alchemy test. > > At the time of the hang: > - parent sees status = 73 (matches the flags set during > threadobj_start()) > - child sees status = 8 (locked?) > Yes, this is a debugging status with --enable-debug to check for locking consistency, also raised while holding the lock. From the traces, it looks like the child never sees any of the status bits raised by the parent when entering wait_on_barrier(), although its priority is strictly lower. At this point, we need to consider the toolchain. Which one are you using, and specifically, is it built for supporting multi-thread applications (such as implementing atomic operations)? I have attached a test code basically reproducing what copperplate does under the hood over a native kernel. e.g.: (term-1) ./basic-shm --create (term-2) ./basic-shm --listen or you can invert the logic by having the listener create the shm segment, waiting for the signals: (term-1) ./basic-shm --create --listen (term-2) ./basic-shm The code is not smart enough to detect when the listener attempts to reuse an obsolete shared memory file from a previous run. For this reason, the creator should always run first in a new test. -- Philippe. -------------- next part -------------- A non-text attachment was scrubbed... Name: basic-shm.c Type: text/x-csrc Size: 3294 bytes Desc: not available URL: <http://xenomai.org/pipermail/xenomai/attachments/20160216/1e20b172/attachment.c> -------------- next part -------------- CFLAGS = -O2 -g LDFLAGS = all: basic-shm %: %.c $(CROSS_COMPILE)$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< -lpthread -lrt clean: $(RM) basic-shm ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-16 15:40 ` Philippe Gerum @ 2016-02-17 21:34 ` Charles Kiorpes 0 siblings, 0 replies; 10+ messages in thread From: Charles Kiorpes @ 2016-02-17 21:34 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai The provided test program also did not perform correctly in the current environment (no signals were received by the listener). However, after testing various toolchains, I tried it on the same hardware (Beagleboard-x15 sample) with a TI-supplied rt kernel, and it did run correctly. This indicates to me that your earlier instinct was correct; this is a kernel configuration problem. I am in the process of doing a thorough comparison of the kernel configs to determine what option(s) made the difference. If I can find the offending kernel configuration option(s), I will post them here. This will take some time though, as I will need to do quite a bit of building and testing of kernels. Thank you for your assistance, - Charles On Tue, Feb 16, 2016 at 10:40 AM, Philippe Gerum <rpm@xenomai.org> wrote: > On 02/12/2016 08:07 PM, Charles Kiorpes wrote: > > > > Here is the sync pattern the code normally achieves, once the parent > > has successfully spawned a child thread, which has to wait for a > > start signal before it may run application code: > > > > 1. parent calls threadobj_start(child) > > 1.1 child->status |= __THREAD_S_STARTED > > 1.2 wait for child->status & __THREAD_S_ACTIVE > > > > 2. child calls threadobj_wait_start(self) > > 2.1 wait for self->status & __THREAD_S_STARTED > > 2.2 raise self->status |= __THREAD_S_ACTIVE > > > > All accesses to the status bits are serialized by a per-thread > > mutex, operated by the threadobj_lock/unlock accessors, which also > > covers the condvar signaling/waiting as one would expect. > > > > When running in pshared mode, thread descriptors (holding ->status, > > mutex and barrier sync) are obtained from /dev/shm. If > > --disable-pshared, we are using 100% process-private memory. > > > > Case 1: a race when manipulating the thread status due to > > inconsistent locking. I could not find any so far. > > > > Case 2: a cache coherence issue in SMP, also caused by improper > > locking. Otherwise, the locking should enforce memory barriers as > > expected. > > > > Case 3: anything not mentioned in other cases... > > > > - Could you paste/copy the disassembly (objdump -dl rather than > > gdb's disass) of the wait_on_barrier() function? > > > > > > I have attached the disassembly as wait_on_barrier_disas.txt > > > > > > - Does running both programs with --cpu-affinity=0/1 change the > outcome? > > > > > > There is no change in behavior when trying any combination of cpu > > affinities, with either the "task-1" alchemy test or my event test apps. > > > > > > - Without specifying any affinity this time, could you run the > > current test with the debug patch below applied (this is clearly not > > a fix)? The patch forces the code to read the value of the ->status > > field before waiting on the barrier. With that code in and a > > backtrace showing locals, we should be able to check the status word > > when threadobj_wait_start() is entered. > > > > > > diff --git a/lib/copperplate/threadobj.c > b/lib/copperplate/threadobj.c > > index cc64caa..ed85a12 100644 > > --- a/lib/copperplate/threadobj.c > > +++ b/lib/copperplate/threadobj.c > > @@ -1273,7 +1273,9 @@ void threadobj_wait_start(void) /* > > current->lock free. */ > > int status; > > > > threadobj_lock(current); > > - status = wait_on_barrier(current, > > __THREAD_S_STARTED|__THREAD_S_ABORTED); > > + status = current->status; > > + if (!(status & __THREAD_S_STARTED)) > > + status = wait_on_barrier(current, > > __THREAD_S_STARTED|__THREAD_S_ABORTED); > > threadobj_unlock(current); > > > > /* > > > > -- > > Philippe. > > > > > > I patched in the debug and I have attached the full backtraces of > > threads 1 and 3 of the "task-1" alchemy test. > > > > At the time of the hang: > > - parent sees status = 73 (matches the flags set during > > threadobj_start()) > > - child sees status = 8 (locked?) > > > > Yes, this is a debugging status with --enable-debug to check for locking > consistency, also raised while holding the lock. From the traces, it > looks like the child never sees any of the status bits raised by the > parent when entering wait_on_barrier(), although its priority is > strictly lower. > > At this point, we need to consider the toolchain. Which one are you > using, and specifically, is it built for supporting multi-thread > applications (such as implementing atomic operations)? > > I have attached a test code basically reproducing what copperplate does > under the hood over a native kernel. > > e.g.: > (term-1) ./basic-shm --create > (term-2) ./basic-shm --listen > > or you can invert the logic by having the listener create the shm > segment, waiting for the signals: > > (term-1) ./basic-shm --create --listen > (term-2) ./basic-shm > > The code is not smart enough to detect when the listener attempts to > reuse an obsolete shared memory file from a previous run. For this > reason, the creator should always run first in a new test. > > -- > Philippe. > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core 2016-02-11 12:57 ` Charles Kiorpes 2016-02-12 10:43 ` Philippe Gerum @ 2016-02-12 10:55 ` Philippe Gerum 1 sibling, 0 replies; 10+ messages in thread From: Philippe Gerum @ 2016-02-12 10:55 UTC (permalink / raw) To: Charles Kiorpes; +Cc: xenomai On 02/11/2016 01:57 PM, Charles Kiorpes wrote: > will be trying combinations of kernels and configuration options > throughout the course of the day. > The embedded-specific options in the kernel configuration (CONFIG_EMBEDDED) should be checked, particularly CONFIG_FUTEX. -- Philippe. ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2016-02-17 21:34 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-02-10 18:41 [Xenomai] Process shared rt_event_wait() never signaled on ARM with Mercury core Charles Kiorpes 2016-02-10 20:21 ` Philippe Gerum 2016-02-11 12:57 ` Charles Kiorpes 2016-02-12 10:43 ` Philippe Gerum 2016-02-12 14:08 ` Charles Kiorpes 2016-02-12 15:25 ` Philippe Gerum 2016-02-12 19:07 ` Charles Kiorpes 2016-02-16 15:40 ` Philippe Gerum 2016-02-17 21:34 ` Charles Kiorpes 2016-02-12 10:55 ` Philippe Gerum
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.