From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <53BECD7D.2060003@xenomai.org> Date: Thu, 10 Jul 2014 19:29:33 +0200 From: Philippe Gerum MIME-Version: 1.0 References: <1404677764.4624.YahooMailNeo@web171603.mail.ir2.yahoo.com> <53B9BC6A.9070501@xenomai.org> <1404835832.96010.YahooMailNeo@web171606.mail.ir2.yahoo.com> <53BD10C2.60109@xenomai.org> <1404942153.53492.YahooMailNeo@web171605.mail.ir2.yahoo.com> <53BE5DC6.9040406@xenomai.org> <1405012399.63072.YahooMailNeo@web171605.mail.ir2.yahoo.com> In-Reply-To: <1405012399.63072.YahooMailNeo@web171605.mail.ir2.yahoo.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai] Issue with cobalt_monitor_wait() List-Id: Discussions about the Xenomai project List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Matthias Schneider , "xenomai@xenomai.org" On 07/10/2014 07:13 PM, Matthias Schneider wrote: > > > > > ----- Original Message ----- >> From: Philippe Gerum >> To: Matthias Schneider ; "xenomai@xenomai.org" >> Cc: >> Sent: Thursday, July 10, 2014 11:32 AM >> Subject: Re: [Xenomai] Issue with cobalt_monitor_wait() >> >> On 07/09/2014 11:42 PM, Matthias Schneider wrote: >>> ----- Original Message ----- >>> >>>> From: Philippe Gerum >>>> To: Matthias Schneider ; >> "xenomai@xenomai.org" >>>> Cc: >>>> Sent: Wednesday, July 9, 2014 11:52 AM >>>> Subject: Re: [Xenomai] Issue with cobalt_monitor_wait() >>>> >>>> On 07/08/2014 06:10 PM, Matthias Schneider wrote: >>>>> ----- Original Message ----- >>>>> >>>>>> From: Philippe Gerum >>>>>> To: Matthias Schneider ; >>>> "xenomai@xenomai.org" >>>>>> Cc: >>>>>> Sent: Sunday, July 6, 2014 11:15 PM >>>>>> Subject: Re: [Xenomai] Issue with cobalt_monitor_wait() >>>>>> >>>>>> On 07/06/2014 10:16 PM, Matthias Schneider wrote: >>>>>> >>>>>> [snip] >>>>>> >>>>>>> On thing I do not understand is: >>>>>>> >>>>>>> in kernel cobalt_monitor_wait(), the synch object is >> unlocked via >>>>>>> xnsynch_release(). What happens if this synchobj was >> locked via >>>>>>> mon->gate.fastlock ? Shouldnt that also be released? >>>>>>> >>>>>> >>>>>> xnsynch_release() handles fastlocks as well. >>>>>> >>>>>> >>>>>>> What other reason could there be if the synch object >> was released >>>>>>> via xnsynch_release, xnsynch_acquire was interrupted >> for >>>>>>> xnsynch_release to block? >>>>>>> >>>>>> >>>>>> Since the issue seems to be easily reproducible, could you >> send a >>>>>> self-contained piece of code illustrating it? >>>>>> >>>>>> Also, please mention if you are seeing this issue only when >> running >>>> your >>>>>> app over GDB, or if it currently happens without any debugger >> attached. >>>>>> >>>>>> TIA, >>>>> >>>>> >>>>> It seems I have not described the problematic scenario completely >> - >>>>> >>>>> there were two other threads that call called syncobj_lock() >>>>> / cobalt_monitor_enter() at about the same time. (Actually there >>>>> are three concurrent on the queue that is being tested, two >> receive >>>>> operation and one send operation). I am pretty sure that the >> issue is >>>>> extremely timing dependent. >>>>> >>>>> Anyway, the testcase would be >>>>> >>>>> queue_test_receive_peek_multiple_tasks() >>>>> >>>> >>>> I could not reproduce the issue yet, but could you check if this patch >>>> has any influence on this bug? TIA, >>>> >>>> diff --git a/kernel/cobalt/posix/syscall.c >> b/kernel/cobalt/posix/syscall.c >>>> index d921d81..3856794 100644 >>>> --- a/kernel/cobalt/posix/syscall.c >>>> +++ b/kernel/cobalt/posix/syscall.c >>>> @@ -156,7 +156,7 @@ static struct xnsyscall cobalt_syscalls[] = { >>>> SKINCALL_DEF(sc_cobalt_monitor_enter, cobalt_monitor_enter, >> primary), >>>> SKINCALL_DEF(sc_cobalt_monitor_wait, cobalt_monitor_wait, >>>> nonrestartable), >>>> SKINCALL_DEF(sc_cobalt_monitor_sync, cobalt_monitor_sync, >>>> nonrestartable), >>>> - SKINCALL_DEF(sc_cobalt_monitor_exit, cobalt_monitor_exit, >> primary), >>>> + SKINCALL_DEF(sc_cobalt_monitor_exit, cobalt_monitor_exit, >> nonrestartable), >>>> SKINCALL_DEF(sc_cobalt_event_init, cobalt_event_init, current), >>>> SKINCALL_DEF(sc_cobalt_event_destroy, cobalt_event_destroy, >> current), >>>> SKINCALL_DEF(sc_cobalt_event_wait, cobalt_event_wait, primary), >>>> diff --git a/lib/cobalt/internal.c b/lib/cobalt/internal.c >>>> index e0d990d..6c1331d 100644 >>>> --- a/lib/cobalt/internal.c >>>> +++ b/lib/cobalt/internal.c >>>> @@ -230,6 +230,7 @@ int cobalt_monitor_exit(cobalt_monitor_t *mon) >>>> struct cobalt_monitor_data *datp; >>>> unsigned long status; >>>> xnhandle_t cur; >>>> + int ret; >>>> >>>> __sync_synchronize(); >>>> >>>> @@ -246,9 +247,13 @@ int cobalt_monitor_exit(cobalt_monitor_t *mon) >>>> if (xnsynch_fast_release(&datp->owner, cur)) >>>> return 0; >>>> syscall: >>>> - return XENOMAI_SKINCALL1(__cobalt_muxid, >>>> - sc_cobalt_monitor_exit, >>>> - mon); >>>> + do >>>> + ret = XENOMAI_SKINCALL1(__cobalt_muxid, >>>> + sc_cobalt_monitor_exit, >>>> + mon); >>>> + while (ret == -EINTR); >>>> + >>>> + return ret; >>>> } >>>> >>>> int cobalt_monitor_wait(cobalt_monitor_t *mon, int event, >>>> >>> >>> >>> Hm, it seems when I run into the issue, cobalt_monitor_exit() isnt >>> called at all... >>> >>> Having compiled the cobalt kernel without optimization, >>> I noticed that cobalt_monitor_wait() actually sets u_ret = -EINTR and >>> apparently cobalt_monitor_enter_inner() seems to work, thus setting ret >>> to 0. However, in internal.c:cobalt_monitor_wait (in user mode), >>> both ret and opret seem to be set to -EINTR. This would explain that >>> the second call of internal.c:cobalt_monitor_wait to cobalt_monitor_enter >>> will block indefinitely since the sync object is already locked. >>> >>> Investigating what else happens on the way back to user mode, it seems >>> that the return code is changed from 0 to -EINTR by the following stack: >>> >>> #0 __xn_error_return (regs=0xde0fffb0, v=-4) at >> arch/arm/xenomai/include/asm/xenomai/syscall.h:62 >>> #1 prepare_for_signal (p=, >> thread=thread@entry=0xde702e08, regs=regs@entry=0xde0fffb0, >> sysflags=sysflags@entry=134) at kernel/xenomai/shadow.c:1842 >>> #2 0xc00c68a8 in handle_head_syscall (regs=0xde0fffb0, ipd=0xc07d63c0 >> ) at kernel/xenomai/shadow.c:1996 >>> #3 ipipe_syscall_hook (ipd=0xc07d63c0 , >> regs=0xde0fffb0) at kernel/xenomai/shadow.c:2164 >>> #4 0xc00959a8 in __ipipe_notify_syscall (regs=regs@entry=0xde0fffb0) at >> kernel/ipipe/core.c:982 >>> #5 0xc0015c90 in __ipipe_syscall_root (scno=, >> regs=0xde0fffb0) at arch/arm/kernel/ipipe.c:417 >>> >>> Apperently, the assumption of internal.c:cobalt_monitor_wait that a >>> syscall return -EINTR indicates a failure to re-lock the sync object >>> does not hold in this case. There are probably other cases where >>> the same scenario may occur >>> >>> Unfortunately I do not yet know how to resolve this issue... >>> >> >> Actually, you did it. Thanks for the analysis. As you mentioned, the basic issue >> is with relocking the monitor gate upon EINTR, which is wrong: there must be a >> reason why we do this from userland... >> The reason is with any blocking Cobalt syscall which must be aborted upon Linux >> signal receipt, which causes XNBREAK to be present in the thread state flags >> (handle_sigwake_event -> __xnshadow_kick()). And we must not hold the gate >> lock until the signal handler has run. >> >> When a signal hits the sleeping syscall, we must unwind the context all way down >> the regular Linux syscall path, so that a signal frame is built for it. As part >> of this process, prepare_for_signal() switches the signaled context from primary >> to secondary mode. >> >> In short, receiving EINTR in kernel space waiting for a monitor means unwinding >> back to the userland call site first, keeping the monitor gate free while >> running the handler, then grabbing the gate lock anew prior to returning to the >> caller. >> >> Unblocking a thread forcibly can also happen when the latter receives the >> internal/special/not-so-hidden SIGRELS notification (see __cobalt_kill()), in >> which case XNBREAK is raised too. In such a case, we will relock from userland >> the same way. >> >> I need to review the entire machinery for more non-sense of mine, but in the >> meantime, could you try this patch? >> >> TIA, >> >> diff --git a/kernel/cobalt/posix/monitor.c b/kernel/cobalt/posix/monitor.c >> index 0ecaa6a..a61d028 100644 >> --- a/kernel/cobalt/posix/monitor.c >> +++ b/kernel/cobalt/posix/monitor.c >> @@ -283,9 +283,11 @@ int cobalt_monitor_wait(struct cobalt_monitor_shadow __user >> *u_mon, >> if (list_empty(&mon->waiters) && >> !xnsynch_pended_p(&mon->drain)) >> datp->flags &= ~COBALT_MONITOR_PENDED; >> >> - if (info & XNBREAK) >> + if (info & XNBREAK) { >> opret = -EINTR; >> - else if (info & XNTIMEO) >> + goto out; >> + } >> + if (info & XNTIMEO) >> opret = -ETIMEDOUT; >> >> } >> >> -- >> Philippe. >> > > Thanks, current forge/next including the above patch finally passes the test on > my setup. However I seem to be unable to determine which signal actually interrupts > the syscall. debugging all signals with gdb does not show a single occurance... > Because we cheat, we route internal notifications we need to deliver from secondary mode via SIGWINCH, and gdb won't trap it by default. -- Philippe.