From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4B18DFD6.1030907@domain.hid>
Date: Fri, 04 Dec 2009 11:09:26 +0100
From: Gilles Chanteperdrix
MIME-Version: 1.0
References: <4B1850B6.4010906@domain.hid> <1259920779.2174.77.camel@domain.hid>
In-Reply-To: <1259920779.2174.77.camel@domain.hid>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai-help] Problem with pthread_cond_wait
List-Id: Help regarding installation and common use of Xenomai
To: Philippe Gerum
Cc: "Soboljew, Patrick", xenomai@xenomai.org

Philippe Gerum wrote:
> On Fri, 2009-12-04 at 00:58 +0100, Gilles Chanteperdrix wrote:
>> Soboljew, Patrick wrote:
>>> Hello all,
>>>
>>> I have a strange problem concerning the posix skin of xenomai (Ver.
>>> 2.4.9.1). Whenever two or more threads call 'pthread_cond_wait' and I
>>> want to interrupt the program with SIGINT (CTRL-c) only the main thread
>>> and the first thread that called 'pthread_cond_wait' get the signal. The
>>> remaining threads are not interrupted so I have created some zombies
>>> here. I discovered this problem when I tried to debug some code with the
>>> ACE/TAO Framework which calls these functions in a similar way. The
>>> debugger also has problems to interrupt these threads.
>>>
>>> The small code example illustrates what I did.
>>>
>>> Has anyone an idea what exactly causes this problem?
>> The problem is the way pthread_cond_wait handles interruption by
>> signals: the thread tries to re-acquire the mutex before returning to
>> user-space, so as to make the system call restartable. It may
>> therefore be suspended if the mutex is not free (which happens for the
>> second thread in your test program). In that case it does not return
>> to user-space, the signal remains pending and unhandled, and you get
>> the disturbing behaviour when hitting ctrl-c or when running inside
>> gdb.
>>
>> We can fix that by making the syscall non-restartable, and letting
>> user-space handle the mutex re-locking (which it fortunately already
>> does).
>>
>> Note that automatically restarting pthread_cond_wait is not even
>> correct, since we could miss a pthread_cond_signal if it was sent
>> between the time when the thread was unblocked from the cond wait, and
>> the time it starts waiting again. So, in this case, it is better to
>> return to the caller: you get a spurious wake-up, which means that the
>> caller must run pthread_cond_wait in a loop for the program to run
>> correctly.
>>
>> Anyway, here is a quick fix, could you try it?
>>
>> Notes for Philippe: the quick fix does not break the ABI, but it also
>> changes the behaviour of non-restartable syscalls: they forcibly return
>> -EINTR. This may look like a disruptive change for the 2.4 branch, but
>> there is actually currently only one non-restartable syscall in Xenomai
>> 2.4: pthread_mutex_unlock, and it is ready to handle -EINTR. However,
>> while at it, we should mark a few more syscalls as non-restartable,
>> notably nanosleep and select, because they use relative timeouts. I
>> think a lot of syscalls in the native skin are using relative timeouts
>> too, and should be marked as non-restartable, but this implies
>> documenting the return value -EINTR.
>
> This seems the wrong approach, at the very least for the native skin.
> -EINTR is already used and documented there, as a possible return value
> for blocking syscalls which have been forcibly unblocked (i.e. via
> rt_task_unblock()).
>
> Returning -EINTR upon signal interrupt as well would confuse the
> application, i.e. what was the actual reason for that syscall to return?
> As a corollary, a bunch of applications are currently not handling
> -EINTR, precisely because rt_task_unblock() is not used in their
> application; so making all timed syscalls non-restartable might break
> them badly.
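
As an aside on the spurious wake-up point quoted above, here is a
minimal sketch of what "run pthread_cond_wait in a loop" means on the
caller's side. It uses only standard POSIX calls; struct event,
event_wait and event_post are hypothetical names made up for this
illustration, they come neither from the test program nor from Xenomai.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot event; "ready" is the predicate protected by "lock". */
struct event {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool ready;
};

/* Block until the event has been posted, then consume it. */
void event_wait(struct event *ev)
{
    assert(ev != NULL);
    pthread_mutex_lock(&ev->lock);
    /* Re-check the predicate after every return from pthread_cond_wait():
     * a return does not by itself prove that event_post() was called. */
    while (!ev->ready)
        pthread_cond_wait(&ev->cond, &ev->lock);
    ev->ready = false;
    pthread_mutex_unlock(&ev->lock);
}

/* Set the predicate under the lock, then wake one waiter. */
void event_post(struct event *ev)
{
    assert(ev != NULL);
    pthread_mutex_lock(&ev->lock);
    ev->ready = true;
    pthread_cond_signal(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}
```

With this pattern a wake-up caused by a signal is harmless: the waiter
sees that "ready" is still false and simply waits again.
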
>
> The best fix would rather be to convert relative timeouts to their
> XN_REALTIME form internally, the way it is done for a few syscalls in
> 2.5 already.

Yeah, but that would be an ABI change.

In the meantime, I thought more about that: actually, syscalls with
relative timeouts are restartable if the timeout is passed by pointer
and the syscall updates the timeout upon interruption by a signal, the
way nanosleep does (so, in a sense, nanosleep is restartable, contrary
to what I said). Even select could be restartable, but the specification
mandates that it returns EINTR upon interruption by a signal and is not
restarted automatically.

-- 
Gilles
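
PS: to illustrate the nanosleep remark, here is a minimal user-space
sketch, assuming plain POSIX nanosleep(). On EINTR the call stores the
unslept time through its second argument, so the caller can resume the
sleep with what is left instead of starting over. sleep_fully is a
hypothetical helper name used only for this example; this is not the
Xenomai syscall code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Sleep for the whole relative interval *req, resuming after signals.
 * Returns 0 on success, or -1 with errno set for any error but EINTR. */
int sleep_fully(const struct timespec *req)
{
    struct timespec want, left;

    assert(req != NULL);
    want = *req;

    while (nanosleep(&want, &left) == -1) {
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: "left" holds the remaining time,
         * so the next call only sleeps for what is still due. */
        want = left;
    }
    return 0;
}
```

Because the remaining time is carried over, repeated interruptions do
not restart the full relative timeout on each pass.
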