From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <48EA0743.9040001@domain.hid>
Date: Mon, 06 Oct 2008 14:40:35 +0200
From: Jan Kiszka
MIME-Version: 1.0
References: <48E76494.9030901@domain.hid> <48E9FECF.1070005@domain.hid> <48EA0517.3030205@domain.hid>
In-Reply-To: <48EA0517.3030205@domain.hid>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai-core] gdb lockup on multi-threaded process exit
List-Id: "Xenomai life and development \(bug reports, patches, discussions\)"
To: Gilles Chanteperdrix
Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Hi,
>>>
>>> I'm banging my head against this issue for several days now, first
>>> trying to sort out an unrelated bug I also came across at this chance,
>>> then trying to understand what happens, and finally getting mad about
>>> why this may only happen with Xenomai:
>>>
>>> One process, two threads, running under gdb control (no breakpoints,
>>> just the automatically set ones that track thread creation/destruction).
>>> All happens already with only one CPU. The first thread decides to issue
>>> exit() exactly while the second one is on its way from primary to
>>> secondary mode due to running on a breakpoint (int3 -> xnpod_trap_fault
>>> -> xnshadow_relax...). The group exit of thread A causes SIGKILL to be
>>> set in thread B, but triggers no further actions due to B already being
>>> awake and on its way to queue and handle the other signal (SIGTRAP). Now
>>> when B comes to dequeue the next signal it finds SIGTRAP and SIGKILL
>>> set, but picks up SIGTRAP due to its lower number. Now ptrace causes B
>>> to stop, gdb gets confused, sends A, which is already a zombie, a
>>> SIGSTOP and waits on it to confirm this stop - which never happens. If
>>> someone is interested, I can provide an LTTng dump of this scenario.
>>>
>>> My problem is now that I still don't understand what prevents this
>>> deadlock on vanilla Linux. Does Xenomai create a thread schedule here
>>> that is impossible there? Or does it only widen an otherwise very
>>> small race window that also exists with mainline? Before making a fool
>>> of myself on LKML, I would like to collect some further ideas on the
>>> workaround or fix(?) below that cures this deadlock for me.
>> After reading this comment
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3d749b9e676b26584a47e75c235aa6f69d0697ae
>>
>> I'm now about to escalate the issue to LKML. This really looks like a
>> mainline bug, probably just triggered more quickly by the large latency
>> between signal queuing and receiver scheduling that the
>> primary->secondary mode switch introduces.
>
> That said, I think gdb is buggy too: the kill function probably returns
> some error which says that the thread no longer exists, which gdb
> probably ignores since it awaits a signal from that killed thread.

According to my traces, there is no error returned. However, gdb /may/
see that the group leader, which issued the sys_exit_group, is now in
TASK_DEAD state - before trying to block on it, becoming TASK_TRACED again.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
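
The ordering described in the quoted report, where thread B has both SIGTRAP and SIGKILL pending and dequeues SIGTRAP first because of its lower number, can be illustrated with a small user-space sketch. This is only a simplified model of "lowest-numbered pending signal wins", not the kernel's dequeue code; the helper names sig_bit(), lowest_pending_signal() and first_signal_for_thread_b() are hypothetical and exist only for this illustration.

```c
#include <signal.h>

/* Hypothetical helper: bit (sig - 1) of the mask represents signal 'sig'. */
unsigned long sig_bit(int sig)
{
    return 1UL << (sig - 1);
}

/*
 * Hypothetical helper: return the lowest-numbered signal set in a
 * pending bitmask, or 0 if the mask is empty. This models only the
 * "lowest number first" rule mentioned in the report.
 */
int lowest_pending_signal(unsigned long pending)
{
    int max_sig = (int)(8 * sizeof(pending));

    for (int sig = 1; sig <= max_sig; sig++) {
        if (pending & sig_bit(sig))
            return sig;
    }
    return 0;
}

/*
 * The situation of thread B in the report: SIGTRAP (from the
 * breakpoint) and SIGKILL (from the group exit) are both pending.
 */
int first_signal_for_thread_b(void)
{
    unsigned long pending = sig_bit(SIGTRAP) | sig_bit(SIGKILL);

    return lowest_pending_signal(pending);
}
```

On Linux, SIGTRAP is signal 5 and SIGKILL is signal 9, so in this model first_signal_for_thread_b() yields SIGTRAP, which matches the sequence in the report: B handles the trap and stops under ptrace before it ever acts on the SIGKILL.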