[Xenomai-core] gdb lockup on multi-threaded process exit

* [Xenomai-core] gdb lockup on multi-threaded process exit
@ 2008-10-04 12:41 Jan Kiszka
  2008-10-04 14:07 ` Gilles Chanteperdrix
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Jan Kiszka @ 2008-10-04 12:41 UTC (permalink / raw)
  To: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 3192 bytes --]

Hi,

I've been banging my head against this issue for several days now, first
trying to sort out an unrelated bug I came across along the way,
then trying to understand what happens, and finally getting mad about
why this may only happen with Xenomai:

One process, two threads, running under gdb control (no breakpoints,
just the automatically set ones that track thread creation/destruction).
It all happens even with only one CPU. The first thread (A) decides to
issue exit() exactly while the second one (B) is on its way from primary
to secondary mode because it hit a breakpoint (int3 -> xnpod_trap_fault
-> xnshadow_relax...). The group exit of thread A causes SIGKILL to be
set in thread B, but triggers no further action because B is already
awake and on its way to queue and handle the other signal (SIGTRAP). Now
when B comes to dequeue the next signal, it finds SIGTRAP and SIGKILL
set, but picks up SIGTRAP due to its lower number. ptrace then causes B
to stop, gdb gets confused, sends SIGSTOP to A, which is already a
zombie, and waits for A to confirm this stop - which never happens. If
someone is interested, I can provide an LTTng dump of this scenario.
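
The selection step above can be modelled in user space. This is only a
sketch under stated assumptions, not kernel code: lowest_pending_signal()
and pick_between_trap_and_kill() are made-up helper names, and the model
covers nothing but the lowest-number-first choice among the standard
signals 1..31.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/*
 * Illustrative model only: return the lowest-numbered standard signal
 * (1..31) contained in `pending`, or 0 if the set holds none of them.
 * The kernel's real dequeue path has more logic than this.
 */
int lowest_pending_signal(const sigset_t *pending)
{
	for (int sig = 1; sig <= 31; sig++) {
		if (sigismember(pending, sig) == 1)
			return sig;
	}
	return 0;
}

/* Mark SIGKILL and SIGTRAP as pending and report which one is picked. */
int pick_between_trap_and_kill(void)
{
	sigset_t pending;

	sigemptyset(&pending);
	sigaddset(&pending, SIGKILL);
	sigaddset(&pending, SIGTRAP);

	return lowest_pending_signal(&pending);
}
```

With both SIGTRAP (5 on Linux) and SIGKILL (9) in the set, the helper
returns SIGTRAP, which matches what B does in the scenario.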

My problem is now that I still don't understand what prevents this
deadlock on vanilla Linux. Does Xenomai create a thread schedule here
that is impossible there? Or does it only widen an otherwise very
small race window that also exists with mainline? Before making a fool
of myself on LKML, I would like to collect some further ideas on the
workaround or fix(?) below that cures this deadlock for me.

Thanks,
Jan

---
 kernel/signal.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

Index: b/kernel/signal.c
===================================================================
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1486,10 +1486,24 @@ static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+/*
+ * Return nonzero if there is a SIGKILL that should be waking us up.
+ * Called with the siglock held.
+ */
+static int sigkill_pending(struct task_struct *tsk)
+{
+	return ((sigismember(&tsk->pending.signal, SIGKILL) ||
+		 sigismember(&tsk->signal->shared_pending.signal, SIGKILL)) &&
+		!unlikely(sigismember(&tsk->blocked, SIGKILL)));
+}
+
 static inline int may_ptrace_stop(void)
 {
 	if (!likely(current->ptrace & PT_PTRACED))
 		return 0;
+
+	if (unlikely(sigkill_pending(current)))
+		return 0;
 	/*
 	 * Are we in the middle of do_coredump?
 	 * If so and our tracer is also part of the coredump stopping
@@ -1507,17 +1521,6 @@ static inline int may_ptrace_stop(void)
 }
 
 /*
- * Return nonzero if there is a SIGKILL that should be waking us up.
- * Called with the siglock held.
- */
-static int sigkill_pending(struct task_struct *tsk)
-{
-	return ((sigismember(&tsk->pending.signal, SIGKILL) ||
-		 sigismember(&tsk->signal->shared_pending.signal, SIGKILL)) &&
-		!unlikely(sigismember(&tsk->blocked, SIGKILL)));
-}
-
-/*
  * This must be called with current->sighand->siglock held.
  *
  * This should be the path for all ptrace stops.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 12:41 [Xenomai-core] gdb lockup on multi-threaded process exit Jan Kiszka
@ 2008-10-04 14:07 ` Gilles Chanteperdrix
  2008-10-04 14:36   ` Jan Kiszka
  2008-10-04 15:11 ` Gilles Chanteperdrix
  2008-10-06 12:04 ` Jan Kiszka
  2 siblings, 1 reply; 11+ messages in thread
From: Gilles Chanteperdrix @ 2008-10-04 14:07 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Hi,
> 
> I've been banging my head against this issue for several days now, first
> trying to sort out an unrelated bug I came across along the way,
> then trying to understand what happens, and finally getting mad about
> why this may only happen with Xenomai:

I have not tried to understand your problem (yet). But do you happen to
work with the latest TASK_ATOMICSWITCH changes? If yes, could you try to
revert them?

-- 
					    Gilles.


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 14:07 ` Gilles Chanteperdrix
@ 2008-10-04 14:36   ` Jan Kiszka
  2008-10-04 14:43     ` Gilles Chanteperdrix
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kiszka @ 2008-10-04 14:36 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 824 bytes --]

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Hi,
>>
>> I've been banging my head against this issue for several days now, first
>> trying to sort out an unrelated bug I came across along the way,
>> then trying to understand what happens, and finally getting mad about
>> why this may only happen with Xenomai:
> 
> I have not tried to understand your problem (yet). But do you happen to
> work with the latest TASK_ATOMICSWITCH changes? If yes, could you try to
> revert them?

Yes, I'm on latest trunk. During my debug endeavour, I haven't seen this
task state being involved. Also, if I got this correctly, it only
concerns the secondary->primary migration, and that one appears to be
out of scope here. Nevertheless, it's easy to verify if you tell me
which revision(s) I should revert.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 258 bytes --]

* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 14:36   ` Jan Kiszka
@ 2008-10-04 14:43     ` Gilles Chanteperdrix
  2008-10-04 17:25       ` Jan Kiszka
  0 siblings, 1 reply; 11+ messages in thread
From: Gilles Chanteperdrix @ 2008-10-04 14:43 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Hi,
>>>
>>> I've been banging my head against this issue for several days now, first
>>> trying to sort out an unrelated bug I came across along the way,
>>> then trying to understand what happens, and finally getting mad about
>>> why this may only happen with Xenomai:
>> I have not tried to understand your problem (yet). But do you happen to
>> work with the latest TASK_ATOMICSWITCH changes? If yes, could you try to
>> revert them?
> 
> Yes, I'm on latest trunk. During my debug endeavour, I haven't seen this
> task state being involved. Also, if I got this correctly, it only
> concerns the secondary->primary migration, and that one appears to be
> out of scope here. Nevertheless, it's easy to verify if you tell me
> which revision(s) I should revert.

Should be:
4bc557d998f7cfac0a069d0c47e28510ef270cb2
and:
acaccc82ead113f33f8f717fdafea14dba8b885d

-- 
					    Gilles.


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 14:43     ` Gilles Chanteperdrix
@ 2008-10-04 17:25       ` Jan Kiszka
  2008-10-04 17:56         ` Gilles Chanteperdrix
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kiszka @ 2008-10-04 17:25 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 1151 bytes --]

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Hi,
>>>>
>>>> I've been banging my head against this issue for several days now, first
>>>> trying to sort out an unrelated bug I came across along the way,
>>>> then trying to understand what happens, and finally getting mad about
>>>> why this may only happen with Xenomai:
>>> I have not tried to understand your problem (yet). But do you happen to
>>> work with the latest TASK_ATOMICSWITCH changes? If yes, could you try to
>>> revert them?
>> Yes, I'm on latest trunk. During my debug endeavour, I haven't seen this
>> task state being involved. Also, if I got this correctly, it only
>> concerns the secondary->primary migration, and that one appears to be
>> out of scope here. Nevertheless, it's easy to verify if you tell me
>> which revision(s) I should revert.
> 
> Should be:
> 4bc557d998f7cfac0a069d0c47e28510ef270cb2
> and:
> acaccc82ead113f33f8f717fdafea14dba8b885d
> 

OK, I simply went back to 2.0-09 (I think I was there originally when
the bug showed up), but the effect remains the same.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 17:25       ` Jan Kiszka
@ 2008-10-04 17:56         ` Gilles Chanteperdrix
  0 siblings, 0 replies; 11+ messages in thread
From: Gilles Chanteperdrix @ 2008-10-04 17:56 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Hi,
>>>>>
>>>>> I've been banging my head against this issue for several days now, first
>>>>> trying to sort out an unrelated bug I came across along the way,
>>>>> then trying to understand what happens, and finally getting mad about
>>>>> why this may only happen with Xenomai:
>>>> I have not tried to understand your problem (yet). But do you happen to
>>>> work with the latest TASK_ATOMICSWITCH changes? If yes, could you try to
>>>> revert them?
>>> Yes, I'm on latest trunk. During my debug endeavour, I haven't seen this
>>> task state being involved. Also, if I got this correctly, it only
>>> concerns the secondary->primary migration, and that one appears to be
>>> out of scope here. Nevertheless, it's easy to verify if you tell me
>>> which revision(s) I should revert.
>> Should be:
>> 4bc557d998f7cfac0a069d0c47e28510ef270cb2
>> and:
>> acaccc82ead113f33f8f717fdafea14dba8b885d
>>
> 
> OK, I simply went back to 2.0-09 (I think I was there originally when
> the bug showed up), but the effect remains the same.

I just checked in the meantime: trunk does not use the new version of
TASK_ATOMICSWITCH, so it cannot be the issue.

-- 
					    Gilles.


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 12:41 [Xenomai-core] gdb lockup on multi-threaded process exit Jan Kiszka
  2008-10-04 14:07 ` Gilles Chanteperdrix
@ 2008-10-04 15:11 ` Gilles Chanteperdrix
  2008-10-04 16:55   ` Gilles Chanteperdrix
  2008-10-06 12:04 ` Jan Kiszka
  2 siblings, 1 reply; 11+ messages in thread
From: Gilles Chanteperdrix @ 2008-10-04 15:11 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> +static int sigkill_pending(struct task_struct *tsk)
> +{
> +	return ((sigismember(&tsk->pending.signal, SIGKILL) ||
> +		 sigismember(&tsk->signal->shared_pending.signal, SIGKILL)) &&
> +		!unlikely(sigismember(&tsk->blocked, SIGKILL)));
> +}
> +

POSIX says SIGKILL cannot be blocked.

-- 
					    Gilles.


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 15:11 ` Gilles Chanteperdrix
@ 2008-10-04 16:55   ` Gilles Chanteperdrix
  0 siblings, 0 replies; 11+ messages in thread
From: Gilles Chanteperdrix @ 2008-10-04 16:55 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> +static int sigkill_pending(struct task_struct *tsk)
>> +{
>> +	return ((sigismember(&tsk->pending.signal, SIGKILL) ||
>> +		 sigismember(&tsk->signal->shared_pending.signal, SIGKILL)) &&
>> +		!unlikely(sigismember(&tsk->blocked, SIGKILL)));
>> +}
>> +
> 
> POSIX says SIGKILL cannot be blocked.

OK. This function already exists in Linux. Sorry for the noise.

-- 
					    Gilles.


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-04 12:41 [Xenomai-core] gdb lockup on multi-threaded process exit Jan Kiszka
  2008-10-04 14:07 ` Gilles Chanteperdrix
  2008-10-04 15:11 ` Gilles Chanteperdrix
@ 2008-10-06 12:04 ` Jan Kiszka
  2008-10-06 12:31   ` Gilles Chanteperdrix
  2 siblings, 1 reply; 11+ messages in thread
From: Jan Kiszka @ 2008-10-06 12:04 UTC (permalink / raw)
  To: xenomai-core; +Cc: Gilles Chanteperdrix

Jan Kiszka wrote:
> Hi,
> 
> I've been banging my head against this issue for several days now, first
> trying to sort out an unrelated bug I came across along the way,
> then trying to understand what happens, and finally getting mad about
> why this may only happen with Xenomai:
> 
> One process, two threads, running under gdb control (no breakpoints,
> just the automatically set ones that track thread creation/destruction).
> It all happens even with only one CPU. The first thread (A) decides to
> issue exit() exactly while the second one (B) is on its way from primary
> to secondary mode because it hit a breakpoint (int3 -> xnpod_trap_fault
> -> xnshadow_relax...). The group exit of thread A causes SIGKILL to be
> set in thread B, but triggers no further action because B is already
> awake and on its way to queue and handle the other signal (SIGTRAP). Now
> when B comes to dequeue the next signal, it finds SIGTRAP and SIGKILL
> set, but picks up SIGTRAP due to its lower number. ptrace then causes B
> to stop, gdb gets confused, sends SIGSTOP to A, which is already a
> zombie, and waits for A to confirm this stop - which never happens. If
> someone is interested, I can provide an LTTng dump of this scenario.
> 
> My problem is now that I still don't understand what prevents this
> deadlock on vanilla Linux. Does Xenomai create a thread schedule here
> that is impossible there? Or does it only widen an otherwise very
> small race window that also exists with mainline? Before making a fool
> of myself on LKML, I would like to collect some further ideas on the
> workaround or fix(?) below that cures this deadlock for me.

After reading this comment

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3d749b9e676b26584a47e75c235aa6f69d0697ae

I'm now about to escalate the issue to LKML. This really looks like a
mainline bug, probably just triggered more quickly by the large latency
between signal queuing and receiver scheduling that the
primary->secondary mode switch introduces.

Jan

PS: Gilles, Oleg's patch actually removed the SIGKILL-blocked check in
2.6.27.

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-06 12:04 ` Jan Kiszka
@ 2008-10-06 12:31   ` Gilles Chanteperdrix
  2008-10-06 12:40     ` Jan Kiszka
  0 siblings, 1 reply; 11+ messages in thread
From: Gilles Chanteperdrix @ 2008-10-06 12:31 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Hi,
>>
>> I've been banging my head against this issue for several days now, first
>> trying to sort out an unrelated bug I came across along the way,
>> then trying to understand what happens, and finally getting mad about
>> why this may only happen with Xenomai:
>>
>> One process, two threads, running under gdb control (no breakpoints,
>> just the automatically set ones that track thread creation/destruction).
>> It all happens even with only one CPU. The first thread (A) decides to
>> issue exit() exactly while the second one (B) is on its way from primary
>> to secondary mode because it hit a breakpoint (int3 -> xnpod_trap_fault
>> -> xnshadow_relax...). The group exit of thread A causes SIGKILL to be
>> set in thread B, but triggers no further action because B is already
>> awake and on its way to queue and handle the other signal (SIGTRAP). Now
>> when B comes to dequeue the next signal, it finds SIGTRAP and SIGKILL
>> set, but picks up SIGTRAP due to its lower number. ptrace then causes B
>> to stop, gdb gets confused, sends SIGSTOP to A, which is already a
>> zombie, and waits for A to confirm this stop - which never happens. If
>> someone is interested, I can provide an LTTng dump of this scenario.
>>
>> My problem is now that I still don't understand what prevents this
>> deadlock on vanilla Linux. Does Xenomai create a thread schedule here
>> that is impossible there? Or does it only widen an otherwise very
>> small race window that also exists with mainline? Before making a fool
>> of myself on LKML, I would like to collect some further ideas on the
>> workaround or fix(?) below that cures this deadlock for me.
> 
> After reading this comment
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3d749b9e676b26584a47e75c235aa6f69d0697ae
> 
> I'm now about to escalate the issue to LKML. This really looks like a
> mainline bug, probably just triggered more quickly by the large latency
> between signal queuing and receiver scheduling that the
> primary->secondary mode switch introduces.

That said, I think gdb is buggy too: the kill function probably returns
some error which says that the thread no longer exists, which gdb
probably ignores since it awaits a signal from that killed thread.

-- 
                                                 Gilles.


* Re: [Xenomai-core] gdb lockup on multi-threaded process exit
  2008-10-06 12:31   ` Gilles Chanteperdrix
@ 2008-10-06 12:40     ` Jan Kiszka
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kiszka @ 2008-10-06 12:40 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Hi,
>>>
>>> I've been banging my head against this issue for several days now, first
>>> trying to sort out an unrelated bug I came across along the way,
>>> then trying to understand what happens, and finally getting mad about
>>> why this may only happen with Xenomai:
>>>
>>> One process, two threads, running under gdb control (no breakpoints,
>>> just the automatically set ones that track thread creation/destruction).
>>> It all happens even with only one CPU. The first thread (A) decides to
>>> issue exit() exactly while the second one (B) is on its way from primary
>>> to secondary mode because it hit a breakpoint (int3 -> xnpod_trap_fault
>>> -> xnshadow_relax...). The group exit of thread A causes SIGKILL to be
>>> set in thread B, but triggers no further action because B is already
>>> awake and on its way to queue and handle the other signal (SIGTRAP). Now
>>> when B comes to dequeue the next signal, it finds SIGTRAP and SIGKILL
>>> set, but picks up SIGTRAP due to its lower number. ptrace then causes B
>>> to stop, gdb gets confused, sends SIGSTOP to A, which is already a
>>> zombie, and waits for A to confirm this stop - which never happens. If
>>> someone is interested, I can provide an LTTng dump of this scenario.
>>>
>>> My problem is now that I still don't understand what prevents this
>>> deadlock on vanilla Linux. Does Xenomai create a thread schedule here
>>> that is impossible there? Or does it only widen an otherwise very
>>> small race window that also exists with mainline? Before making a fool
>>> of myself on LKML, I would like to collect some further ideas on the
>>> workaround or fix(?) below that cures this deadlock for me.
>> After reading this comment
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3d749b9e676b26584a47e75c235aa6f69d0697ae
>>
>> I'm now about to escalate the issue to LKML. This really looks like a
>> mainline bug, probably just triggered more quickly by the large latency
>> between signal queuing and receiver scheduling that the
>> primary->secondary mode switch introduces.
> 
> That said, I think gdb is buggy too: the kill function probably returns
> some error which says that the thread no longer exists, which gdb
> probably ignores since it awaits a signal from that killed thread.

According to my traces, there is no error returned. However, gdb /may/
see that the group leader, which issued the sys_exit_group, is now in
TASK_DEAD state - before trying to block on it, becoming TASK_TRACED again.
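
That signalling a zombie reports no error can be cross-checked from user
space at process level. A minimal sketch, assuming a plain fork()ed child
stands in for the exited group leader (gdb signals individual threads,
which this does not model); probe_zombie() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits at once, wait until it is a zombie without
 * reaping it, and probe it with kill(pid, 0). Probe again after reaping.
 * Each result is 0 if kill() succeeded, otherwise the errno it set.
 * Returns 0 on success, -1 if the setup itself failed.
 */
int probe_zombie(int *zombie_result, int *reaped_result)
{
	siginfo_t info;
	pid_t pid;

	/* Make sure exited children stay around as zombies. */
	signal(SIGCHLD, SIG_DFL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(0);

	/* WNOWAIT: wait for the exit but leave the zombie in place. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
		return -1;
	*zombie_result = (kill(pid, 0) == 0) ? 0 : errno;

	/* Now reap the child; its PID is gone afterwards. */
	if (waitpid(pid, NULL, 0) != pid)
		return -1;
	*reaped_result = (kill(pid, 0) == 0) ? 0 : errno;

	return 0;
}
```

On Linux the first probe succeeds because a zombie still owns its PID;
only after the child has been reaped does kill() fail with ESRCH.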

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Thread overview: 11+ messages
2008-10-04 12:41 [Xenomai-core] gdb lockup on multi-threaded process exit Jan Kiszka
2008-10-04 14:07 ` Gilles Chanteperdrix
2008-10-04 14:36   ` Jan Kiszka
2008-10-04 14:43     ` Gilles Chanteperdrix
2008-10-04 17:25       ` Jan Kiszka
2008-10-04 17:56         ` Gilles Chanteperdrix
2008-10-04 15:11 ` Gilles Chanteperdrix
2008-10-04 16:55   ` Gilles Chanteperdrix
2008-10-06 12:04 ` Jan Kiszka
2008-10-06 12:31   ` Gilles Chanteperdrix
2008-10-06 12:40     ` Jan Kiszka