* do_syslog/__down_trylock lockup in current BK
@ 2002-09-09 18:53 Ingo Molnar
  2002-09-09 19:01 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 18:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds


i can reproduce the following lockup in BK-current, SMP:

 >>EIP; c0106c57 <__down_trylock+a7/b4>   <=====
 Trace; c0120316 <do_syslog+16/6a0>
 Trace; c01088c8 <show_registers+198/1c0>
 Trace; c011ef16 <do_fork+436/b20>
 Trace; c011f41a <do_fork+93a/b20>
 Trace; c0125755 <release_resource+15/50>
 Trace; c0126686 <proc_doutsstring+76/c0>
 Trace; c0127c54 <__constant_copy_from_user+24/98>
 Trace; c0107cf3 <handle_signal+113/1a0>
 Trace; c012d572 <sys_setreuid+22/170>
 Trace; c0107fce <do_signal+24e/3c0>

it could be related to the signal changes - but at first sight it looks
like some sort of printk related lockup.

	Ingo



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 18:53 do_syslog/__down_trylock lockup in current BK Ingo Molnar
@ 2002-09-09 19:01 ` Linus Torvalds
  2002-09-09 19:08   ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2002-09-09 19:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel


On Mon, 9 Sep 2002, Ingo Molnar wrote:
> 
> i can reproduce the following lockup in BK-current, SMP:

Hmm. The only potential lockup source I could see is that semaphore
waitqueue spinlock. 

>  >>EIP; c0106c57 <__down_trylock+a7/b4>   <=====

Are you sure your System.map is correct? __down_trylock() should _not_ be 
that big - it's just 67 bytes for me (and apparently almost three times 
the size for you). Spinlock debugging or something?

		Linus
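
For readers unfamiliar with the trace notation: an annotation such as <__down_trylock+a7/b4> gives the offset into the function and the function's total size, both in hex. 0xb4 is 180 bytes, about 2.7 times the 67 bytes Linus expects, hence "almost three times". The helper below is a minimal sketch of decoding such an annotation; parse_annotation() is a hypothetical name for illustration and is not part of ksymoops or the kernel.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): split an annotation of the
 * form "<symbol+offset/size>" into its parts. Offset and size are hex.
 * `name` must have room for 64 bytes.
 * Returns 0 on success, -1 if the string does not match the format.
 */
static int parse_annotation(const char *s, char *name,
			    unsigned int *offset, unsigned int *size)
{
	assert(s && name && offset && size);

	/* %63[^+] reads the symbol name up to (not including) the '+'. */
	if (sscanf(s, "<%63[^+]+%x/%x>", name, offset, size) != 3)
		return -1;
	return 0;
}
```

Feeding it "<__down_trylock+a7/b4>" yields offset 0xa7 and size 0xb4 (180 decimal), which can then be compared against the size the symbol has in a known-good build.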



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 19:01 ` Linus Torvalds
@ 2002-09-09 19:08   ` Ingo Molnar
  2002-09-09 19:18     ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


On Mon, 9 Sep 2002, Linus Torvalds wrote:

> >  >>EIP; c0106c57 <__down_trylock+a7/b4>   <=====
> 
> Are you sure your System.map is correct? __down_trylock() should _not_ be 
> that big - it's just 67 bytes for me (and apparently almost three times 
> the size for you). Spinlock debugging or something?

i have spinlock debugging enabled - but indeed the trace could be
incorrect, let me re-check it.

doh, wrong script called ... the right backtrace is:

 >>EIP; c0106c57 <__write_lock_failed+7/20>   <=====
 Trace; c0120316 <.text.lock.exit+119/133>
 Trace; c01088c8 <common_interrupt+18/20>
 Trace; c011ef16 <exit_notify+16/230>
 Trace; c011f41a <do_exit+2ea/370>
 Trace; c0125755 <schedule_timeout+b5/c0>
 Trace; c0126686 <sig_exit+36/40>
 Trace; c0127c54 <get_signal_to_deliver+2d4/360>
 Trace; c0107cf3 <do_signal+c3/100>
 Trace; c012d572 <sys_futex+182/1b0>
 Trace; c0107fce <work_notifysig+13/15>

(we definitely need kksymoops in the 2.5 kernel - it's just *so* much
easier to debug various crashes with kksymoops enabled - especially when
debugging over a serial line.)

and this lockup definitely looks related to the signal changes.

	Ingo



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 19:08   ` Ingo Molnar
@ 2002-09-09 19:18     ` Ingo Molnar
  2002-09-09 20:02       ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 19:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


it's locking up here:

416             struct task_struct *p, *reaper = father;
417             struct list_head *_p;
418
419             write_lock_irq(&tasklist_lock);
420
421             if (father->exit_signal != -1)
422                     reaper = prev_thread(reaper);

(unfortunately i don't know what happened on the other CPU.)

	Ingo



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 19:18     ` Ingo Molnar
@ 2002-09-09 20:02       ` Ingo Molnar
  2002-09-09 20:15         ` Daniel Jacobowitz
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 20:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


the lockup, on both CPUs:

>>EIP; c01200b1 <zap_thread+21/16d>   <=====
Trace; c0120092 <zap_thread+2/16d>
Trace; c011f00c <exit_notify+10c/230>
Trace; c011f41a <do_exit+2ea/370>
Trace; c0125755 <schedule_timeout+b5/c0>
Trace; c0126686 <sig_exit+36/40>
Trace; c0127c54 <get_signal_to_deliver+2d4/360>
Trace; c0107cf3 <do_signal+c3/100>
Trace; c012d572 <sys_futex+182/1b0>
Trace; c0107fce <work_notifysig+13/15>

>>EIP; c0106c5f <__write_lock_failed+f/20>   <=====
Trace; c0120316 <.text.lock.exit+119/133>
Trace; c011ef16 <exit_notify+16/230>
Trace; c011f41a <do_exit+2ea/370>
Trace; c0125755 <schedule_timeout+b5/c0>
Trace; c0126686 <sig_exit+36/40>
Trace; c0127c54 <get_signal_to_deliver+2d4/360>
Trace; c0107cf3 <do_signal+c3/100>
Trace; c012d572 <sys_futex+182/1b0>
Trace; c0107fce <work_notifysig+13/15>

ie. either the 'zap_again' produces an infinite loop, or one of the lists 
became corrupted. I suspect the former.

	Ingo



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:02       ` Ingo Molnar
@ 2002-09-09 20:15         ` Daniel Jacobowitz
  2002-09-09 20:25           ` Ingo Molnar
                             ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Daniel Jacobowitz @ 2002-09-09 20:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

On Mon, Sep 09, 2002 at 10:02:11PM +0200, Ingo Molnar wrote:
> 
> the lockup, on both CPUs:
> 
> >>EIP; c01200b1 <zap_thread+21/16d>   <=====
> Trace; c0120092 <zap_thread+2/16d>
> Trace; c011f00c <exit_notify+10c/230>
> Trace; c011f41a <do_exit+2ea/370>
> Trace; c0125755 <schedule_timeout+b5/c0>
> Trace; c0126686 <sig_exit+36/40>
> Trace; c0127c54 <get_signal_to_deliver+2d4/360>
> Trace; c0107cf3 <do_signal+c3/100>
> Trace; c012d572 <sys_futex+182/1b0>
> Trace; c0107fce <work_notifysig+13/15>
> 
> >>EIP; c0106c5f <__write_lock_failed+f/20>   <=====
> Trace; c0120316 <.text.lock.exit+119/133>
> Trace; c011ef16 <exit_notify+16/230>
> Trace; c011f41a <do_exit+2ea/370>
> Trace; c0125755 <schedule_timeout+b5/c0>
> Trace; c0126686 <sig_exit+36/40>
> Trace; c0127c54 <get_signal_to_deliver+2d4/360>
> Trace; c0107cf3 <do_signal+c3/100>
> Trace; c012d572 <sys_futex+182/1b0>
> Trace; c0107fce <work_notifysig+13/15>
> 
> ie. either the 'zap_again' produces an infinite loop, or one of the lists 
> became corrupted. I suspect the former.

How do you reproduce this?  It's probably my fault, if it's stuck in
zap_thread... but that's a pretty suspicious looking trace to me, if it
goes from schedule_timeout to do_exit.
-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:15         ` Daniel Jacobowitz
@ 2002-09-09 20:25           ` Ingo Molnar
  2002-09-09 20:33           ` Ingo Molnar
  2002-09-09 20:43           ` Ingo Molnar
  2 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 20:25 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Linus Torvalds, linux-kernel


On Mon, 9 Sep 2002, Daniel Jacobowitz wrote:

> How do you reproduce this?  It's probably my fault, if it's stuck in
> zap_thread... but that's a pretty suspicious looking trace to me, if it
> goes from schedule_timeout to do_exit.

(i think i found it - patch in a few minutes.)

	Ingo



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:15         ` Daniel Jacobowitz
  2002-09-09 20:25           ` Ingo Molnar
@ 2002-09-09 20:33           ` Ingo Molnar
  2002-09-09 20:40             ` Daniel Jacobowitz
  2002-09-09 20:40             ` Linus Torvalds
  2002-09-09 20:43           ` Ingo Molnar
  2 siblings, 2 replies; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 20:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Jacobowitz, linux-kernel


the attached patch fixes one bug in the way we did zap_thread() - but this
alone does not fix the lockup.

the bug was that list_for_each_safe() is not 'safe enough' - zap_thread()  
drops the tasklist lock at which point anything might happen to the child
list.

the lockup is likely in the while loop - ie. zap_thread() not actually
reparenting a thread and thus causing an infinite loop - is that possible?

	Ingo

--- linux/kernel/exit.c.orig	Mon Sep  9 21:59:24 2002
+++ linux/kernel/exit.c	Mon Sep  9 22:24:41 2002
@@ -493,7 +493,6 @@
 static void exit_notify(void)
 {
 	struct task_struct *t;
-	struct list_head *_p, *_n;
 
 	forget_original_parent(current);
 	/*
@@ -554,17 +553,16 @@
 		do_notify_parent(current, current->exit_signal);
 
 zap_again:
-	list_for_each_safe(_p, _n, &current->children)
-		zap_thread(list_entry(_p,struct task_struct,sibling), current, 0);
-	list_for_each_safe(_p, _n, &current->ptrace_children)
-		zap_thread(list_entry(_p,struct task_struct,ptrace_list), current, 1);
+	while (!list_empty(&current->children))
+		zap_thread(list_entry(current->children.next,struct task_struct,sibling), current, 0);
+	while (!list_empty(&current->ptrace_children))
+		zap_thread(list_entry(current->ptrace_children.next,struct task_struct,sibling), current, 0);
 	/*
 	 * zap_thread might drop the tasklist lock, thus we could
 	 * have new children queued back from the ptrace list into the
 	 * child list:
 	 */
-	if (unlikely(!list_empty(&current->children) ||
-			!list_empty(&current->ptrace_children)))
+	if (unlikely(!list_empty(&current->children)))
 		goto zap_again;
 	/*
 	 * No need to unlock IRQs, we'll schedule() immediately
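

The point about list_for_each_safe() can be illustrated outside the kernel. The sketch below is a userspace approximation under stated assumptions: struct node, zap() and the helper names are simplified stand-ins, not the kernel's list_head API or the real zap_thread(). zap() unlinks its argument and also unlinks the successor, modelling another CPU changing the list while the lock is dropped. A walk that caches the next pointer then lands on an unlinked node, whereas re-reading head->next each round (the while loop in the patch above) does not.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified circular doubly linked list; `linked` tracks membership. */
struct node { struct node *next, *prev; int linked; };

static void list_init(struct node *h) { h->next = h->prev = h; h->linked = 0; }
static int list_is_empty(const struct node *h) { return h->next == h; }

static void list_add_tail(struct node *n, struct node *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
	n->linked = 1;
}

static void list_del_init(struct node *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	n->next = n->prev = n;
	n->linked = 0;
}

/* Stand-in for zap_thread(): unlink `n`, and model the unlocked window
 * by unlinking its successor too (as another CPU might). */
static void zap(struct node *n, struct node *head)
{
	struct node *succ = n->next;

	assert(n->linked);
	list_del_init(n);
	if (succ != head)
		list_del_init(succ);
}

/* Walk in the style of list_for_each_safe(): the next pointer is cached
 * before the body runs. Returns 1 if the walk reached an unlinked node. */
static int walk_cached_next(struct node *head)
{
	struct node *pos = head->next, *n = pos->next;

	while (pos != head) {
		if (!pos->linked)
			return 1;	/* cached pointer went stale */
		zap(pos, head);
		pos = n;
		n = pos->next;
	}
	return 0;
}

/* Drain in the style of the patch: always re-read head->next. */
static int drain(struct node *head)
{
	while (!list_is_empty(head)) {
		struct node *first = head->next;

		if (!first->linked)
			return 1;
		zap(first, head);
	}
	return 0;
}

/* Build head -> n[0] -> n[1] -> n[2]. */
static void build3(struct node *head, struct node n[3])
{
	int i;

	list_init(head);
	for (i = 0; i < 3; i++)
		list_add_tail(&n[i], head);
}
```

With three entries, the cached-next walk reaches the second node after it has already been unlinked and leaves the third on the list; the drain loop empties the list without touching an unlinked node.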



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:33           ` Ingo Molnar
@ 2002-09-09 20:40             ` Daniel Jacobowitz
  2002-09-09 20:58               ` Ingo Molnar
  2002-09-09 20:40             ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Jacobowitz @ 2002-09-09 20:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

On Mon, Sep 09, 2002 at 10:33:17PM +0200, Ingo Molnar wrote:
> 
> the attached patch fixes one bug in the way we did zap_thread() - but this
> alone does not fix the lockup.
> 
> the bug was that list_for_each_safe() is not 'safe enough' - zap_thread()  
> drops the tasklist lock at which point anything might happen to the child
> list.
> 
> the lockup is likely in the while loop - ie. zap_thread() not actually
> reparenting a thread and thus causing an infinite loop - is that possible?

Well, it shouldn't be.  forget_original_parent should update
real_parent for every child on either list, and then zap_thread unlinks
each child from the current parent and links it to the new real_parent.
A couple of printks in there should be able to work out if I'm wrong,
though...

> @@ -554,17 +553,16 @@
>  		do_notify_parent(current, current->exit_signal);
>  
>  zap_again:
> -	list_for_each_safe(_p, _n, &current->children)
> -		zap_thread(list_entry(_p,struct task_struct,sibling), current, 0);
> -	list_for_each_safe(_p, _n, &current->ptrace_children)
> -		zap_thread(list_entry(_p,struct task_struct,ptrace_list), current, 1);
> +	while (!list_empty(&current->children))
> +		zap_thread(list_entry(current->children.next,struct task_struct,sibling), current, 0);
> +	while (!list_empty(&current->ptrace_children))
> +		zap_thread(list_entry(current->ptrace_children.next,struct task_struct,sibling), current, 0);

As Linus points out, typo right there on the last argument.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:33           ` Ingo Molnar
  2002-09-09 20:40             ` Daniel Jacobowitz
@ 2002-09-09 20:40             ` Linus Torvalds
  2002-09-09 20:45               ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2002-09-09 20:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Daniel Jacobowitz, linux-kernel


On Mon, 9 Sep 2002, Ingo Molnar wrote:
>
> the lockup is likely in the while loop - ie. zap_thread() not actually
> reparenting a thread and thus causing an infinite loop - is that possible?

Hmm.. This patch changes the last argument of zap_thread() from a "1" to a
"0" for the ptrace_children list. Was that intentional or a cut-and-paste 
error? If it was intentional, please remove the argument altogether, since 
it's now always 0.

		Linus



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:15         ` Daniel Jacobowitz
  2002-09-09 20:25           ` Ingo Molnar
  2002-09-09 20:33           ` Ingo Molnar
@ 2002-09-09 20:43           ` Ingo Molnar
  2002-09-09 20:50             ` Daniel Jacobowitz
  2 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 20:43 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Linus Torvalds, linux-kernel


the following assert triggers and catches the lockup:

--- linux/kernel/exit.c.orig	Mon Sep  9 21:59:24 2002
+++ linux/kernel/exit.c	Mon Sep  9 22:38:44 2002
@@ -461,6 +461,8 @@
 		ptrace_unlink (p);
 
 		list_del_init(&p->sibling);
+		if (p->parent == father && p->parent == p->real_parent)
+			BUG();
 		p->parent = p->real_parent;
 		list_add_tail(&p->sibling, &p->parent->children);
 	}

so somehow we can end up having parent == real_parent?

	Ingo
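
Why parent == real_parent == father would explain the hang can be shown with a small userspace model. This is a sketch under assumptions: struct task, reparent() and drain_children() are simplified stand-ins written for illustration, not the kernel's task_struct or zap_thread(). reparent() mirrors the three statements around the assert (list_del_init, p->parent = p->real_parent, list_add_tail). If real_parent still points at the exiting father, the child is put straight back on the list being drained, so a loop that waits for that list to empty never finishes.

```c
#include <assert.h>
#include <stddef.h>

struct list { struct list *next, *prev; };

struct task {
	struct list children;	/* head of this task's child list */
	struct list sibling;	/* link in the parent's child list */
	struct task *parent, *real_parent;
};

static void list_init(struct list *h) { h->next = h->prev = h; }
static int list_is_empty(const struct list *h) { return h->next == h; }

static void list_add_tail(struct list *n, struct list *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}

static void list_del_init(struct list *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	list_init(n);
}

/* Recover the task from its sibling link (like list_entry()). */
#define task_of(ptr) \
	((struct task *)((char *)(ptr) - offsetof(struct task, sibling)))

static void task_init(struct task *t, struct task *parent)
{
	list_init(&t->children);
	list_init(&t->sibling);
	t->parent = t->real_parent = parent;
	if (parent)
		list_add_tail(&t->sibling, &parent->children);
}

/* Same three steps as the code around the assert in the patch. */
static void reparent(struct task *p)
{
	list_del_init(&p->sibling);
	p->parent = p->real_parent;
	list_add_tail(&p->sibling, &p->parent->children);
}

/* Drain father's child list, giving up after `limit` rounds.
 * Returns the rounds used, or -1 if the list never became empty. */
static int drain_children(struct task *father, int limit)
{
	int rounds = 0;

	while (!list_is_empty(&father->children)) {
		if (rounds++ == limit)
			return -1;	/* would spin forever in the kernel */
		reparent(task_of(father->children.next));
	}
	return rounds;
}
```

In this model, a child whose real_parent is still the father makes drain_children() hit its limit, while pointing real_parent at another task lets the loop finish in a single round.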



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:40             ` Linus Torvalds
@ 2002-09-09 20:45               ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 20:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Jacobowitz, linux-kernel


On Mon, 9 Sep 2002, Linus Torvalds wrote:

> > the lockup is likely in the while loop - ie. zap_thread() not actually
> > reparenting a thread and thus causing an infinite loop - is that possible?
> 
> Hmm.. This patch changes the last argument of zap_thread() from a "1" to
> a "0" for the ptrace_children list. Was that intentional or a
> cut-and-paste error? [...]

cut-and-paste error. New patch attached. (with the assert added as well)

	Ingo

--- linux/kernel/exit.c.orig	Mon Sep  9 21:59:24 2002
+++ linux/kernel/exit.c	Mon Sep  9 22:38:44 2002
@@ -461,6 +461,8 @@
 		ptrace_unlink (p);
 
 		list_del_init(&p->sibling);
+		if (p->parent == father && p->parent == p->real_parent)
+			BUG();
 		p->parent = p->real_parent;
 		list_add_tail(&p->sibling, &p->parent->children);
 	}
@@ -493,7 +495,6 @@
 static void exit_notify(void)
 {
 	struct task_struct *t;
-	struct list_head *_p, *_n;
 
 	forget_original_parent(current);
 	/*
@@ -554,17 +555,16 @@
 		do_notify_parent(current, current->exit_signal);
 
 zap_again:
-	list_for_each_safe(_p, _n, &current->children)
-		zap_thread(list_entry(_p,struct task_struct,sibling), current, 0);
-	list_for_each_safe(_p, _n, &current->ptrace_children)
-		zap_thread(list_entry(_p,struct task_struct,ptrace_list), current, 1);
+	while (!list_empty(&current->children))
+		zap_thread(list_entry(current->children.next,struct task_struct,sibling), current, 0);
+	while (!list_empty(&current->ptrace_children))
+		zap_thread(list_entry(current->ptrace_children.next,struct task_struct,sibling), current, 1);
 	/*
 	 * zap_thread might drop the tasklist lock, thus we could
 	 * have new children queued back from the ptrace list into the
 	 * child list:
 	 */
-	if (unlikely(!list_empty(&current->children) ||
-			!list_empty(&current->ptrace_children)))
+	if (unlikely(!list_empty(&current->children)))
 		goto zap_again;
 	/*
 	 * No need to unlock IRQs, we'll schedule() immediately



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:43           ` Ingo Molnar
@ 2002-09-09 20:50             ` Daniel Jacobowitz
  2002-09-09 21:00               ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Jacobowitz @ 2002-09-09 20:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

On Mon, Sep 09, 2002 at 10:43:51PM +0200, Ingo Molnar wrote:
> 
> the following assert triggers and catches the lockup:
> 
> --- linux/kernel/exit.c.orig	Mon Sep  9 21:59:24 2002
> +++ linux/kernel/exit.c	Mon Sep  9 22:38:44 2002
> @@ -461,6 +461,8 @@
>  		ptrace_unlink (p);
>  
>  		list_del_init(&p->sibling);
> +		if (p->parent == father && p->parent == p->real_parent)
> +			BUG();
>  		p->parent = p->real_parent;
>  		list_add_tail(&p->sibling, &p->parent->children);
>  	}
> 
> so somehow we can end up having parent == real_parent?

When is this happening?  It's not necessarily a bug.  If the process
was traced, then __ptrace_unlink will set p->parent = p->real_parent
when it unlinks.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:40             ` Daniel Jacobowitz
@ 2002-09-09 20:58               ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 20:58 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Linus Torvalds, linux-kernel


some more info about the state it was in:

p (ld-linux.so.2, 4364/4357), father: (ld-linux.so.2, 4363/4357)
kernel BUG at exit.c:470!

	Ingo



* Re: do_syslog/__down_trylock lockup in current BK
  2002-09-09 20:50             ` Daniel Jacobowitz
@ 2002-09-09 21:00               ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2002-09-09 21:00 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Linus Torvalds, linux-kernel


On Mon, 9 Sep 2002, Daniel Jacobowitz wrote:

> When is this happening?  It's not necessarily a bug.  If the process was
> traced, then __ptrace_unlink will set p->parent = p->real_parent when it
> unlinks.

it's not traced. And if you look at the patch i've put the assert into the
!traced branch ...

	Ingo



Thread overview: 15 messages
2002-09-09 18:53 do_syslog/__down_trylock lockup in current BK Ingo Molnar
2002-09-09 19:01 ` Linus Torvalds
2002-09-09 19:08   ` Ingo Molnar
2002-09-09 19:18     ` Ingo Molnar
2002-09-09 20:02       ` Ingo Molnar
2002-09-09 20:15         ` Daniel Jacobowitz
2002-09-09 20:25           ` Ingo Molnar
2002-09-09 20:33           ` Ingo Molnar
2002-09-09 20:40             ` Daniel Jacobowitz
2002-09-09 20:58               ` Ingo Molnar
2002-09-09 20:40             ` Linus Torvalds
2002-09-09 20:45               ` Ingo Molnar
2002-09-09 20:43           ` Ingo Molnar
2002-09-09 20:50             ` Daniel Jacobowitz
2002-09-09 21:00               ` Ingo Molnar
