Fix for sparc64 cpu hangs.

* Fix for sparc64 cpu hangs.
@ 2007-11-07  4:34 ` David Miller
  0 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-07  4:34 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, sparclinux, linux-arch, bernd, joy, fabbione, arnd

[ Bernd, Josip, and Fabio, I think I finally nailed this
  cpu hang bug we were all seeing on sparc64.  ]

[FUTEX]: Fix address computation in compat code.

compat_exit_robust_list() computes a pointer to the
futex entry in userspace as follows:

	(void __user *)entry + futex_offset

'entry' is a 'struct robust_list __user *', and
'futex_offset' is a 'compat_long_t' (typically a 's32').

Things explode if the 32-bit sign bit is set in futex_offset.

Type promotion sign extends futex_offset to a 64-bit value before
adding it to 'entry'.
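
To make the promotion concrete, here is a minimal userspace sketch, not
kernel code. The names model_buggy_uaddr() and model_upper_bits_set() are
hypothetical, and plain integers stand in for the __user pointers. The
inputs mentioned below are likewise made up, chosen only so that the sum
reproduces the faulting address seen with this bug.

```c
#include <stdint.h>

/*
 * Userspace model of the buggy expression
 * "(void __user *)entry + futex_offset" on a 64-bit kernel.
 * The s32 offset is widened to 64 bits for the pointer addition,
 * which sign-extends it.
 */
static uint64_t model_buggy_uaddr(uint64_t entry, int32_t futex_offset)
{
	int64_t promoted = futex_offset;	/* sign extension happens here */

	return entry + (uint64_t)promoted;
}

/* Nonzero when the address lies outside the 32-bit compat range. */
static int model_upper_bits_set(uint64_t addr)
{
	return (addr >> 32) != 0;
}
```

With the hypothetical inputs entry = 0x10 and futex_offset = -0x080e9438
(32-bit pattern 0xf7f16bc8), model_buggy_uaddr() returns
0xfffffffff7f16bd8, an address no 32-bit task can have mapped.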

This triggered a problem on sparc64 running 32-bit applications which
would lock up a cpu looping forever in the fault handling for the
userspace load in handle_futex_death().

Compat userspace runs with address masking (wherein the cpu zeros out
the top 32-bits of every effective address given to a memory operation
instruction) so the sparc64 fault handler accounts for this by
zero'ing out the top 32-bits of the fault address too.

Since the kernel properly uses the compat_uptr interfaces, kernel side
accesses to compat userspace work too since they will only use
addresses with the top 32-bits clear.

Because of this compat futex layer bug we get into the following loop
when executing the get_user() load near the top of handle_futex_death():

1) load from address '0xfffffffff7f16bd8', FAULT
2) fault handler clears upper 32-bits, processes fault
   for address '0xf7f16bd8' which succeeds
3) goto #1
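
The loop above can be modeled in userspace. This is only a sketch under
stated assumptions: model_mm, model_load(), model_fault() and
model_get_user() are hypothetical names rather than kernel functions, an
8 KiB page size is assumed, and the retry cap exists only so that the
model terminates (the real loop has no such cap).

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 13	/* assumption: 8 KiB pages */

/* One-page "address space": remembers the last page the handler mapped. */
struct model_mm {
	uint64_t mapped_page;
	bool mapped;
};

/* A kernel-side load uses the full 64-bit address, with no masking. */
static bool model_load(const struct model_mm *mm, uint64_t addr)
{
	return mm->mapped && (addr >> MODEL_PAGE_SHIFT) == mm->mapped_page;
}

/* Fault handler for a compat task: clear the upper 32 bits, then map. */
static void model_fault(struct model_mm *mm, uint64_t fault_addr)
{
	uint64_t masked = fault_addr & 0xffffffffULL;

	mm->mapped_page = masked >> MODEL_PAGE_SHIFT;
	mm->mapped = true;
}

/*
 * Retry the load after each fault, as the hardware would.
 * Returns the number of faults taken, or -1 if the load still
 * fails after max_tries attempts.
 */
static int model_get_user(struct model_mm *mm, uint64_t addr, int max_tries)
{
	int faults = 0;

	for (int i = 0; i < max_tries; i++) {
		if (model_load(mm, addr))
			return faults;
		model_fault(mm, addr);	/* always "succeeds" for the masked address */
		faults++;
	}
	return -1;
}
```

In this model a load from 0xf7f16bd8 completes after one fault, while a
load from 0xfffffffff7f16bd8 never completes: every fault is resolved for
the masked address, which is not the address the load retries.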

I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
for their tireless efforts helping me track down this bug.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 00b5726..8089e7e 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -76,11 +76,16 @@ void compat_exit_robust_list(struct task_struct *curr)
 		 * A pending lock might already be on the list, so
 		 * dont process it twice:
 		 */
-		if (entry != pending)
-			if (handle_futex_death((void __user *)entry + futex_offset,
-						curr, pi))
-				return;
+		if (entry != pending) {
+			void __user *uaddr;
+			compat_uptr_t base;
+
+			base = ptr_to_compat(entry);
+			uaddr = compat_ptr(base + futex_offset);

+			if (handle_futex_death(uaddr, curr, pi))
+				return;
+		}
 		if (rc)
 			return;
 		uentry = next_uentry;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
@ 2007-11-07  5:13   ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-07  5:13 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, sparclinux, linux-arch, bernd, joy, fabbione, arnd

From: David Miller <davem@davemloft.net>
Date: Tue, 06 Nov 2007 20:34:33 -0800 (PST)

> [FUTEX]: Fix address computation in compat code.

Sorry, I just noticed there is a second handle_futex_death()
call in compat_exit_robust_list() which has the same
address computation bug.

Here is an updated patch:

[FUTEX]: Fix address computation in compat code.

compat_exit_robust_list() computes a pointer to the
futex entry in userspace as follows:

	(void __user *)entry + futex_offset

'entry' is a 'struct robust_list __user *', and
'futex_offset' is a 'compat_long_t' (typically a 's32').

Things explode if the 32-bit sign bit is set in futex_offset.

Type promotion sign extends futex_offset to a 64-bit value before
adding it to 'entry'.

This triggered a problem on sparc64 running 32-bit applications which
would lock up a cpu looping forever in the fault handling for the
userspace load in handle_futex_death().

Compat userspace runs with address masking (wherein the cpu zeros out
the top 32-bits of every effective address given to a memory operation
instruction) so the sparc64 fault handler accounts for this by
zero'ing out the top 32-bits of the fault address too.

Since the kernel properly uses the compat_uptr interfaces, kernel side
accesses to compat userspace work too since they will only use
addresses with the top 32-bits clear.
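
As a rough userspace illustration of why going through the compat
interfaces keeps the top 32-bits clear, the sketch below models them with
plain integers. The model_* names and typedefs are hypothetical stand-ins.
It assumes ptr_to_compat() truncates to 32 bits and compat_ptr()
zero-extends, which is the common definition but is architecture-specific
in the real kernel.

```c
#include <stdint.h>

typedef uint32_t model_compat_uptr_t;	/* stand-in for compat_uptr_t */
typedef int32_t model_compat_long_t;	/* stand-in for compat_long_t */

/* Assumed behaviour of ptr_to_compat(): keep the low 32 bits. */
static model_compat_uptr_t model_ptr_to_compat(uint64_t uptr)
{
	return (model_compat_uptr_t)uptr;
}

/* Assumed behaviour of compat_ptr(): zero-extend to a 64-bit address. */
static uint64_t model_compat_ptr(model_compat_uptr_t uptr)
{
	return (uint64_t)uptr;
}

/* Same shape as the address computation in the fix. */
static uint64_t model_futex_uaddr(uint64_t entry,
				  model_compat_long_t futex_offset)
{
	model_compat_uptr_t base = model_ptr_to_compat(entry);

	/* 32-bit unsigned addition: wraps modulo 2^32, no sign extension */
	return model_compat_ptr(base + (model_compat_uptr_t)futex_offset);
}
```

With the hypothetical inputs entry = 0x10 and futex_offset = -0x080e9438,
model_futex_uaddr() returns 0xf7f16bd8. An ordinary small negative offset
gives the same answer as before, for example entry = 0xf7f16bec with
futex_offset = -20 also yields 0xf7f16bd8.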

Because of this compat futex layer bug we get into the following loop
when executing the get_user() load near the top of handle_futex_death():

1) load from address '0xfffffffff7f16bd8', FAULT
2) fault handler clears upper 32-bits, processes fault
   for address '0xf7f16bd8' which succeeds
3) goto #1

I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
for their tireless efforts helping me track down this bug.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 00b5726..1931457 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -30,6 +30,15 @@ fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **entry,
 	return 0;
 }

+static void __user *futex_uaddr(struct robust_list *entry,
+				compat_long_t futex_offset)
+{
+	compat_uptr_t base = ptr_to_compat(entry);
+	void __user *uaddr = compat_ptr(base + futex_offset);
+
+	return uaddr;
+}
+
 /*
  * Walk curr->robust_list (very carefully, it's a userspace list!)
  * and mark any locks found there dead, and notify any waiters.
@@ -76,11 +85,13 @@ void compat_exit_robust_list(struct task_struct *curr)
 		 * A pending lock might already be on the list, so
 		 * dont process it twice:
 		 */
-		if (entry != pending)
-			if (handle_futex_death((void __user *)entry + futex_offset,
-						curr, pi))
-				return;
+		if (entry != pending) {
+			void __user *uaddr = futex_uaddr(entry,
+							 futex_offset);

+			if (handle_futex_death(uaddr, curr, pi))
+				return;
+		}
 		if (rc)
 			return;
 		uentry = next_uentry;
@@ -94,9 +105,11 @@ void compat_exit_robust_list(struct task_struct *curr)

 		cond_resched();
 	}
-	if (pending)
-		handle_futex_death((void __user *)pending + futex_offset,
-				   curr, pip);
+	if (pending) {
+		void __user *uaddr = futex_uaddr(pending, futex_offset);
+
+		handle_futex_death(uaddr, curr, pip);
+	}
 }

 asmlinkage long

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  5:13   ` David Miller
@ 2007-11-09 20:22     ` Andrew Morton
  -1 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2007-11-09 20:22 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, linux-kernel, sparclinux, linux-arch, bernd, joy,
	fabbione, arnd, stable

On Tue, 06 Nov 2007 21:13:56 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Tue, 06 Nov 2007 20:34:33 -0800 (PST)
> 
> > [FUTEX]: Fix address computation in compat code.
> 
> Sorry, I just noticed there is a second handle_futex_death()
> call in compat_exit_robust_list() which has the same
> address computation bug.
> 
> Here is an updated patch:
> 
> [FUTEX]: Fix address computation in compat code.
> 
> compat_exit_robust_list() computes a pointer to the
> futex entry in userspace as follows:
> 
> 	(void __user *)entry + futex_offset
> 
> 'entry' is a 'struct robust_list __user *', and
> 'futex_offset' is a 'compat_long_t' (typically a 's32').
> 
> Things explode if the 32-bit sign bit is set in futex_offset.
> 
> Type promotion sign extends futex_offset to a 64-bit value before
> adding it to 'entry'.
> 
> This triggered a problem on sparc64 running 32-bit applications which
> would lock up a cpu looping forever in the fault handling for the
> userspace load in handle_futex_death().
> 
> Compat userspace runs with address masking (wherein the cpu zeros out
> the top 32-bits of every effective address given to a memory operation
> instruction) so the sparc64 fault handler accounts for this by
> zero'ing out the top 32-bits of the fault address too.
> 
> Since the kernel properly uses the compat_uptr interfaces, kernel side
> accesses to compat userspace work too since they will only use
> addresses with the top 32-bits clear.
> 
> Because of this compat futex layer bug we get into the following loop
> when executing the get_user() load near the top of handle_futex_death():
> 
> 1) load from address '0xfffffffff7f16bd8', FAULT
> 2) fault handler clears upper 32-bits, processes fault
>    for address '0xf7f16bd8' which succeeds
> 3) goto #1
> 
> I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
> for their tireless efforts helping me track down this bug.
> 

I tagged this as needed-in-2.6.23.x.  Please let me know if that is not
appropriate.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-09 20:22     ` Andrew Morton
@ 2007-11-09 22:14       ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-09 22:14 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, linux-kernel, sparclinux, linux-arch, bernd, joy,
	fabbione, arnd, stable

From: Andrew Morton <akpm@linux-foundation.org>
Date: Fri, 9 Nov 2007 12:22:08 -0800

> I tagged this as needed-in-2.6.23.x.  Please let me know if that is not
> appropriate.

It is.  I have it queued up for -stable already.

I'm just waiting for Linus to get back from wherever he has been
the past few days so he can suck it in and it's upstream before I
submit it to -stable.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
  (?)
  (?)
@ 2007-11-07 14:25 ` Josip Rodin
  -1 siblings, 0 replies; 38+ messages in thread
From: Josip Rodin @ 2007-11-07 14:25 UTC (permalink / raw)
  To: sparclinux

On Tue, Nov 06, 2007 at 09:13:56PM -0800, David Miller wrote:
> > [FUTEX]: Fix address computation in compat code.
> Here is an updated patch:

I applied the patch, rebooted into the new kernel, and let lebrun run its
buildd, but the apt package fetching method constantly times out trying to
reach incoming.debian.org - that (unrelated) server is having a downtime.
So unfortunately I can't properly test this long-awaited fix :)

But I did the artificial tests, like running dpkg-query --search libc.so.6
in loops, and this seems to work well. Thanks a lot!

-- 
     2. That which causes joy or happiness.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (2 preceding siblings ...)
  (?)
@ 2007-11-07 14:35 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-11-07 14:35 UTC (permalink / raw)
  To: sparclinux


> But I did the artificial tests, like running dpkg-query --search libc.so.6
> in loops, and this seems to work well. Thanks a lot!
> 

I was running aptitude -u in a loop for half an hour now, and it didn't
crash, so I assume that fixed the bug. Many thanks for the patch David!

-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (3 preceding siblings ...)
  (?)
@ 2007-11-08  0:01 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-08  0:01 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Wed, 07 Nov 2007 15:35:42 +0100

> 
> > But I did the artificial tests, like running dpkg-query --search libc.so.6
> > in loops, and this seems to work well. Thanks a lot!
> > 
> 
> I was running aptitude -u in a loop for half an hour now, and it didn't
> crash, so I assume that fixed the bug. Many thanks for the patch David!

Many thanks for testing :-)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (4 preceding siblings ...)
  (?)
@ 2007-11-11  6:04 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-11  6:04 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Wed, 07 Nov 2007 15:35:42 +0100

> 
> > But I did the artificial tests, like running dpkg-query --search libc.so.6
> > in loops, and this seems to work well. Thanks a lot!
> > 
> 
> I was running aptitude -u in a loop for half an hour now, and it didn't
> crash, so I assume that fixed the bug. Many thanks for the patch David!

Many thanks for helping me track it down.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (5 preceding siblings ...)
  (?)
@ 2007-11-11  6:13 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-11  6:13 UTC (permalink / raw)
  To: sparclinux

From: Josip Rodin <joy@entuzijast.net>
Date: Wed, 7 Nov 2007 15:25:46 +0100

> I applied the patch, rebooted into the new kernel, and let lebrun run its
> buildd, but the apt package fetching method constantly times out trying to
> reach incoming.debian.org - that (unrelated) server is having a downtime.
> So unfortunately I can't properly test this long-awaited fix :)
> 
> But I did the artificial tests, like running dpkg-query --search libc.so.6
> in loops, and this seems to work well. Thanks a lot!

Please let me know if things go smoothly when the
build becomes active again.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (6 preceding siblings ...)
  (?)
@ 2007-11-11  6:27 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-11-11  6:27 UTC (permalink / raw)
  To: sparclinux

David Miller wrote:
> From: Bernd Zeimetz <bernd@bzed.de>
> Date: Wed, 07 Nov 2007 15:35:42 +0100
> 
>>> But I did the artificial tests, like running dpkg-query --search libc.so.6
>>> in loops, and this seems to work well. Thanks a lot!
>>>
>> I was running aptitude -u in a loop for half an hour now, and it didn't
>> crash, so I assume that fixed the bug. Many thanks for the patch David!
> 
> Many thanks for helping me track it down.

You're welcome!

The v880 is still running fine. I'll set up the stuff which was supposed
to be running on the machine during the next days, so we'll soon see how
it behaves under a higher load for a longer time.

Thanks again for looking into this annoying bug!


-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (7 preceding siblings ...)
  (?)
@ 2007-11-12 13:16 ` Josip Rodin
  -1 siblings, 0 replies; 38+ messages in thread
From: Josip Rodin @ 2007-11-12 13:16 UTC (permalink / raw)
  To: sparclinux

On Sat, Nov 10, 2007 at 10:13:28PM -0800, David Miller wrote:
> From: Josip Rodin <joy@entuzijast.net>
> Date: Wed, 7 Nov 2007 15:25:46 +0100
> 
> > I applied the patch, rebooted into the new kernel, and let lebrun run its
> > buildd, but the apt package fetching method constantly times out trying to
> > reach incoming.debian.org - that (unrelated) server is having a downtime.
> > So unfortunately I can't properly test this long-awaited fix :)
> > 
> > But I did the artificial tests, like running dpkg-query --search libc.so.6
> > in loops, and this seems to work well. Thanks a lot!
> 
> Please let me know if things go smoothly when the
> build becomes active again.

It became functional again a couple of hours ago, and for now everything
seems just fine, it's happily churning away, load hovers around 1, memory
usage seems normal, and nothing's getting stuck.

-- 
     2. That which causes joy or happiness.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (8 preceding siblings ...)
  (?)
@ 2007-11-16 21:17 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-11-16 21:17 UTC (permalink / raw)
  To: sparclinux

[-- Attachment #1: Type: text/plain, Size: 789 bytes --]

Hi David,

> Please let me know if things go smoothly when the
> build becomes active again.

first the good news:
The U60 here is still building and working fine, also I didn't hear any bad
news from lebrun.d.o.

the not so good news:
the v880 (4x US III) here was hit by a stuck process again, after
running fine for some time now. But the machine didn't freeze, one CPU
was running at 100%, but otherwise the machine was responsive.

I think I'll also run a full diag in service mode to make sure it's not a
CPU bug.
The sysrq-g output is attached, I hope you can make sense out of it.
We'll also add some extra workload to the other machines here to try to
trigger the bug on other CPUs, too.

Best regards,

Bernd

-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>

[-- Attachment #2: v880-sysrq-g.txt --]
[-- Type: text/plain, Size: 30540 bytes --]

Nov 16 21:40:57 titan kernel: [12019.840715] SysRq : Show Global CPU Regs
Nov 16 21:40:57 titan kernel: [12019.886698]   CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
Nov 16 21:40:57 titan kernel: [12020.003361]              TPC[atomic_sub_ret+0x0/0x30]
Nov 16 21:40:58 titan kernel: [12020.063757]              O7[schedule+0x6dc/0x7a4]
Nov 16 21:40:58 titan kernel: [12020.120007]              I7[do_syslog+0xfc/0x400]
Nov 16 21:40:58 titan kernel: [12020.176249] * CPU[  1]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:40:58 titan kernel: [12020.295006]   CPU[  2]: TSTATE[0000000011009602] TPC[000000000042fc30] TNPC[000000000042fc34] TASK[cat:4365]
Nov 16 21:40:58 titan kernel: [12020.412726]              TPC[udelay+0x0/0x1c]
Nov 16 21:40:58 titan kernel: [12020.464809]              O7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:40:58 titan kernel: [12020.534581]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:40:58 titan kernel: [12020.604370]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:40:58 titan kernel: [12020.723128]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:40:58 titan kernel: [12020.778323]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:40:58 titan kernel: [12020.832498]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:05 titan ntpd[2766]: adjusting local clock by -20.711568s
Nov 16 21:41:26 titan kernel: [12048.836922] SysRq : Show Global CPU Regs
Nov 16 21:41:26 titan kernel: [12048.882885] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:26 titan kernel: [12049.001617]   CPU[  1]: TSTATE[0000009911009602] TPC[0000000000407af0] TNPC[0000000000407af4] TASK[swapper:0]
Nov 16 21:41:27 titan kernel: [12049.120373]              TPC[__tsb_context_switch+0xf0/0x100]
Nov 16 21:41:27 titan kernel: [12049.189109]              O7[schedule+0x514/0x7a4]
Nov 16 21:41:27 titan kernel: [12049.245354]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:27 titan kernel: [12049.299516]   CPU[  2]: TSTATE[0000000011009603] TPC[000000000042faa0] TNPC[000000000042fc18] TASK[cat:4365]
Nov 16 21:41:27 titan kernel: [12049.417244]              TPC[stick_get_tick+0x10/0x14]
Nov 16 21:41:27 titan kernel: [12049.478681]              O7[__delay+0x28/0x48]
Nov 16 21:41:27 titan kernel: [12049.531809]              I7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:41:27 titan kernel: [12049.601598]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:27 titan kernel: [12049.720351]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:27 titan kernel: [12049.775551]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:27 titan kernel: [12049.829725]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:28 titan kernel: [12050.571422] SysRq : Show Global CPU Regs
Nov 16 21:41:28 titan kernel: [12050.617320] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:28 titan kernel: [12050.736074]   CPU[  1]: TSTATE[0000004411009604] TPC[000000000045731c] TNPC[0000000000457320] TASK[swapper:0]
Nov 16 21:41:28 titan kernel: [12050.854834]              TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:28 titan kernel: [12050.923565]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:29 titan kernel: [12050.980856]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:29 titan kernel: [12051.046480]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:29 titan kernel: [12051.164194]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:29 titan kernel: [12051.235018]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:29 titan kernel: [12051.303771]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:29 titan kernel: [12051.373560]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:29 titan kernel: [12051.492314]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:29 titan kernel: [12051.547514]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:29 titan kernel: [12051.601683]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:30 titan kernel: [12052.214441] SysRq : Show Global CPU Regs
Nov 16 21:41:30 titan kernel: [12052.260326] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:30 titan kernel: [12052.379078]   CPU[  1]: TSTATE[0000004411009604] TPC[000000000045731c] TNPC[0000000000457320] TASK[swapper:0]
Nov 16 21:41:30 titan kernel: [12052.497835]              TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:30 titan kernel: [12052.566571]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:30 titan kernel: [12052.623860]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:30 titan kernel: [12052.689485]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:30 titan kernel: [12052.807200]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:30 titan kernel: [12052.878023]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:30 titan kernel: [12052.946773]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:30 titan kernel: [12053.016564]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:31 titan kernel: [12053.135319]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:31 titan kernel: [12053.190520]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:31 titan kernel: [12053.244688]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:32 titan kernel: [12054.286796] SysRq : Show Global CPU Regs
Nov 16 21:41:33 titan kernel: [12054.332702] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:33 titan kernel: [12054.451456]   CPU[  1]: TSTATE[0000004411009604] TPC[00000000004572fc] TNPC[0000000000457300] TASK[swapper:0]
Nov 16 21:41:33 titan kernel: [12054.570213]              TPC[update_stats_wait_end+0x4/0x88]
Nov 16 21:41:33 titan kernel: [12054.637912]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:33 titan kernel: [12054.695198]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:33 titan kernel: [12054.760821]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:33 titan kernel: [12054.878536]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:33 titan kernel: [12054.949358]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:33 titan kernel: [12055.018108]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:33 titan kernel: [12055.087900]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:33 titan kernel: [12055.206656]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:33 titan kernel: [12055.261852]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:33 titan kernel: [12055.316017]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:34 titan kernel: [12056.187865] SysRq : Show Global CPU Regs
Nov 16 21:41:34 titan kernel: [12056.233830] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:34 titan kernel: [12056.352582]   CPU[  1]: TSTATE[0000004411009604] TPC[0000000000457300] TNPC[0000000000457304] TASK[swapper:0]
Nov 16 21:41:34 titan kernel: [12056.471342]              TPC[update_stats_wait_end+0x8/0x88]
Nov 16 21:41:34 titan kernel: [12056.539035]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:34 titan kernel: [12056.596324]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:34 titan kernel: [12056.661947]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:34 titan kernel: [12056.779663]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:34 titan kernel: [12056.850487]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:34 titan kernel: [12056.919237]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:34 titan kernel: [12056.989027]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:35 titan kernel: [12057.107784]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:35 titan kernel: [12057.162981]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:35 titan kernel: [12057.217152]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:35 titan kernel: [12057.839154] SysRq : Show Global CPU Regs
Nov 16 21:41:35 titan kernel: [12057.885062] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:35 titan kernel: [12058.003816]   CPU[  1]: TSTATE[0000004411009604] TPC[0000000000457358] TNPC[000000000045735c] TASK[swapper:0]
Nov 16 21:41:36 titan kernel: [12058.122573]              TPC[update_stats_wait_end+0x60/0x88]
Nov 16 21:41:36 titan kernel: [12058.191309]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:36 titan kernel: [12058.248599]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:36 titan kernel: [12058.314225]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:36 titan kernel: [12058.431940]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:36 titan kernel: [12058.502763]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:36 titan kernel: [12058.571513]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:36 titan kernel: [12058.641304]   CPU[  3]: TSTATE[0000009980009602] TPC[000000000042888c] TNPC[0000000000428890] TASK[swapper:0]
Nov 16 21:41:36 titan kernel: [12058.760058]              TPC[cpu_idle+0x80/0xb8]
Nov 16 21:41:36 titan kernel: [12058.815257]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:36 titan kernel: [12058.869428]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:37 titan kernel: [12059.540545] SysRq : Show Global CPU Regs
Nov 16 21:41:38 titan kernel: [12059.586508] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:38 titan kernel: [12059.705259]   CPU[  1]: TSTATE[0000004411009604] TPC[000000000045731c] TNPC[0000000000457320] TASK[swapper:0]
Nov 16 21:41:38 titan kernel: [12059.824017]              TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:38 titan kernel: [12059.892758]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:38 titan kernel: [12059.950042]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:38 titan kernel: [12060.015666]   CPU[  2]: TSTATE[0000000011009602] TPC[000000000044194c] TNPC[0000000000441950] TASK[cat:4365]
Nov 16 21:41:38 titan kernel: [12060.133384]              TPC[cheetah_xcall_deliver+0x48/0x23c]
Nov 16 21:41:38 titan kernel: [12060.203162]              O7[cheetah_xcall_deliver+0x1c0/0x23c]
Nov 16 21:41:38 titan kernel: [12060.272956]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:38 titan kernel: [12060.342746]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:38 titan kernel: [12060.461501]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:38 titan kernel: [12060.516701]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:38 titan kernel: [12060.570863]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:39 titan kernel: [12061.338017] SysRq : Show Global CPU Regs
Nov 16 21:41:39 titan kernel: [12061.383989] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:39 titan kernel: [12061.502742]   CPU[  1]: TSTATE[0000004411009604] TPC[000000000045731c] TNPC[0000000000457320] TASK[swapper:0]
Nov 16 21:41:39 titan kernel: [12061.621500]              TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:39 titan kernel: [12061.690239]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:39 titan kernel: [12061.747525]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:39 titan kernel: [12061.813150]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:39 titan kernel: [12061.930866]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:39 titan kernel: [12062.001689]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:40 titan kernel: [12062.070439]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:40 titan kernel: [12062.140230]   CPU[  3]: TSTATE[0000009980009602] TPC[000000000042888c] TNPC[0000000000428890] TASK[swapper:0]
Nov 16 21:41:40 titan kernel: [12062.258985]              TPC[cpu_idle+0x80/0xb8]
Nov 16 21:41:40 titan kernel: [12062.314182]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:40 titan kernel: [12062.368354]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:41 titan kernel: [12063.040690] SysRq : Show Global CPU Regs
Nov 16 21:41:41 titan kernel: [12063.086578] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:41 titan kernel: [12063.205331]   CPU[  1]: TSTATE[0000004411009604] TPC[000000000045731c] TNPC[0000000000457320] TASK[swapper:0]
Nov 16 21:41:41 titan kernel: [12063.324089]              TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:41 titan kernel: [12063.392824]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:41 titan kernel: [12063.450114]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:41 titan kernel: [12063.515737]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:41 titan kernel: [12063.633453]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:41 titan kernel: [12063.704277]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:41 titan kernel: [12063.773026]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:41 titan kernel: [12063.842816]   CPU[  3]: TSTATE[0000009980009602] TPC[000000000042888c] TNPC[0000000000428890] TASK[swapper:0]
Nov 16 21:41:41 titan kernel: [12063.961573]              TPC[cpu_idle+0x80/0xb8]
Nov 16 21:41:42 titan kernel: [12064.016772]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:42 titan kernel: [12064.070939]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:43 titan kernel: [12065.460173] SysRq : Show Global CPU Regs
Nov 16 21:41:43 titan kernel: [12065.506145] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:43 titan kernel: [12065.624894]   CPU[  1]: TSTATE[0000004411009604] TPC[0000000000457300] TNPC[0000000000457304] TASK[swapper:0]
Nov 16 21:41:43 titan kernel: [12065.743649]              TPC[update_stats_wait_end+0x8/0x88]
Nov 16 21:41:43 titan kernel: [12065.811345]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:43 titan kernel: [12065.868634]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:43 titan kernel: [12065.934255]   CPU[  2]: TSTATE[0000000011009603] TPC[000000000042fc0c] TNPC[000000000042fc10] TASK[cat:4365]
Nov 16 21:41:44 titan kernel: [12066.051970]              TPC[__delay+0x24/0x48]
Nov 16 21:41:44 titan kernel: [12066.106127]              O7[__delay+0x10/0x48]
Nov 16 21:41:44 titan kernel: [12066.159254]              I7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:41:44 titan kernel: [12066.229046]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:44 titan kernel: [12066.347801]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:44 titan kernel: [12066.402999]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:44 titan kernel: [12066.457170]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:45 titan kernel: [12067.584733] SysRq : Show Global CPU Regs
Nov 16 21:41:45 titan kernel: [12067.630702] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:41:45 titan kernel: [12067.749456]   CPU[  1]: TSTATE[0000004411009604] TPC[000000000045731c] TNPC[0000000000457320] TASK[swapper:0]
Nov 16 21:41:45 titan kernel: [12067.868214]              TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:45 titan kernel: [12067.936948]              O7[sched_clock+0x10/0x30]
Nov 16 21:41:46 titan kernel: [12067.994238]              I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:46 titan kernel: [12068.059861]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:41:46 titan kernel: [12068.177579]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:46 titan kernel: [12068.248401]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:46 titan kernel: [12068.317151]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:46 titan kernel: [12068.386940]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:41:46 titan kernel: [12068.505696]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:46 titan kernel: [12068.560896]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:46 titan kernel: [12068.615067]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:44:55 titan kernel: [12257.158581] SysRq : Show Global CPU Regs
Nov 16 21:44:55 titan kernel: [12257.204566] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:44:55 titan kernel: [12257.323298]   CPU[  1]: TSTATE[0000008811009602] TPC[000000000067b45c] TNPC[000000000067b460] TASK[swapper:0]
Nov 16 21:44:55 titan kernel: [12257.442053]              TPC[schedule+0x4b8/0x7a4]
Nov 16 21:44:55 titan kernel: [12257.499328]              O7[schedule+0x4a4/0x7a4]
Nov 16 21:44:55 titan kernel: [12257.555582]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:44:55 titan kernel: [12257.609743]   CPU[  2]: TSTATE[0000009911009603] TPC[000000000053a2a0] TNPC[000000000053a2a4] TASK[cat:4365]
Nov 16 21:44:55 titan kernel: [12257.727474]              TPC[find_next_bit+0x14/0x11c]
Nov 16 21:44:55 titan kernel: [12257.788911]              O7[__first_cpu+0xc/0x28]
Nov 16 21:44:55 titan kernel: [12257.845162]              I7[cheetah_xcall_deliver+0x1c0/0x23c]
Nov 16 21:44:55 titan kernel: [12257.914950]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:44:55 titan kernel: [12258.033705]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:44:55 titan kernel: [12258.088902]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:44:56 titan kernel: [12258.143075]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:44:56 titan kernel: [12258.631365] SysRq : Show Global CPU Regs
Nov 16 21:44:56 titan kernel: [12258.677343] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:44:56 titan kernel: [12258.796095]   CPU[  1]: TSTATE[0000009911009602] TPC[0000000000407af0] TNPC[0000000000407af4] TASK[swapper:0]
Nov 16 21:44:56 titan kernel: [12258.914849]              TPC[__tsb_context_switch+0xf0/0x100]
Nov 16 21:44:56 titan kernel: [12258.983586]              O7[schedule+0x514/0x7a4]
Nov 16 21:44:56 titan kernel: [12259.039832]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:44:56 titan kernel: [12259.094002]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:44:57 titan kernel: [12259.211716]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:44:57 titan kernel: [12259.282539]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:44:57 titan kernel: [12259.351292]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:44:57 titan kernel: [12259.421080]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:44:57 titan kernel: [12259.539836]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:44:57 titan kernel: [12259.595034]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:44:57 titan kernel: [12259.649205]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:44:58 titan kernel: [12260.261944] SysRq : Show Global CPU Regs
Nov 16 21:44:58 titan kernel: [12260.307849] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:44:58 titan kernel: [12260.426600]   CPU[  1]: TSTATE[0000000011009602] TPC[000000000067b520] TNPC[000000000067b524] TASK[swapper:0]
Nov 16 21:44:58 titan kernel: [12260.545350]              TPC[schedule+0x57c/0x7a4]
Nov 16 21:44:58 titan kernel: [12260.602630]              O7[schedule+0x570/0x7a4]
Nov 16 21:44:58 titan kernel: [12260.658881]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:44:58 titan kernel: [12260.713048]   CPU[  2]: TSTATE[0000000011009602] TPC[000000000042fc3c] TNPC[000000000042fc40] TASK[cat:4365]
Nov 16 21:44:58 titan kernel: [12260.830769]              TPC[udelay+0xc/0x1c]
Nov 16 21:44:58 titan kernel: [12260.882836]              O7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:44:58 titan kernel: [12260.952629]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:44:58 titan kernel: [12261.022419]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:44:59 titan kernel: [12261.141175]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:44:59 titan kernel: [12261.196374]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:44:59 titan kernel: [12261.250543]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:45:00 titan kernel: [12263.019145] SysRq : Show Global CPU Regs
Nov 16 21:45:01 titan kernel: [12263.065114] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:45:01 titan kernel: [12263.183867]   CPU[  1]: TSTATE[0000009911009602] TPC[0000000000407af0] TNPC[0000000000407af4] TASK[swapper:0]
Nov 16 21:45:01 titan kernel: [12263.302617]              TPC[__tsb_context_switch+0xf0/0x100]
Nov 16 21:45:01 titan kernel: [12263.371356]              O7[schedule+0x514/0x7a4]
Nov 16 21:45:01 titan kernel: [12263.427605]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:01 titan kernel: [12263.481771]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:45:01 titan kernel: [12263.599488]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:45:01 titan kernel: [12263.670311]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:45:01 titan kernel: [12263.739061]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:45:01 titan kernel: [12263.808853]   CPU[  3]: TSTATE[0000009980009602] TPC[000000000042888c] TNPC[0000000000428890] TASK[swapper:0]
Nov 16 21:45:01 titan kernel: [12263.927608]              TPC[cpu_idle+0x80/0xb8]
Nov 16 21:45:01 titan kernel: [12263.982805]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:02 titan /USR/SBIN/CRON[4397]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Nov 16 21:45:02 titan kernel: [12264.036976]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:45:02 titan kernel: [12264.567544] SysRq : Show Global CPU Regs
Nov 16 21:45:02 titan kernel: [12264.613433] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:45:02 titan kernel: [12264.732184]   CPU[  1]: TSTATE[0000008811009602] TPC[000000000067b448] TNPC[000000000067b44c] TASK[swapper:0]
Nov 16 21:45:02 titan kernel: [12264.850937]              TPC[schedule+0x4a4/0x7a4]
Nov 16 21:45:02 titan kernel: [12264.908214]              O7[schedule+0x3f0/0x7a4]
Nov 16 21:45:02 titan kernel: [12264.964465]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:02 titan kernel: [12265.018632]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:45:03 titan kernel: [12265.136350]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:45:03 titan kernel: [12265.207173]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:45:03 titan kernel: [12265.275923]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:45:03 titan kernel: [12265.345713]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:45:03 titan kernel: [12265.464467]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:45:03 titan kernel: [12265.519665]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:03 titan kernel: [12265.573837]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:45:04 titan kernel: [12267.080867] SysRq : Show Global CPU Regs
Nov 16 21:45:05 titan kernel: [12267.126742] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:45:05 titan kernel: [12267.245494]   CPU[  1]: TSTATE[0000008811009602] TPC[000000000067d0ec] TNPC[000000000067d0f0] TASK[swapper:0]
Nov 16 21:45:05 titan kernel: [12267.364246]              TPC[_spin_lock_irqsave+0x10/0x24]
Nov 16 21:45:05 titan kernel: [12267.429858]              O7[schedule+0x4a4/0x7a4]
Nov 16 21:45:05 titan kernel: [12267.486107]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:05 titan kernel: [12267.540275]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:45:05 titan kernel: [12267.657992]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:45:05 titan kernel: [12267.728815]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:45:05 titan kernel: [12267.797566]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:45:05 titan kernel: [12267.867356]   CPU[  3]: TSTATE[0000009980009602] TPC[000000000042888c] TNPC[0000000000428890] TASK[swapper:0]
Nov 16 21:45:05 titan kernel: [12267.986110]              TPC[cpu_idle+0x80/0xb8]
Nov 16 21:45:05 titan kernel: [12268.041309]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:06 titan kernel: [12268.095478]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:45:06 titan kernel: [12268.729495] SysRq : Show Global CPU Regs
Nov 16 21:45:06 titan kernel: [12268.775372] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:45:06 titan kernel: [12268.894124]   CPU[  1]: TSTATE[0000008811009602] TPC[000000000067d0ec] TNPC[000000000067d0f0] TASK[swapper:0]
Nov 16 21:45:06 titan kernel: [12269.012875]              TPC[_spin_lock_irqsave+0x10/0x24]
Nov 16 21:45:06 titan kernel: [12269.078489]              O7[schedule+0x4a4/0x7a4]
Nov 16 21:45:07 titan kernel: [12269.134739]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:07 titan kernel: [12269.188905]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:45:07 titan kernel: [12269.306621]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:45:07 titan kernel: [12269.377444]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:45:07 titan kernel: [12269.446196]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:45:07 titan kernel: [12269.515986]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:45:07 titan kernel: [12269.634741]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:45:07 titan kernel: [12269.689942]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:07 titan kernel: [12269.744109]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:45:08 titan kernel: [12270.317470] SysRq : Show Global CPU Regs
Nov 16 21:45:08 titan kernel: [12270.363381] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:45:08 titan kernel: [12270.482131]   CPU[  1]: TSTATE[0000008811009602] TPC[000000000067b440] TNPC[000000000067b444] TASK[swapper:0]
Nov 16 21:45:08 titan kernel: [12270.600881]              TPC[schedule+0x49c/0x7a4]
Nov 16 21:45:08 titan kernel: [12270.658160]              O7[schedule+0x3f0/0x7a4]
Nov 16 21:45:08 titan kernel: [12270.714411]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:08 titan kernel: [12270.768579]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:45:08 titan kernel: [12270.886295]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:45:08 titan kernel: [12270.957116]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:45:08 titan kernel: [12271.025867]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:45:08 titan kernel: [12271.095658]   CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]
Nov 16 21:45:09 titan kernel: [12271.214414]              TPC[cpu_idle+0x94/0xb8]
Nov 16 21:45:09 titan kernel: [12271.269613]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:09 titan kernel: [12271.323782]              I7[start_kernel+0x31c/0x32c]
Nov 16 21:45:09 titan kernel: [12272.052636] SysRq : Show Global CPU Regs
Nov 16 21:45:10 titan kernel: [12272.098570] * CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]
Nov 16 21:45:10 titan kernel: [12272.217324]   CPU[  1]: TSTATE[0000009911009602] TPC[000000000067b408] TNPC[000000000067b40c] TASK[swapper:0]
Nov 16 21:45:10 titan kernel: [12272.336073]              TPC[schedule+0x464/0x7a4]
Nov 16 21:45:10 titan kernel: [12272.393352]              O7[schedule+0x3f0/0x7a4]
Nov 16 21:45:10 titan kernel: [12272.449605]              I7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:10 titan kernel: [12272.503770]   CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]
Nov 16 21:45:10 titan kernel: [12272.621485]              TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:45:10 titan kernel: [12272.692308]              O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:45:10 titan kernel: [12272.761060]              I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:45:10 titan kernel: [12272.830851]   CPU[  3]: TSTATE[0000009980009602] TPC[000000000042888c] TNPC[0000000000428890] TASK[swapper:0]
Nov 16 21:45:10 titan kernel: [12272.949606]              TPC[cpu_idle+0x80/0xb8]
Nov 16 21:45:10 titan kernel: [12273.004804]              O7[cpu_idle+0xa8/0xb8]
Nov 16 21:45:10 titan kernel: [12273.058974]              I7[start_kernel+0x31c/0x32c]


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (9 preceding siblings ...)
  (?)
@ 2007-11-20  6:09 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-11-20  6:09 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Fri, 16 Nov 2007 22:17:07 +0100

> first the good news:
> The U60 here is still building and working fine, and I haven't heard any bad
> news from lebrun.d.o.
> 
> the not so good news:
> the v880 (4x US III) here was hit by a stuck process again, after
> running fine for some time now. But the machine didn't freeze, one CPU
> was running at 100%, but otherwise the machine was responsive.
> 
> I think I'll also run a full diag in service mode to make sure it's not a CPU
> bug.
> The sysrq-g output is attached, I hope you can make sense out of it.
> We'll also add some extra workload to the other machines here to try to
> trigger the bug on other CPUs, too.

I'll look into this when I get a chance, thanks for the report.

I'm leaving for a 2 week trip on Wednesday, so I might not be able
to look into this until early December.

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (10 preceding siblings ...)
  (?)
@ 2007-12-06  8:49 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-06  8:49 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Fri, 16 Nov 2007 22:17:07 +0100

> The sysrq-g output is attached, I hope you can make sense out of it.
> We'll also add some extra workload to the other machines here to try to
> trigger the bug on other CPUs, too.

I just got back from my vacation and started looking at these
dumps.  I think there might be some bug in cheetah_xcall_deliver(),
I'll try to diagnose this some more.

If you cannot reproduce this bug on non-Ultra-III systems that
would help confirm or deny my theory.  Have you been able to
trigger this on your Ultra-II machine for example?  If so, what
do the sysrq-g traces look like there?
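
For anyone reading the dumps alongside this thread, a small script can
tally where each CPU was sampled. This is only a rough sketch: the
function name summarize() and the SAMPLE lines are illustrative
assumptions (the lines are copied from the log above with the syslog
prefix dropped), and the regular expressions assume the sysrq-g line
format seen in that log.

```python
import re
from collections import Counter

# Per-CPU header line of a "Show Global CPU Regs" dump (format assumed from the log).
CPU_RE = re.compile(r"CPU\[\s*(\d+)\]: TSTATE\[[0-9a-f]+\] TPC\[[0-9a-f]+\]")
# Symbolic TPC line that follows a header, e.g. "TPC[cpu_idle+0x94/0xb8]".
SYM_RE = re.compile(r"\sTPC\[([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+\]")


def summarize(lines):
    """Count, per CPU, how often each function shows up as the sampled TPC."""
    counts = {}
    cpu = None
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            # New CPU entry; it may have no symbolic line (CPU 0 in the log).
            cpu = int(m.group(1))
            counts.setdefault(cpu, Counter())
            continue
        m = SYM_RE.search(line)
        if m and cpu is not None:
            counts[cpu][m.group(1)] += 1
            cpu = None  # ignore further lines until the next header
    return counts


# Hypothetical excerpt for demonstration only.
SAMPLE = [
    "* CPU[  0]: TSTATE[0000000000000000] TPC[0000000000000000] TNPC[0000000000000000] TASK[bash:3157]",
    "  CPU[  2]: TSTATE[0000000011009602] TPC[0000000000441a78] TNPC[0000000000441a7c] TASK[cat:4365]",
    "             TPC[cheetah_xcall_deliver+0x174/0x23c]",
    "             O7[cheetah_xcall_deliver+0x6c/0x23c]",
    "  CPU[  3]: TSTATE[0000004480009602] TPC[00000000004288a0] TNPC[00000000004288a4] TASK[swapper:0]",
    "             TPC[cpu_idle+0x94/0xb8]",
    "  CPU[  2]: TSTATE[0000000011009603] TPC[000000000042fc0c] TNPC[000000000042fc10] TASK[cat:4365]",
    "             TPC[__delay+0x24/0x48]",
]

if __name__ == "__main__":
    for cpu, funcs in sorted(summarize(SAMPLE).items()):
        print(cpu, dict(funcs))
```

Run over the full log, a tally like this would be expected to show CPU 2
almost always sampled inside cheetah_xcall_deliver() or a helper called
from it, which is what the samples shown above suggest.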

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (11 preceding siblings ...)
  (?)
@ 2007-12-06 10:43 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-06 10:43 UTC (permalink / raw)
  To: sparclinux

David Miller wrote:
> From: Bernd Zeimetz <bernd@bzed.de>
> Date: Fri, 16 Nov 2007 22:17:07 +0100
> 
>> The sysrq-g output is attached, I hope you can make sense out of it.
>> We'll also add some extra workload to the other machines here to try to
>> trigger the bug on other CPUs, too.
> 
> I just got back from my vacation and started looking at these
> dumps.  I think there might be some bug in cheetah_xcall_deliver(),
> I'll try to diagnose this some more.

I'm not sure if it is related, but non-SMP Kernels don't boot at all on
the machine.

> If you cannot reproduce this bug on non-Ultra-III systems that
> would help confirm or deny my theory.  Have you been able to
> trigger this on your Ultra-II machine for example?  If so, what
> do the sysrq-g traces look like there?

Since your Futex bugfix the Ultra-II machine has been running pretty
stably. I have not managed to trigger the bug there, but it was already
hard to trigger the first time. Even when I run a Kernel without the
Futex bugfix, the machine just hangs at some random point; I never
managed to reproduce the bug easily on US II.


Best regards,

Bernd

-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (12 preceding siblings ...)
  (?)
@ 2007-12-06 11:08 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-06 11:08 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Thu, 06 Dec 2007 11:43:45 +0100

> David Miller wrote:
> > From: Bernd Zeimetz <bernd@bzed.de>
> > Date: Fri, 16 Nov 2007 22:17:07 +0100
> > 
> >> The sysrq-g output is attached, I hope you can make sense out of it.
> >> We'll also add some extra workload to the other machines here to try to
> >> trigger the bug on other CPUs, too.
> > 
> > I just got back from my vacation and started looking at these
> > dumps.  I think there might be some bug in cheetah_xcall_deliver(),
> > I'll try to diagnose this some more.
> 
> I'm not sure if it is related, but non-SMP Kernels don't boot at all on
> the machine.

I doubt it's related as non-SMP kernels won't even have that
code compiled in :-)

What does a failed non-SMP boot say?  If it doesn't even bring up the
console, give it "-p" on the kernel command line.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (13 preceding siblings ...)
  (?)
@ 2007-12-06 12:09 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-06 12:09 UTC (permalink / raw)
  To: sparclinux

David Miller wrote:
> From: Bernd Zeimetz <bernd@bzed.de>
> Date: Thu, 06 Dec 2007 11:43:45 +0100
> 
>> David Miller wrote:
>>> From: Bernd Zeimetz <bernd@bzed.de>
>>> Date: Fri, 16 Nov 2007 22:17:07 +0100
>>>
>>>> The sysrq-g output is attached, I hope you can make sense out of it.
>>>> We'll also add some extra workload to the other machines here to try to
>>>> trigger the bug on other CPUs, too.
>>> I just got back from my vacation and started looking at these
>>> dumps.  I think there might be some bug in cheetah_xcall_deliver(),
>>> I'll try to diagnose this some more.
>> I'm not sure if it is related, but non-SMP Kernels don't boot at all on
>> the machine.
> 
> I doubt it's related as non-SMP kernels won't even have that
> code compiled in :-)
> What does a failed non-SMP boot say?  If it doesn't even bring up the
> console, give it "-p" on the kernel command line.


That's from a 2.6.21-2-sparc64, had the output lying around here. I can
build and install a 2.6.23 and try it again if you want. It would be
good to know if non-SMP kernels work at all on the v880 and larger
machines, same for more recent CPU models - at the moment the Sparc
installer is non-SMP only, which resulted in some extra fun to install
the v880.


Rebooting with command: boot net:dhcp -p
Boot device: /pci@9,700000/network@1,1:dhcp  File and args: -p
Timed out waiting for BOOTP/DHCP reply
\
PROMLIB: Sun IEEE Boot Prom 'OBP 4.22.34 2007/07/23 13:01'
PROMLIB: Root node compatible:
Linux version 2.6.21-2-sparc64 (Debian 2.6.21-6) (waldi@debian.org) (gcc version 4.1.3 20070629 (prerelease) (Debian 4.1.2-13)) #1 Thu Jul 12 12:33:00 UTC 2007
ARCH: SUN4U
Ethernet address: 00:03:ba:0b:07:89
Remapping the kernel... done.
PROM: Built device tree with 125090 bytes of memory.
Booting Linux...
CPU[0]: Caches D[sz(65536):line_sz(32)] I[sz(32768):line_sz(32)]
E[sz(8388608):line_sz(512)]
Built 1 zonelists.  Total pages: 412546
Kernel command line: -p
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 524288 (order: 9, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 8, 2097152 bytes)
Memory: 8311800k available (2360k kernel code, 824k data, 144k init)
[fffff80000000000,000000b0ffb16000]
Calibrating delay using timer specific routine.. 20.00 BogoMIPS
(lpj=40009)
Security Framework v1.0.0 initialized
SELinux:  Disabled at boot.
Capability LSM initialized
Mount-cache hash table entries: 512
NET: Registered protocol family 16
PCI: Probing for controllers.
/pci@8,700000: SCHIZO PCI Bus Module ver[4:0]
/pci@8,700000: PCI CFG[7ffee000000] IO[7ffef000000] MEM[7fe00000000]
/pci@8,600000: SCHIZO PCI Bus Module ver[4:0]
/pci@8,600000: PCI CFG[7ffec000000] IO[7ffed000000] MEM[7fd00000000]
/pci@9,700000: SCHIZO PCI Bus Module ver[4:0]
/pci@9,700000: PCI CFG[7ffea000000] IO[7ffeb000000] MEM[7fc00000000]
/pci@9,600000: SCHIZO PCI Bus Module ver[4:0]
/pci@9,600000: PCI CFG[7ffe8000000] IO[7ffe9000000] MEM[7fb00000000]
PCI1(PBMB): Bus running at 33MHz
PCI1(PBMA): Bus running at 66MHz
PCI0(PBMB): Bus running at 33MHz
PCI0(PBMA): Bus running at 66MHz
ebus0: [flashprom] [bbc] [power] [i2c -> (fru) (fru) (fru) (fru) (fru)
(fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru)
(fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru)
(fru) (fru) (fru) (fru) (fru) (fru) (fru) (temperature) (temperature)
(temperature) (temperature) (temperature) (temperature) (temperature)]
[i2c -> (controller) (smbus-ara) (controller) (temperature)
(temperature) (temperature) (ioexp) (temperature) (controller) (adio)
(adio) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (adio)
(adio) (adio) (adio) (temperature-sensor) (fru) (fru) (fru) (fru) (fru)
(fru) (rscrtc) (hotplug-controller) (hotplug-controller)
(hotplug-controller) (hotplug-controller)] [bbc] [i2c -> (temperature)
(temperature) (temperature)] [i2c -> (nvram) (idprom)] [rtc] [gpio]
[pmc] [rsc-control] [rsc-console] [serial]
power: Control reg at 7fc7e30002e ... not using powerd.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
/pci@9,700000/ebus@1/rtc@1,300070: Clock regs at 000007fc7e300070
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 7, 1048576 bytes)
TCP established hash table entries: 524288 (order: 10, 8388608 bytes)
TCP bind hash table entries: 65536 (order: 6, 524288 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
checking if image is initramfs... it is
Freeing initrd memory: 3238k freed
/memory-controller@0,400000: US3 memory controller at 0000040000400000
[ACTIVE]
/memory-controller@1,400000: US3 memory controller at 0000040000c00000
[ACTIVE]
/memory-controller@2,400000: US3 memory controller at 0000040001400000
[ACTIVE]
ERROR(0): Cheetah error trap taken afsr[0000100000000000]
afar[0000040001c00000] TL1(0)
ERROR(0): TPC[4351dc] TNPC[4351e0] O7[4353b4] TSTATE[80001606]
ERROR(0): TPC<interpret_one_decode_reg+0x0/0xfc>
ERROR(0): M_SYND(0),  E_SYND(0)
ERROR(0): Highest priority error (0000100000000000) "Unmapped error from
system bus"
ERROR(0): D-cache idx[0] tag[0000000000000000] utag[0000000000000000]
stag[0000000000000000]
ERROR(0): D-cache data0[0000000000000000] data1[0000000000000000]
data2[0000000000000000] data3[0000000000000000]
ERROR(0): I-cache idx[0] tag[0000000000000000] utag[0000000000000000]
stag[0000000000000000] u[0000000000000000] l[0000000000000000]
ERROR(0): I-cache INSN0[0000000000000000] INSN1[0000000000000000]
INSN2[0000000000000000] INSN3[0000000000000000]
ERROR(0): I-cache INSN4[0000000000000000] INSN5[0000000000000000]
INSN6[0000000000000000] INSN7[0000000000000000]
ERROR(0): E-cache idx[0] tag[0000000000000000]
ERROR(0): E-cache data0[0000000000000000] data1[0000000000000000]
data2[0000000000000000] data3[0000000000000000]
Kernel panic - not syncing: Irrecoverable deferred error trap.


-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (14 preceding siblings ...)
  (?)
@ 2007-12-06 13:52 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-06 13:52 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Thu, 06 Dec 2007 13:09:18 +0100

> ERROR(0): Cheetah error trap taken afsr[0000100000000000]
> afar[0000040001c00000] TL1(0)
> ERROR(0): TPC[4351dc] TNPC[4351e0] O7[4353b4] TSTATE[80001606]
> ERROR(0): TPC<interpret_one_decode_reg+0x0/0xfc>

I'm pretty sure I know what is causing this, thanks for
the log.

I'll work on a fix after I get some sleep.

Thanks again.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (15 preceding siblings ...)
  (?)
@ 2007-12-07  8:59 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-07  8:59 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Thu, 06 Dec 2007 13:09:18 +0100

> ERROR(0): Cheetah error trap taken afsr[0000100000000000]
> afar[0000040001c00000] TL1(0)
> ERROR(0): TPC[4351dc] TNPC[4351e0] O7[4353b4] TSTATE[80001606]
> ERROR(0): TPC<interpret_one_decode_reg+0x0/0xfc>
> ERROR(0): M_SYND(0),  E_SYND(0)

Please try this patch:

commit 980a9fd582ee9ac6729d6f0ac19ce21ca55aa401
Author: David S. Miller <davem@sunset.davemloft.net>
Date:   Fri Dec 7 00:58:55 2007 -0800

    [SPARC64]: Fix memory controller register access when non-SMP.
    
    get_cpu() always returns zero on non-SMP builds, but we
    really want the physical cpu number in this code in order
    to do the right thing.
    
    Based upon a non-SMP kernel boot failure report from Bernd Zeimetz.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/arch/sparc64/kernel/chmc.c b/arch/sparc64/kernel/chmc.c
index 777d345..6d4f02e 100644
--- a/arch/sparc64/kernel/chmc.c
+++ b/arch/sparc64/kernel/chmc.c
@@ -1,7 +1,6 @@
-/* $Id: chmc.c,v 1.4 2002/01/08 16:00:14 davem Exp $
- * memctrlr.c: Driver for UltraSPARC-III memory controller.
+/* memctrlr.c: Driver for UltraSPARC-III memory controller.
  *
- * Copyright (C) 2001 David S. Miller (davem@redhat.com)
+ * Copyright (C) 2001, 2007 David S. Miller (davem@davemloft.net)
  */
 
 #include <linux/module.h>
@@ -16,6 +15,7 @@
 #include <linux/init.h>
 #include <asm/spitfire.h>
 #include <asm/chmctrl.h>
+#include <asm/cpudata.h>
 #include <asm/oplib.h>
 #include <asm/prom.h>
 #include <asm/io.h>
@@ -242,8 +242,11 @@ int chmc_getunumber(int syndrome_code,
  */
 static u64 read_mcreg(struct mctrl_info *mp, unsigned long offset)
 {
-	unsigned long ret;
-	int this_cpu = get_cpu();
+	unsigned long ret, this_cpu;
+
+	preempt_disable();
+
+	this_cpu = real_hard_smp_processor_id();
 
 	if (mp->portid == this_cpu) {
 		__asm__ __volatile__("ldxa	[%1] %2, %0"
@@ -255,7 +258,8 @@ static u64 read_mcreg(struct mctrl_info *mp, unsigned long offset)
 				     : "r" (mp->regs + offset),
 				       "i" (ASI_PHYS_BYPASS_EC_E));
 	}
-	put_cpu();
+
+	preempt_enable();
 
 	return ret;
 }
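
As a rough illustration of why the logical cpu number is the wrong
thing to compare against mp->portid, here is a minimal user-space
sketch in C.  It is a model under stated assumptions, not the kernel
code: the names mctrl_model, up_logical_cpu_id() and
pick_access_path() are hypothetical, and the choice of physical cpu 3
as the boot cpu is made up for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for struct mctrl_info: only the port id matters here. */
struct mctrl_model {
	unsigned long portid;	/* physical id of the cpu that owns this controller */
};

/* The two register access paths that read_mcreg() chooses between. */
enum access_path {
	ACCESS_LOCAL_ASI,	/* controller belongs to the cpu we are running on */
	ACCESS_PHYS_BYPASS	/* controller belongs to some other cpu */
};

/* Models get_cpu() on a non-SMP build: the logical cpu id is always 0. */
unsigned long up_logical_cpu_id(void)
{
	return 0UL;
}

/* Same comparison as in read_mcreg(), with the cpu id passed in explicitly. */
enum access_path pick_access_path(const struct mctrl_model *mp,
				  unsigned long this_cpu)
{
	assert(mp != 0);
	return mp->portid == this_cpu ? ACCESS_LOCAL_ASI : ACCESS_PHYS_BYPASS;
}
```

If the kernel happens to be running on physical cpu 3 (an assumption
of this sketch), passing up_logical_cpu_id() classifies that cpu's own
controller as remote, while passing the physical id 3 classifies it as
local.  That mismatch is what the switch to
real_hard_smp_processor_id() is meant to avoid.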


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (16 preceding siblings ...)
  (?)
@ 2007-12-08  0:14 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-08  0:14 UTC (permalink / raw)
  To: sparclinux


David Miller wrote:
> From: Bernd Zeimetz <bernd@bzed.de>
> Date: Thu, 06 Dec 2007 13:09:18 +0100
> 
>> ERROR(0): Cheetah error trap taken afsr[0000100000000000]
>> afar[0000040001c00000] TL1(0)
>> ERROR(0): TPC[4351dc] TNPC[4351e0] O7[4353b4] TSTATE[80001606]
>> ERROR(0): TPC<interpret_one_decode_reg+0x0/0xfc>
>> ERROR(0): M_SYND(0),  E_SYND(0)
> 
> Please try this patch:
[...]

titan:~# uname -a
Linux titan 2.6.23.9+davem-nonsmp #1 Fri Dec 7 10:02:01 UTC 2007 sparc64
GNU/Linux
titan:~# cat /proc/cpuinfo
cpu             : TI UltraSparc III (Cheetah)
fpu             : UltraSparc III integrated FPU
prom            : OBP 4.22.34 2007/07/23 13:01
type            : sun4u
ncpus probed    : 4
ncpus active    : 1
D$ parity tl1   : 0
I$ parity tl1   : 0
Cpu0ClkTck      : 000000002cb41780
MMU Type        : Cheetah
titan:~#

works well, thanks for fixing!

-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (17 preceding siblings ...)
  (?)
@ 2007-12-09  8:38 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-09  8:38 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Sat, 08 Dec 2007 01:14:46 +0100

> works well, thanks for fixing!

Thanks a lot for testing.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (18 preceding siblings ...)
  (?)
@ 2007-12-10  9:16 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-10  9:16 UTC (permalink / raw)
  To: sparclinux

David Miller wrote:
> From: Bernd Zeimetz <bernd@bzed.de>
> Date: Sat, 08 Dec 2007 01:14:46 +0100
> 
>> works well, thanks for fixing!
> 
> Thanks a lot for testing.


You're welcome.
Are you going to send the patch for 2.6.23, too?

Also I've tried to crash the machine while running the non-SMP kernel -
but it is still running fine.


-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (19 preceding siblings ...)
  (?)
@ 2007-12-10  9:18 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-10  9:18 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Mon, 10 Dec 2007 10:16:25 +0100

> Are you going to send the patch for 2.6.23, too?

I'll push it soon to the -stable folks.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (20 preceding siblings ...)
  (?)
@ 2007-12-11 14:19 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-11 14:19 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Fri, 16 Nov 2007 22:17:07 +0100

> the v880 (4x US III) here was hit by a stuck process again, after
> running fine for some time now. But the machine didn't freeze, one CPU
> was running at 100%, but otherwise the machine was responsive.

In your dump all of the cpus seem to be alive and healthy, and
thus able to receive cpu messages, yet we are stuck on one cpu
in cheetah_xcall_deliver().

I suspect that the code in cheetah_xcall_deliver() can get into an
endless loop because of the way it interprets the dispatch status
register.

For example, if there are stray BUSY or NACK bits set in that register
for processors we are not trying to send a message to (for example,
from a previous message send) the logic can clear all the bits in the
cpu mask and then endlessly cycle in that function because no new
messages are sent and therefore no forward progress is made.  The
cycle repeats forever.

It should be easily fixed using the patch below.  It records which bits
we should actually be concerned about, and only tests those specific
bits in the dispatch status register.

Could you please give this patch a test?

Thanks.

diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index 894b506..c399449 100644
--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -476,7 +476,7 @@ static inline void spitfire_xcall_deliver(u64 data0, u64 data1, u64 data2, cpuma
  */
 static void cheetah_xcall_deliver(u64 data0, u64 data1, u64 data2, cpumask_t mask)
 {
-	u64 pstate, ver;
+	u64 pstate, ver, busy_mask;
 	int nack_busy_id, is_jbus, need_more;
 
 	if (cpus_empty(mask))
@@ -508,14 +508,20 @@ retry:
 			       "i" (ASI_INTR_W));
 
 	nack_busy_id = 0;
+	busy_mask = 0;
 	{
 		int i;
 
 		for_each_cpu_mask(i, mask) {
 			u64 target = (i << 14) | 0x70;
 
-			if (!is_jbus)
+			if (is_jbus) {
+				busy_mask |= (0x1UL << (i * 2));
+			} else {
 				target |= (nack_busy_id << 24);
+				busy_mask |= (0x1UL <<
+					      (nack_busy_id * 2));
+			}
 			__asm__ __volatile__(
 				"stxa	%%g0, [%0] %1\n\t"
 				"membar	#Sync\n\t"
@@ -531,15 +537,16 @@ retry:
 
 	/* Now, poll for completion. */
 	{
-		u64 dispatch_stat;
+		u64 dispatch_stat, nack_mask;
 		long stuck;
 
 		stuck = 100000 * nack_busy_id;
+		nack_mask = busy_mask << 1;
 		do {
 			__asm__ __volatile__("ldxa	[%%g0] %1, %0"
 					     : "=r" (dispatch_stat)
 					     : "i" (ASI_INTR_DISPATCH_STAT));
-			if (dispatch_stat == 0UL) {
+			if (!(dispatch_stat & (busy_mask | nack_mask))) {
 				__asm__ __volatile__("wrpr %0, 0x0, %%pstate"
 						     : : "r" (pstate));
 				if (unlikely(need_more)) {
@@ -556,12 +563,12 @@ retry:
 			}
 			if (!--stuck)
 				break;
-		} while (dispatch_stat & 0x5555555555555555UL);
+		} while (dispatch_stat & busy_mask);
 
 		__asm__ __volatile__("wrpr %0, 0x0, %%pstate"
 				     : : "r" (pstate));
 
-		if ((dispatch_stat & ~(0x5555555555555555UL)) == 0) {
+		if (dispatch_stat & busy_mask) {
 			/* Busy bits will not clear, continue instead
 			 * of freezing up on this cpu.
 			 */
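
The mask bookkeeping in the patch can be tried out in isolation with a
small user-space sketch in C.  This is only a model: build_busy_mask(),
dispatch_done() and dispatch_done_old() are hypothetical helper names,
and the bit layout (BUSY at bit 2*n, NACK at bit 2*n+1) is taken from
the shifts used in the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the BUSY bits we actually care about.  On JBUS the bit pair is
 * indexed by the target cpu id; otherwise it is indexed by the dispatch
 * slot, which simply counts up for every cpu we send to.
 */
uint64_t build_busy_mask(const unsigned *cpus, size_t ncpus, int is_jbus)
{
	uint64_t busy_mask = 0;
	unsigned slot = 0;
	size_t k;

	for (k = 0; k < ncpus; k++) {
		unsigned pos = is_jbus ? cpus[k] : slot;

		assert(pos < 32);	/* 32 BUSY/NACK pairs fit in 64 bits */
		busy_mask |= UINT64_C(1) << (pos * 2);
		slot++;
	}
	return busy_mask;
}

/* Patched test: done when none of OUR busy or nack bits are set. */
int dispatch_done(uint64_t dispatch_stat, uint64_t busy_mask)
{
	uint64_t nack_mask = busy_mask << 1;

	return !(dispatch_stat & (busy_mask | nack_mask));
}

/* Old test: done only when the whole register reads as zero. */
int dispatch_done_old(uint64_t dispatch_stat)
{
	return dispatch_stat == 0;
}
```

With two targets in slots 0 and 1 the busy mask is 0x5.  A stray NACK
bit left over in some other slot, for example 0x800, makes the old test
report "not done" forever, while the patched test ignores it because it
lies outside busy_mask and nack_mask.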


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (21 preceding siblings ...)
  (?)
@ 2007-12-12 16:05 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-12 16:05 UTC (permalink / raw)
  To: sparclinux


> It should be easily fixed using the patch below.  It records which bits
> we should actually be concerned about, and only tests those specific
> bits in the dispatch status register.
> 
> Could you please give this patch a test?

Tested - the patch seems to fix the problem as the machine is still
alive and working well after several hours of running the buggy aptitude
-u in a loop.

I'll leave the kernel running and make sure the machine gets some more
users and load during the next days.


Thanks for the fix,

Bernd

-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (22 preceding siblings ...)
  (?)
@ 2007-12-12 16:23 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-12 16:23 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Wed, 12 Dec 2007 17:05:55 +0100

> Tested - the patch seems to fix the problem as the machine is still
> alive and working well after several hours of running the buggy aptitude
> -u in a loop.
> 
> I'll leave the kernel running and make sure the machine gets some more
> users and load during the next days.

Thanks for testing, let me know if any more issues trigger.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (23 preceding siblings ...)
  (?)
@ 2007-12-13 20:54 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-13 20:54 UTC (permalink / raw)
  To: sparclinux

> Thanks for testing, let me know if any more issues trigger.

The machine had some random processes (ssh, ping and aptitude) being
stuck today, but they went away after hitting them with kill -9. They
also didn't eat CPU time - they were just doing nothing.
Unfortunately I didn't have the time for a closer look, I'll try to
gather some more information the next time it happens.

-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (24 preceding siblings ...)
  (?)
@ 2007-12-16 22:41 ` Bernd Zeimetz
  -1 siblings, 0 replies; 38+ messages in thread
From: Bernd Zeimetz @ 2007-12-16 22:41 UTC (permalink / raw)
  To: sparclinux


>> I'll leave the kernel running and make sure the machine gets some more
>> users and load during the next days.
> 
> Thanks for testing, let me know if any more issues trigger.

One problem I was pointed to was the build failure of erlang. Here the
created erlc binary segfaults with a bus error.

- this only happens on US III machines, works fine on US II.

- on lebrun it doesn't happen on the first call of erlc, but after
several successful runs of it - see
http://buildd.debian.org/fetch.cgi?&pkg=erlang&ver=1%3A11.b.5dfsg-11&arch=sparc&stamp=1197012623&file=log

- on our v880 here (which is still running the kernel with your test
patch) erlc segfaults instantly. A strace shows that it is stuck at a
well known place - pretty similar to the segfault in aptitude which
successfully shot the machine to death before your patch(es) was(were)
applied:

[pid  1224] clone(Process 1228 attached
child_stack=0xf7951480,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0xf7951bd8, tls=0xf7951b90, child_tidptr=0xf7951bd8) = 1228
[pid  1224] SYS_300(0xf7951be0, 0xc, 0, 0, 0xf7951df4) = 0
[pid  1224] futex(0xff993338, 0x80 /* FUTEX_??? */, 2

... there it hangs.


I guess you should be able to reproduce this on your US III machine.
dget -x \
ftp://debian.netcologne.de/debian/pool/main/e/erlang/erlang_11.b.5dfsg-11.dsc
cd erlang-11.b.5dfsg
dpkg-buildpackage -rfakeroot
(you'll probably have to install some build-deps...)
when erlc segfaults, change into the directory and set

ERL_TOP=/home/bzed/erlang-11.b.5dfsg
PATH=/home/bzed/erlang-11.b.5dfsg/bootstrap/bin:${PATH}

before retrying to run erlc.


Let me know if you need more information or want me to test something.


-- 
Bernd Zeimetz
<bernd@bzed.de>                         <http://bzed.de/>


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (25 preceding siblings ...)
  (?)
@ 2007-12-16 22:48 ` Josip Rodin
  -1 siblings, 0 replies; 38+ messages in thread
From: Josip Rodin @ 2007-12-16 22:48 UTC (permalink / raw)
  To: sparclinux

On Sun, Dec 16, 2007 at 11:41:19PM +0100, Bernd Zeimetz wrote:
> >> I'll leave the kernel running and make sure the machine gets some more
> >> users and load during the next days.
> > 
> > Thanks for testing, let me know if any more issues trigger.
> 
> One problem I was pointed to was the build failure of erlang. Here the
> created erlc binary segfaults with a bus error.
> 
> - this only happens on US III machines, works fine on US II.
> 
> - on lebrun it doesn't happen on the first call of erlc, but after
> several successful runs of it - see
> http://buildd.debian.org/fetch.cgi?&pkg=erlang&ver=1%3A11.b.5dfsg-11&arch=sparc&stamp=1197012623&file=log

BTW, we've had lebrun crash three times since November 12th, the last time
a few hours ago. The first time I just cycled it via RSC, the second time
someone looked at it and the console was blank (!) after which we cycled it.
I'm going over there tomorrow morning to see what's on the console now.

-- 
     2. That which causes joy or happiness.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (26 preceding siblings ...)
  (?)
@ 2007-12-16 23:30 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-16 23:30 UTC (permalink / raw)
  To: sparclinux

From: Bernd Zeimetz <bernd@bzed.de>
Date: Sun, 16 Dec 2007 23:41:19 +0100

> One problem I was pointed to was the build failure of erlang. Here the
> created erlc binary segfaults with a bus error.

Find out where the bus error is occurring.

The futex() line from the strace isn't very interesting,
it just shows that one thread is stuck because the other
one died with a lock held.

I really don't have time to do all the monkey work to track down this
critical debugging information and you have the setup to trigger it
already, so if you could please do this I'd really appreciate it.

I've already dumped a ton of time into fixing every single
problem you've reported, it's frustrating to be dumped on
with even more work given that :-(

I hope you understand.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (27 preceding siblings ...)
  (?)
@ 2007-12-17  9:40 ` Josip Rodin
  -1 siblings, 0 replies; 38+ messages in thread
From: Josip Rodin @ 2007-12-17  9:40 UTC (permalink / raw)
  To: sparclinux

On Sun, Dec 16, 2007 at 11:48:31PM +0100, Josip Rodin wrote:
> > One problem I was pointed to was the build failure of erlang. Here the
> > created erlc binary segfaults with a bus error.
> > 
> > - this only happens on US III machines, works fine on US II.
> > 
> > - on lebrun it doesn't happen on the first call of erlc, but after
> > several successful runs of it - see
> > http://buildd.debian.org/fetch.cgi?&pkg=erlang&ver=1%3A11.b.5dfsg-11&arch=sparc&stamp=1197012623&file=log
> 
> BTW, we've had lebrun crash three times since November 12th, the last time
> a few hours ago. The first time I just cycled it via RSC, the second time
> someone looked at it and the console was blank (!) after which we cycled it.
> I'm going over there tomorrow morning to see what's on the console now.

The machine was stuck with a white blank screen on the monitor, and a
dysfunctional keyboard (USB). It's a standard keyboard that came with the
same machine, and otherwise works fine. It's completely dead - not only
normal key combinations (including Alt+SysRq), but pressing Caps Lock
doesn't turn on its LED, and Stop+A doesn't work either.

The machine doesn't have any ports for the old-style Sun keyboards, so I'm
out of options...

-- 
     2. That which causes joy or happiness.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (28 preceding siblings ...)
  (?)
@ 2007-12-17  9:57 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-17  9:57 UTC (permalink / raw)
  To: sparclinux

From: Josip Rodin <joy@entuzijast.net>
Date: Mon, 17 Dec 2007 10:40:05 +0100

> The machine was stuck with a white blank screen on the monitor, and a
> dysfunctional keyboard (USB). It's a standard keyboard that came with the
> same machine, and otherwise works fine. It's completely dead - not only
> normal key combinations (including Alt+SysRq), but pressing Caps Lock
> doesn't turn on its LED, and Stop+A doesn't work either.
> 
> The machine doesn't have any ports for the old-style Sun keyboards, so I'm
> out of options...

You're more likely to get a good crash dump on the serial
console, if that is something you can use.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (29 preceding siblings ...)
  (?)
@ 2007-12-17 10:10 ` Josip Rodin
  -1 siblings, 0 replies; 38+ messages in thread
From: Josip Rodin @ 2007-12-17 10:10 UTC (permalink / raw)
  To: sparclinux

On Mon, Dec 17, 2007 at 01:57:55AM -0800, David Miller wrote:
> From: Josip Rodin <joy@entuzijast.net>
> Date: Mon, 17 Dec 2007 10:40:05 +0100
> 
> > The machine was stuck with a white blank screen on the monitor, and a
> > dysfunctional keyboard (USB). It's a standard keyboard that came with the
> > same machine, and otherwise works fine. It's completely dead - not only
> > normal key combinations (including Alt+SysRq), but pressing Caps Lock
> > doesn't turn on its LED, and Stop+A doesn't work either.
> > 
> > The machine doesn't have any ports for the old-style Sun keyboards, so I'm
> > out of options...
> 
> You're more likely to get a good crash dump on the serial
> console, if that is something you can use.

I tried setting the console devices to rsc-console in PROM, but the RSC
never sees anything other than the initial kernel output, and doesn't
relay any input to the machine. Have you tried that on your 280R?

-- 
     2. That which causes joy or happiness.


* Re: Fix for sparc64 cpu hangs.
  2007-11-07  4:34 ` David Miller
                   ` (30 preceding siblings ...)
  (?)
@ 2007-12-17 10:21 ` David Miller
  -1 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2007-12-17 10:21 UTC (permalink / raw)
  To: sparclinux

From: Josip Rodin <joy@entuzijast.net>
Date: Mon, 17 Dec 2007 11:10:59 +0100

> I tried setting the console devices to rsc-console in PROM, but the RSC
> never sees anything other than the initial kernel output, and doesn't
> relay any input to the machine. Have you tried that on your 280R?

I simply use the main serial lines, not the RSC.

