* Re: RED State Exception on Ultra 1, 2.6.9-rc2
2004-09-21 4:59 RED State Exception on Ultra 1, 2.6.9-rc2 David Dillow
@ 2004-09-21 6:31 ` David S. Miller
2004-09-21 13:47 ` David Dillow
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: David S. Miller @ 2004-09-21 6:31 UTC
To: sparclinux
On 21 Sep 2004 00:59:51 -0400
David Dillow <dave@thedillows.org> wrote:
> What's a RED State exception?
The CPU has internal storage for up to 5 levels of
trap information; this is called the trap stack. When
you take a trap while already at the 4th level, the CPU
enters RED state and traps to a special vector in a trap
table at a fixed pre-determined address, with TLBs and caches
disabled, to process the trap.
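As a rough illustration of the level counting described above, here is a minimal Python sketch. It assumes a maximum trap level of 5, matching the five trap-stack entries in the dump in this thread. The names MAXTL, take_trap and nest are made up for this example and model nothing beyond the counting itself.

```python
MAXTL = 5  # assumption: five trap-stack entries, as in the dump in this thread


def take_trap(tl):
    """Return (new_tl, red_state) after a trap taken while at trap level tl."""
    if tl >= MAXTL:
        # No trap-stack entry is left to save state in.
        raise RuntimeError("trap at TL == MAXTL: no trap-stack entry left")
    new_tl = tl + 1
    # A trap taken at TL == MAXTL - 1 fills the last entry and enters RED state.
    return new_tl, new_tl == MAXTL


def nest(count):
    """Take `count` nested traps starting from TL 0."""
    tl, red = 0, False
    for _ in range(count):
        tl, red = take_trap(tl)
    return tl, red
```

Four nested traps leave the model at level 4 without RED state; the fifth one fills the last entry and sets it.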
When that happens, what usually occurs on Sun systems
is that the cpu jumps into Sun firmware code which dumps
out the trap stack like you see here.
> TL=0000.0000.0000.0005 TT=0000.0000.0000.0080
> TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9504
> TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
> TPC=0000.0000.0040.d000 TnPC=0000.0000.0040.d004 TSTATE=0000.0000.8000.9504
> TL=0000.0000.0000.0003 TT=0000.0000.0000.0080
> TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9502
> TL=0000.0000.0000.0002 TT=0000.0000.0000.0010
> TPC=0000.0000.0040.8c00 TnPC=0000.0000.0040.8c04 TSTATE=0000.0000.8008.9402
> TL=0000.0000.0000.0001 TT=0000.0000.0000.0060
> TPC=0000.0000.0042.1b68 TnPC=0000.0000.0042.1b6c TSTATE=0000.0000.8000.9602
Can you match up these "TPC" and "TnPC" values to symbols in
the kernel running at the time of this crash?
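One way to do that matching is to look each address up in the System.map that was built with the running kernel. The sketch below is only an illustration: it assumes the usual "address type name" line format, and the sample map contents (sym_a, sym_b, sym_c and their addresses) are invented for the example, not taken from any real kernel.

```python
import bisect


def load_system_map(text):
    """Parse System.map-style lines ('<hex addr> <type> <name>') into a sorted list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) of the last symbol starting at or below addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, addr - start


def dotted_hex(value):
    """Convert the firmware's dotted form, e.g. '0000.0000.0040.ec98', to an int."""
    return int(value.replace(".", ""), 16)


# Hypothetical map contents, made up for this example only.
SAMPLE_MAP = """\
0000000000404000 T sym_a
000000000040e000 T sym_b
0000000000421000 t sym_c
"""
```

With a real System.map read from disk in place of SAMPLE_MAP, resolve(symbols, dotted_hex(tpc)) gives the enclosing symbol and the offset into it for each TPC and TnPC value.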
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David Dillow @ 2004-09-21 13:47 UTC
To: sparclinux
On Tue, 2004-09-21 at 02:31, David S. Miller wrote:
> On 21 Sep 2004 00:59:51 -0400
> David Dillow <dave@thedillows.org> wrote:
>
> > What's a RED State exception?
[explanation snipped; thanks! ]
> > TL=0000.0000.0000.0005 TT=0000.0000.0000.0080
> > TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9504
TPC: etrap_irq
TnPC: etrap_irq
> > TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
> > TPC=0000.0000.0040.d000 TnPC=0000.0000.0040.d004 TSTATE=0000.0000.8000.9504
TPC: tl1_s0n
TnPC: tl1_s0n
> > TL=0000.0000.0000.0003 TT=0000.0000.0000.0080
> > TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9502
TPC: etrap_irq
TnPC: etrap_irq
> > TL=0000.0000.0000.0002 TT=0000.0000.0000.0010
> > TPC=0000.0000.0040.8c00 TnPC=0000.0000.0040.8c04 TSTATE=0000.0000.8008.9402
TPC: tl0_ivec
TnPC: tl0_ivec
> > TL=0000.0000.0000.0001 TT=0000.0000.0000.0060
> > TPC=0000.0000.0042.1b68 TnPC=0000.0000.0042.1b6c TSTATE=0000.0000.8000.9602
TPC: free_streaming_cluster
TnPC: free_streaming_cluster
Doh! Did I mess something up in my prior patch allowing larger than 1MB
sbus_map_sg()'s?
If that sounds like a reasonable explanation to you, could you describe
the TT and TSTATE parameters, or point me to some documentation?
Thanks,
Dave
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: Tim Walberg @ 2004-09-21 14:50 UTC
To: sparclinux
Not the simplest answer, but the TT and TSTATE
stuff (along with probably much more than you
would ever want to know) are pretty well documented
in the appropriate SPARC architecture manuals
freely available at
http://www.sparc.org/resource.htm
and/or
http://www.sun.com/processors/documentation.html
On 09/21/2004 09:47 -0400, David Dillow wrote:
>> On Tue, 2004-09-21 at 02:31, David S. Miller wrote:
>> > On 21 Sep 2004 00:59:51 -0400
>> > David Dillow <dave@thedillows.org> wrote:
>> >
>> > > What's a RED State exception?
>>
>> [explanation snipped; thanks! ]
>>
>> > > TL=0000.0000.0000.0005 TT=0000.0000.0000.0080
>> > > TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9504
>> TPC: etrap_irq
>> TnPC: etrap_irq
>> > > TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
>> > > TPC=0000.0000.0040.d000 TnPC=0000.0000.0040.d004 TSTATE=0000.0000.8000.9504
>> TPC: tl1_s0n
>> TnPC: tl1_s0n
>> > > TL=0000.0000.0000.0003 TT=0000.0000.0000.0080
>> > > TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9502
>> TPC: etrap_irq
>> TnPC: etrap_irq
>> > > TL=0000.0000.0000.0002 TT=0000.0000.0000.0010
>> > > TPC=0000.0000.0040.8c00 TnPC=0000.0000.0040.8c04 TSTATE=0000.0000.8008.9402
>> TPC: tl0_ivec
>> TnPC: tl0_ivec
>> > > TL=0000.0000.0000.0001 TT=0000.0000.0000.0060
>> > > TPC=0000.0000.0042.1b68 TnPC=0000.0000.0042.1b6c TSTATE=0000.0000.8000.9602
>> TPC: free_streaming_cluster
>> TnPC: free_streaming_cluster
>>
>> Doh! Did I mess something up in my prior patch allowing larger than 1MB
>> sbus_map_sg()'s?
>>
>> If that sounds like a reasonable explanation to you, could you describe
>> the TT and TSTATE parameters, or point me to some documentation?
>>
>> Thanks,
>> Dave
End of included message
--
+--------------------------+------------------------------+
| Tim Walberg | twalberg@mindspring.com |
| 830 Carriage Dr. | www.mindspring.com/~twalberg |
| Algonquin, IL 60102 | |
+--------------------------+------------------------------+
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David S. Miller @ 2004-09-21 19:48 UTC
To: sparclinux
On 21 Sep 2004 09:47:02 -0400
David Dillow <dave@thedillows.org> wrote:
> > > TL=0000.0000.0000.0005 TT=0000.0000.0000.0080
> > > TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9504
> TPC: etrap_irq
> TnPC: etrap_irq
> > > TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
> > > TPC=0000.0000.0040.d000 TnPC=0000.0000.0040.d004 TSTATE=0000.0000.8000.9504
> TPC: tl1_s0n
> TnPC: tl1_s0n
> > > TL=0000.0000.0000.0003 TT=0000.0000.0000.0080
> > > TPC=0000.0000.0040.ec98 TnPC=0000.0000.0040.ec9c TSTATE=0000.0000.8000.9502
> TPC: etrap_irq
> TnPC: etrap_irq
> > > TL=0000.0000.0000.0002 TT=0000.0000.0000.0010
> > > TPC=0000.0000.0040.8c00 TnPC=0000.0000.0040.8c04 TSTATE=0000.0000.8008.9402
> TPC: tl0_ivec
> TnPC: tl0_ivec
> > > TL=0000.0000.0000.0001 TT=0000.0000.0000.0060
> > > TPC=0000.0000.0042.1b68 TnPC=0000.0000.0042.1b6c TSTATE=0000.0000.8000.9602
> TPC: free_streaming_cluster
> TnPC: free_streaming_cluster
>
> Doh! Did I mess something up in my prior patch allowing larger than 1MB
> sbus_map_sg()'s?
I don't think so, based upon this trace. TT means "Trap Type": it is the
number of the trap the cpu took at each trap level. What the trap type
numbers mean is described in the UltraSPARC programmer's manual.
TSTATE's layout is defined by macros in include/asm-sparc64/pstate.h.
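For reference, the three TT values in this dump can be named with a small lookup, and the trap table entry each one vectors to can be computed from the table base. This is a sketch under stated assumptions: the trap type names follow the SPARC V9 numbering as I understand it, and the base 0x408000 is not given anywhere in the thread but inferred from the addresses quoted in it (tl0_ivec at 0x408c00 for trap type 0x60). The function names are invented for the example.

```python
TRAP_TABLE_BASE = 0x408000  # assumption: inferred from tl0_ivec at 0x408c00


def trap_type_name(tt):
    """Name the few trap types that appear in this dump; others are left numeric."""
    if tt == 0x010:
        return "illegal_instruction"
    if tt == 0x060:
        return "interrupt_vector"
    if 0x080 <= tt <= 0x09F:
        # Window spill traps come in groups of four table entries.
        return f"spill_{(tt - 0x080) // 4}_normal"
    return f"unknown (0x{tt:03x})"


def vector_address(base, tl_before, tt):
    """Trap table entry address: 32 bytes per entry, second half used when TL > 0."""
    return base + (0x4000 if tl_before > 0 else 0) + tt * 32
```

With that base, trap type 0x60 taken from TL 0 lands on 0x408c00, and trap type 0x80 taken from TL > 0 lands on 0x40d000, which are the two trap table addresses that show up as TPC values in the dump.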
Anyways, in free_streaming_cluster() we took a vectored interrupt (trap
type 0x60). In tl0_ivec we took an illegal instruction trap, which is
very odd because the instruction at 0x408c00:tl0_ivec is a branch.
It looks like something clobbered the instruction there, that is my best
guess. If you get one of these again you can, at the OBP prompt, say:
ok 0x408c00 dis
and see if the instruction has been corrupted there.
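If only the raw instruction word is available (for example from a memory dump rather than the firmware disassembler), the top bits are enough to tell whether it is still a branch. The following is a minimal sketch assuming the SPARC V9 format-2 encoding, where op is bits 31:30 and op2 is bits 24:22; the function names are made up for this example.

```python
FORMAT2_NAMES = {
    0: "ILLTRAP",
    1: "BPcc",
    2: "Bicc",
    3: "BPr",
    4: "SETHI",
    5: "FBPfcc",
    6: "FBfcc",
}

BRANCH_KINDS = {"BPcc", "Bicc", "BPr", "FBPfcc", "FBfcc"}


def classify(word):
    """Classify a 32-bit SPARC instruction word by its op/op2 fields."""
    op = (word >> 30) & 0x3
    if op != 0:
        return "not format 2"  # call, arithmetic, or load/store
    op2 = (word >> 22) & 0x7
    return FORMAT2_NAMES.get(op2, "reserved")


def is_branch(word):
    """True if the word decodes as one of the branch instruction groups."""
    return classify(word) in BRANCH_KINDS
```

An all-zero word decodes as ILLTRAP here, so memory that had been zeroed or overwritten with small integers would be one plausible way to get an illegal instruction trap where a branch used to be.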
It may be your DMA mapping changes, so doing the following wouldn't hurt:
1) try running with those DMA mapping patches reverted
2) audit those patches for possible errors
Thanks.
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David Dillow @ 2004-09-22 3:25 UTC
To: sparclinux
On Tue, 2004-09-21 at 15:48, David S. Miller wrote:
> TSTATE's layout is defined by macros in include/asm-sparc64/pstate.h
Thanks, but my quick decode of that info didn't turn up anything very
interesting: the current window is 2 or 4, and the address space is always
0x80. The other bits are Privilege, Intr enable, and some globals -- nothing
that means much to me, or looks promising.
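A decode along those lines can be sketched as follows. The field positions and flag masks are written from memory of the SPARC V9 TSTATE and UltraSPARC PSTATE layouts (CCR in bits 39:32, ASI in 31:24, PSTATE in 19:8, CWP in 4:0), so treat them as assumptions and check them against pstate.h before relying on the output. The function name is invented for the example.

```python
# (mask, name) pairs for single-bit PSTATE flags; values are assumptions
# to be checked against include/asm-sparc64/pstate.h.
PSTATE_FLAGS = [
    (0x800, "IG"), (0x400, "MG"), (0x200, "CLE"), (0x100, "TLE"),
    (0x020, "RED"), (0x010, "PEF"), (0x008, "AM"),
    (0x004, "PRIV"), (0x002, "IE"), (0x001, "AG"),
]


def decode_tstate(dotted):
    """Split a dotted TSTATE value from the dump into its fields."""
    value = int(dotted.replace(".", ""), 16)
    pstate = (value >> 8) & 0xFFF
    return {
        "ccr": (value >> 32) & 0xFF,
        "asi": (value >> 24) & 0xFF,
        "pstate": pstate,
        "mm": (pstate >> 6) & 0x3,  # two-bit memory model field
        "flags": [name for mask, name in PSTATE_FLAGS if pstate & mask],
        "cwp": value & 0x1F,
    }
```

Run over the five TSTATE values in the dump, this gives ASI 0x80 throughout, with PRIV and PEF set in every entry and one of AG, IE or IG varying between them.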
> Anyways, in free_streaming_cluster() we took a vectored interrupt (trap
> type 0x60). In tl0_ivec we took an illegal instruction trap, which is
> very odd because the instruction at 0x408c00:tl0_ivec is a branch.
Should I track the vectored interrupt? What's it used for -- it looks
like it's used for cross-CPU calls, but this is a single CPU. Are there
other uses?
I ask, because we took the first trap on "stx %g0, [%o2]", which
corresponds with this line in free_streaming_cluster():
iopte_val(*iopte) = 0UL;
Could iopte point somewhere nasty?
Does this seem like a possible pointer to a bug, or just a red herring?
> It looks like something clobbered the instruction there, that is my best
> guess. If you get one of these again you can, at the OBP prompt, say:
>
> ok 0x408c00 dis
>
> and see if the instruction has been corrupted there.
If I can reproduce it, I will check this.
Any suggestions as to reproducing it? I was doing an "apt-get upgrade"
when it occurred, somewhere around restarting /sbin/init. I'm guessing
just load the CPU and disk, since that's what it was doing mostly.
> It may be your DMA mapping changes, so doing the following wouldn't hurt:
>
> 1) try running with those DMA mapping patches reverted
> 2) audit those patches for possible errors
And 3) Audit my lpvi driver for memory problems -- it had been in use
earlier in this boot.
I've seen some slab corruption, but I'm pretty sure that was from some
error recovery code in the lpvi driver that I've since fixed.
If I can cause it to reappear with the driver installed (and used a
bit), then I will also try without it.
Thanks for your help,
Dave
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David S. Miller @ 2004-09-22 3:35 UTC
To: sparclinux
On 21 Sep 2004 23:25:48 -0400
David Dillow <dave@thedillows.org> wrote:
> > Anyways, in free_streaming_cluster() we took a vectored interrupt (trap
> > type 0x60). In tl0_ivec we took an illegal instruction trap, which is
> > very odd because the instruction at 0x408c00:tl0_ivec is a branch.
>
> Should I track the vectored interrupt? What's it used for -- it looks
> like it's used for cross-CPU calls, but this is a single CPU. Are there
> other uses?
>
> I ask, because we took the first trap on "stx %g0, [%o2]", which
> corresponds with this line in free_streaming_cluster():
> iopte_val(*iopte) = 0UL;
>
> Could iopte point somewhere nasty?
> Does this seem like a possible pointer to a bug, or just a red herring?
Just a red herring. And the ivector is not going to be anything
special.
The interesting bit is the illegal instruction trap at the first
instruction of tl0_ivec, which should be a branch. That's where
things start to go downhill.
Something is writing garbage to that instruction, that is my current
guess.
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David Dillow @ 2004-10-08 16:51 UTC
To: sparclinux
On Tue, 2004-09-21 at 23:35, David S. Miller wrote:
> On 21 Sep 2004 23:25:48 -0400
> David Dillow <dave@thedillows.org> wrote:
>
> > > Anyways, in free_streaming_cluster() we took a vectored interrupt (trap
> > > type 0x60). In tl0_ivec we took an illegal instruction trap, which is
> > > very odd because the instruction at 0x408c00:tl0_ivec is a branch.
[snip]
> Something is writing garbage to that instruction, that is my current
> guess.
Ok, I've reproduced one more RED exception, and several silent lockups.
I got a RED exception with arch/sparc64/kernel/sbus.c v1.17, but
without loading my lpvi driver. It's been a while, but I believe I've
seen a silent lockup on this kernel as well.
I backed that patch out, and using sbus.c 1.16 I've gotten 3 or 4 silent
lockups.
So, I think there is a problem somewhere, and it's probably not the sbus
changes.
I'm using 2.6.9-rc2 built by gcc version 3.2.3 (Debian).
Any suggestions on things I can do to rule out flaky hardware, or other
useful steps before I start a binary search on where it broke? It can
take from 15 minutes to 5 or more hours to reproduce using a network
load and back-to-back "dbench 24" runs.
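The binary search itself can be sketched as below. It assumes the versions are ordered oldest to newest, that the newest one is known to fail, and that once a version fails every later one fails too. The is_bad callback is hypothetical: in practice it stands for "boot that kernel and run the load long enough to be reasonably sure", which with a 15 minute to 5 hour reproduction window means a "good" answer is only ever "did not fail yet".

```python
def first_bad(versions, is_bad):
    """Return the oldest failing version.

    Assumes versions are ordered oldest to newest, the last one fails,
    and failure is monotonic (every version after the first bad one is bad).
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid          # breakage is at mid or earlier
        else:
            lo = mid + 1      # breakage is after mid
    return versions[lo]
```

If the search returns the very first entry, the breakage predates the range tested and the search has to be widened to older versions.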
Thanks,
Dave
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David S. Miller @ 2004-10-11 4:09 UTC
To: sparclinux
On 08 Oct 2004 12:51:07 -0400
David Dillow <dave@thedillows.org> wrote:
> Any suggestions on things I can do to rule out flaky hardware, or other
> useful steps before I start a binary search on where it broke? It can
> take from 15 minutes to 5 or more hours to reproduce using a network
> load and back-to-back "dbench 24" runs.
David, I'll try to give you a hand with this when I get back
from Japan on the 18th.
* Re: RED State Exception on Ultra 1, 2.6.9-rc2
From: David Dillow @ 2004-10-29 6:14 UTC
To: sparclinux
On Sun, 2004-10-10 at 21:09 -0700, David S. Miller wrote:
> On 08 Oct 2004 12:51:07 -0400
> David Dillow <dave@thedillows.org> wrote:
>
> > Any suggestions on things I can do to rule out flaky hardware, or other
> > useful steps before I start a binary search on where it broke? It can
> > take from 15 minutes to 5 or more hours to reproduce using a network
> > load and back-to-back "dbench 24" runs.
>
> David, I'll try to give you a hand with this when I get back
> from Japan on the 18th.
While you were gone, and while the latest addition to my family has kept
me up nights, I've run some more testing...
A simple "while true; do dbench 48; done" will crash the system.
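Since a hard hang leaves no record of how long the box survived, a variant of that loop which logs each run before it starts can help compare kernels. This is only a sketch: the soak function and the log path are invented for the example, and it assumes the log lives on a filesystem that will still be readable after the crash.

```python
import os
import subprocess
import time


def soak(cmd, log_path, max_runs):
    """Run cmd repeatedly, logging each start and syncing the log to disk.

    Returns the number of runs that completed with exit status 0.
    """
    completed = 0
    with open(log_path, "a") as log:
        for run in range(1, max_runs + 1):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} run {run} starting\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the line survives a hang
            result = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            if result.returncode != 0:
                break
            completed += 1
    return completed
```

For the load described here the call would be something like soak(["dbench", "48"], "/var/tmp/soak.log", 1000); after a crash, the last line of the log shows which run was in progress and when it began.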
Repeatably. With the exact same RED exception dump -- well, it was
different one time.
I tried from 2.6.1 to 2.6.8.1, all compiled by the same compiler I'd
been using with 2.6.9-rc2, and they all produced RED exceptions, or
silent hangs.
I also tried the Debian "testing" kernel package of 2.6.8.1 (gcc 3.3.2,
I think -- would have to reboot to check). Same result.
I'm open to suggestions on what to try next.
Thanks!
--
David Dillow <dave@thedillows.org>