a really really weird crash on swarm

Linux MIPS Architecture development

* a really really weird crash on swarm
@ 2002-08-09 23:22 Jun Sun
  2002-08-10  0:07 ` Justin Carlson
  2002-08-11 16:51 ` Ralf Baechle
  0 siblings, 2 replies; 5+ messages in thread
From: Jun Sun @ 2002-08-09 23:22 UTC (permalink / raw)
  To: linux-mips


Call me crazy, but I have seen a crash like this.  As you can see, register
$16 ($s0) is loaded with one value, and on the next instruction it shows a
different value.  What the hell could possibly be going on?

This is with today's OSS tree 2.4 branch.

Jun

Unable to handle kernel paging request at virtual address ffffcbf4, epc == 801209d4, ra == 8011bc30
Oops in fault.c::do_page_fault, line 206:
$0 : 00000000 10001f00 80120998 00000000 00000001 00000000 00000000 802c71e0
$8 : 10001f00 00001000 8029bca4 8028e060 00000000 00000000 00000012 00000000
$16: ffffcbf4 00000000 00000000 00000001 00000002 8028e880 802c71e0 8feabe48
$24: 00000000 2ad78fd0                   8027c000 8027ddd8 ffff6e08 8011bc30
Hi : fffd4abb
Lo : 0000e717
epc  : 801209d4    Not tainted
Status: 10001f02
Cause : 00800008
Process swapper (pid: 0, stackpage=8027c000)
Stack:    813da000 ffffffff 8fb40e60 0000008a 813da26c 813da160 00000000
  8fb40e60 802c75e0 00000000 00000000 8011bc30 00000000 00000024 813da160
  00000000 8011ba58 00000000 00012cf7 0000012b 00000000 8028ec90 00000000
  8028ec80 fffffffe 00000000 10001f00 8fe8f1a0 8ffb8cc0 8011b300 00000000
  00000000 8010c5d0 8010c7e0 00000000 00000000 8ff9229c 43464531 8ff90bd8
  8010c7e0 ...
Call Trace:   [<8011bc30>] [<8011ba58>] [<8011b300>] [<8010c5d0>] [<8010c7e0>]
  [<8010c7e0>] [<80255870>] [<80255508>] [<80103240>] [<8010324c>] [<80100450>]
  [<80258b20>] [<80258b60>]

Code: 00000040  0010802a  2610cbf4 <c2030000> 1460fffe  3c038000  e2030000  1060fffb  0000000f
Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
  <0>Rebooting in 5 seconds..swarm_linux_exit called...passing control back to CFE
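
For reference, here is a minimal Python sketch of how the Status and Cause
values above decode.  The bit positions are the standard MIPS32/MIPS64 CP0
layout; that the SWARM's CPU follows it exactly is an assumption of this
sketch, and the helper names are made up for illustration.

```python
# Values copied from the oops above.
STATUS = 0x10001f02
CAUSE = 0x00800008

# Names for the first few ExcCode values in the MIPS32/64 Cause register.
EXC_NAMES = {0: "Int", 1: "Mod", 2: "TLBL", 3: "TLBS", 4: "AdEL", 5: "AdES"}


def decode_status(status):
    """Split out a few CP0 Status fields (hypothetical helper)."""
    return {
        "IE": status & 1,            # global interrupt enable
        "EXL": (status >> 1) & 1,    # exception level
        "ERL": (status >> 2) & 1,    # error level
        "IM": (status >> 8) & 0xff,  # interrupt mask bits
        "CU0": (status >> 28) & 1,   # coprocessor 0 usable
    }


def exc_code(cause):
    """Extract the ExcCode field (bits 6..2) from CP0 Cause."""
    return (cause >> 2) & 0x1f


if __name__ == "__main__":
    print(decode_status(STATUS))
    code = exc_code(CAUSE)
    print(code, EXC_NAMES.get(code, "other"))
```

Under that layout, ExcCode 2 is a TLB miss on a load or instruction fetch,
which fits the ll at epc faulting on the mapped address ffffcbf4.  IE is
clear and EXL is set, as expected once the exception has been taken.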


ffffffff80120998 <timer_bh>:
ffffffff80120998:       27bdffd0        addiu   $sp,$sp,-48
ffffffff8012099c:       afbf002c        sw      $ra,44($sp)
ffffffff801209a0:       afb20028        sw      $s2,40($sp)
ffffffff801209a4:       afb10024        sw      $s1,36($sp)
ffffffff801209a8:       afb00020        sw      $s0,32($sp)
ffffffff801209ac:       40016000        mfc0    $at,$12
ffffffff801209b0:       00000000        nop
ffffffff801209b4:       34210001        ori     $at,$at,0x1
ffffffff801209b8:       38210001        xori    $at,$at,0x1
ffffffff801209bc:       40816000        mtc0    $at,$12
ffffffff801209c0:       00000040        ssnop
ffffffff801209c4:       00000040        ssnop
ffffffff801209c8:       00000040        ssnop
ffffffff801209cc:       3c10802a        lui     $s0,0x802a
ffffffff801209d0:       2610cbf4        addiu   $s0,$s0,-13324
ffffffff801209d4:       c2030000        ll      $v1,0($s0)
ffffffff801209d8:       1460fffe        bnez    $v1,ffffffff801209d4 <timer_bh+0x3c>
ffffffff801209dc:       3c038000        lui     $v1,0x8000
ffffffff801209e0:       e2030000        sc      $v1,0($s0)
ffffffff801209e4:       1060fffb        beqz    $v1,ffffffff801209d4 <timer_bh+0x3c>
ffffffff801209e8:       0000000f        sync
ffffffff801209ec:       3c02802c        lui     $v0,0x802c
ffffffff801209f0:       8c427984        lw      $v0,31108($v0)
ffffffff801209f4:       3c03802d        lui     $v1,0x802d
ffffffff801209f8:       8c6389c8        lw      $v1,-30264($v1)
ffffffff801209fc:       00438823        subu    $s1,$v0,$v1
ffffffff80120a00:       12200005        beqz    $s1,ffffffff80120a18 <timer_bh+0x80>
ffffffff80120a04:       00711021        addu    $v0,$v1,$s1
ffffffff80120a08:       3c01802d        lui     $at,0x802d
ffffffff80120a0c:       ac2289c8        sw      $v0,-30264($at)
ffffffff80120a10:       0c048180        jal     ffffffff80120600 <update_wall_time>
ffffffff80120a14:       02202021        move    $a0,$s1
ffffffff80120a18:       0000000f        sync
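
One cross-check that can be done on the data above is to line up the words
in the oops "Code:" line with the objdump listing.  The Python sketch below
does that; the helper names are hypothetical, and it assumes the word in
angle brackets sits at epc, with its neighbours at consecutive 4-byte
offsets.

```python
EPC = 0x801209d4

# "Code:" line from the oops, with the terminal line wrap undone.
CODE_LINE = ("00000040  0010802a  2610cbf4 <c2030000> 1460fffe  "
             "3c038000  e2030000  1060fffb  0000000f")

# Instruction words from the objdump listing of timer_bh.
OBJDUMP = {
    0x801209c8: 0x00000040,
    0x801209cc: 0x3c10802a,
    0x801209d0: 0x2610cbf4,
    0x801209d4: 0xc2030000,
    0x801209d8: 0x1460fffe,
    0x801209dc: 0x3c038000,
    0x801209e0: 0xe2030000,
    0x801209e4: 0x1060fffb,
    0x801209e8: 0x0000000f,
}


def parse_code_line(line, epc):
    """Map each dumped word to its address; the <...> word is at epc."""
    tokens = line.split()
    at_epc = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return {epc + 4 * (i - at_epc): int(t.strip("<>"), 16)
            for i, t in enumerate(tokens)}


def mismatches(dumped, expected):
    """Return (address, dumped word, expected word) where the two differ."""
    return [(a, dumped[a], expected[a])
            for a in sorted(dumped)
            if a in expected and dumped[a] != expected[a]]


def fields(word):
    """Split a MIPS instruction word into its basic encoding fields."""
    return {"opcode": word >> 26, "rs": (word >> 21) & 0x1f,
            "rt": (word >> 16) & 0x1f, "rd": (word >> 11) & 0x1f,
            "funct": word & 0x3f}


if __name__ == "__main__":
    for addr, got, want in mismatches(parse_code_line(CODE_LINE, EPC), OBJDUMP):
        print(f"{addr:08x}: dumped {got:08x}, objdump {want:08x}")
        print("  dumped :", fields(got))
        print("  objdump:", fields(want))
```

If the dump was pasted accurately, the only word that differs is the one at
801209cc: the oops shows 0010802a where the listing has 3c10802a (the lui).
With the top byte zero the word no longer encodes a lui; opcode 0 with
function 0x2a is an slt.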


* Re: a really really weird crash on swarm
  2002-08-09 23:22 a really really weird crash on swarm Jun Sun
@ 2002-08-10  0:07 ` Justin Carlson
  2002-08-11 16:51 ` Ralf Baechle
  1 sibling, 0 replies; 5+ messages in thread
From: Justin Carlson @ 2002-08-10  0:07 UTC (permalink / raw)
  To: Jun Sun; +Cc: linux-mips

On Fri, 2002-08-09 at 16:22, Jun Sun wrote:
> 
> Call me crazy, but I have seen a crash like this.  As you can see, register
> $16 ($s0) is loaded with one value, and on the next instruction it shows a
> different value.  What the hell could possibly be going on?
> 

As presented, this seems exceedingly unlikely; even processor bugs don't
look quite like this.  You just don't lose state in such a blatant
manner.

I'd be more inclined to believe that we took an interrupt and somehow
the saved processor state was corrupted.  The positioning of the code
looks suspicious to me;  we're shortly into a bottom half, which means
if we just, say, unmasked an interrupt in the SCD, we could quite
possibly take the interrupt around then.

Unfortunately, unless you can reliably reproduce this crash, there's not
enough info there to really do more than speculate wildly.

-Justin


* Re: a really really weird crash on swarm
  2002-08-09 23:22 a really really weird crash on swarm Jun Sun
  2002-08-10  0:07 ` Justin Carlson
@ 2002-08-11 16:51 ` Ralf Baechle
  2002-08-19 12:57   ` Maciej W. Rozycki
  1 sibling, 1 reply; 5+ messages in thread
From: Ralf Baechle @ 2002-08-11 16:51 UTC (permalink / raw)
  To: Jun Sun; +Cc: linux-mips

On Fri, Aug 09, 2002 at 04:22:03PM -0700, Jun Sun wrote:

> Call me crazy, but I have seen a crash like this.  As you can see, register
> $16 ($s0) is loaded with one value, and on the next instruction it shows a
> different value.  What the hell could possibly be going on?
> 
> This is with today's OSS tree 2.4 branch.

Really odd because the register only lost the upper 16 bits; the lower 16
bits still have their expected value.
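
The arithmetic can be checked with a few lines of Python.  This is only a
sketch: the helper names are made up, and the register is treated as 32 bits
wide, as printed in the oops.

```python
def sext16(imm):
    """Sign-extend a 16-bit immediate, as addiu does."""
    return imm - 0x10000 if imm & 0x8000 else imm


def lui_addiu(hi, lo):
    """Value built by 'lui reg,hi' followed by 'addiu reg,reg,lo' (32-bit)."""
    return ((hi << 16) + sext16(lo)) & 0xffffffff


expected = lui_addiu(0x802a, 0xcbf4)  # lui $s0,0x802a; addiu $s0,$s0,-13324
observed = 0xffffcbf4                 # $16 as printed in the oops

if __name__ == "__main__":
    print(f"expected {expected:08x}")
    print(f"observed {observed:08x}")
    print(f"xor      {expected ^ observed:08x}")
```

The expected address is 8029cbf4, so the XOR with the observed value is
nonzero only in the upper halfword.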

  Ralf


* Re: a really really weird crash on swarm
  2002-08-11 16:51 ` Ralf Baechle
@ 2002-08-19 12:57   ` Maciej W. Rozycki
  2002-08-19 13:28     ` Ralf Baechle
  0 siblings, 1 reply; 5+ messages in thread
From: Maciej W. Rozycki @ 2002-08-19 12:57 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Jun Sun, linux-mips

On Sun, 11 Aug 2002, Ralf Baechle wrote:

> > Call me crazy, but I have seen a crash like this.  As you can see, register
> > $16 ($s0) is loaded with one value, and on the next instruction it shows a
> > different value.  What the hell could possibly be going on?
> > 
> > This is with today's OSS tree 2.4 branch.
> 
> Really odd because the register only lost the upper 16 bits; the lower 16
> bits still have their expected value.

 It is a typical symptom of a register being corrupted between a "lui" and
an "addiu"  -- an exception must have done it in the immediately preceding
code.  You might be able to track a reason down by carefully studying
possible exception paths at the place of the problem.  Unfortunately you
don't have much of the state preserved at this stage -- you only know
which register was corrupted. 
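
 Working backwards from the oops gives the value the register must have held
when the "addiu" ran.  A small Python sketch, assuming the "addiu" itself
executed correctly and using made-up helper names:

```python
def sext16(imm):
    """Sign-extend a 16-bit immediate, as addiu does."""
    return imm - 0x10000 if imm & 0x8000 else imm


def value_before_addiu(observed, imm):
    """Undo 'addiu reg,reg,imm' on a 32-bit register value."""
    return (observed - sext16(imm)) & 0xffffffff


after_lui = 0x802a << 16                            # what lui should leave
before = value_before_addiu(0xffffcbf4, 0xcbf4)     # what the oops implies

if __name__ == "__main__":
    print(f"expected before addiu: {after_lui:08x}")
    print(f"implied  before addiu: {before:08x}")
```

Under that assumption the register held 00000000 rather than 802a0000 when
the "addiu" executed, so the whole result of the "lui" was lost, not just a
few bits of it.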

 Another possible approach is to add some code that compares the value of
the register upon exception entry and exit and wait for it to trigger
-- for a single register it shouldn't be too tough and you still have much
of the state available before an "rfe" or "eret".
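
 In the kernel such a check would have to live in the exception entry and
exit assembly.  The Python model below only illustrates the logic; the
class, the handlers and the register-file list are all hypothetical.

```python
class RegisterWatch:
    """Record any handler that returns with the watched register changed."""

    def __init__(self, reg):
        self.reg = reg
        self.hits = []

    def run(self, regs, handler, name):
        before = regs[self.reg]   # snapshot on "exception entry"
        handler(regs)
        after = regs[self.reg]    # compare on "exception exit"
        if before != after:
            self.hits.append((name, before, after))


def good_handler(regs):
    saved = regs[16]
    regs[16] = 0x1234             # handler uses the register as scratch...
    regs[16] = saved              # ...and restores it before returning


def buggy_handler(regs):
    regs[16] = 0                  # clobbers the register and never restores it


if __name__ == "__main__":
    regs = [0] * 32
    regs[16] = 0x802a0000         # $s0 right after the lui
    watch = RegisterWatch(16)
    watch.run(regs, good_handler, "good")
    watch.run(regs, buggy_handler, "buggy")
    for name, before, after in watch.hits:
        print(f"{name}: ${watch.reg} {before:08x} -> {after:08x}")
```

Only the handler that fails to restore the register gets logged, which is
the trigger to wait for.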

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +


* Re: a really really weird crash on swarm
  2002-08-19 12:57   ` Maciej W. Rozycki
@ 2002-08-19 13:28     ` Ralf Baechle
  0 siblings, 0 replies; 5+ messages in thread
From: Ralf Baechle @ 2002-08-19 13:28 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Jun Sun, linux-mips

On Mon, Aug 19, 2002 at 02:57:14PM +0200, Maciej W. Rozycki wrote:

> > Really odd because the register only lost the upper 16 bits; the lower 16
> > bits still have their expected value.
> 
>  It is a typical symptom of a register being corrupted between a "lui" and
> an "addiu"  -- an exception must have done it in the immediately preceding
> code.  You might be able to track a reason down by carefully studying
> possible exception paths at the place of the problem.  Unfortunately you
> don't have much of the state preserved at this stage -- you only know
> which register was corrupted. 

There is little potential for an exception in this case, as interrupts had
been disabled and the addresses used were, or at least should all have been,
in KSEG0.
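
To make the KSEG0 point concrete, here is a small Python sketch that
classifies the addresses involved using the standard 32-bit MIPS segment
map.  The table and function names are made up for illustration.

```python
# (start, end, name, goes through the TLB?) for the 32-bit MIPS address map.
SEGMENTS = [
    (0x00000000, 0x7fffffff, "kuseg", True),
    (0x80000000, 0x9fffffff, "kseg0", False),
    (0xa0000000, 0xbfffffff, "kseg1", False),
    (0xc0000000, 0xffffffff, "kseg2/kseg3", True),
]


def segment(addr):
    """Return (segment name, mapped?) for a 32-bit virtual address."""
    for start, end, name, mapped in SEGMENTS:
        if start <= addr <= end:
            return name, mapped
    raise ValueError(f"not a 32-bit address: {addr:#x}")


if __name__ == "__main__":
    for label, addr in [("epc", 0x801209d4),
                        ("intended ll address", 0x8029cbf4),
                        ("faulting ll address", 0xffffcbf4)]:
        name, mapped = segment(addr)
        kind = "mapped" if mapped else "unmapped"
        print(f"{label}: {addr:08x} -> {name} ({kind})")
```

The code and the intended data address are both in unmapped KSEG0, so
neither can raise a TLB exception; only the corrupted address ffffcbf4 lands
in a mapped segment, which is where the page fault comes from.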

>  Another possible approach is to add some code that compares the value of
> the register upon exception entry and exit and wait for it to trigger
> -- for a single register it shouldn't be too tough and you still have much
> of the state available before an "rfe" or "eret".

Don't try to think too deterministically.  Jun was working on first silicon,
so not necessarily on as deterministic a platform as we'd like.  Fortunately,
as you may have seen in the kernel code, there's already newer silicon, so
I'd simply file this one to /dev/null for now.

  Ralf

