From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 18 Sep 2000 17:13:29 +0300
From: Alex Shnitman <alexsh@hectic.net>
To: "Mark A. Greer" <mgreer@mvista.com>
Cc: linuxppc-embedded@lists.linuxppc.org
Subject: Re: Sandpoint & random crashes?
Message-ID: <20000918171328.A4328@hectic.net>
References: <2F67A63DFFB1D31185D90090278CBB2D014ECF08@apmail6.chn.agilent.com> <39BD4215.6FD9B56F@mvista.com> <20000912001214.B17705@hectic.net> <39BD66B9.9A1A362F@mvista.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <39BD66B9.9A1A362F@mvista.com>; from mgreer@mvista.com on Mon, Sep 11, 2000 at 04:11:53PM -0700
Sender: owner-linuxppc-embedded@lists.linuxppc.org
List-Id: <linuxppc-embedded@lists.linuxppc.org>


Hi, Mark!

On Mon, Sep 11, 2000 at 04:11:53PM -0700, you wrote the following:

> > Today I think I noticed a very interesting consistency that might be
> > helpful. I haven't had the time to test it completely; I'll do it
> > tomorrow and post again. The thing is, there's a little green led on
> > the board saying "backup power" or something like that. If you turn
> > off the computer and the power supply, and leave it off for half a
> > minute or so, the led turns off. If you turn the computer on
> > afterwards and load the kernel, it loads init and you can work (until
> > it crashes). If you just reset the computer and load the kernel (after
> > uploading it via dink of course), init won't load.
> >
>
> This almost sounds like a hardware problem.  How old is your
> processor module?  Remember this is a test platform for MOT SPS
> where they test out new processors, etc.  They may have given you an
> early rev board or processor or host bridge or...  If you have an
> old one, you may want to ask for a newer one.

We bought these boards from Motorola just now. I checked on their
site, and we have the latest revisions.

As for hardware problems, I took the other box we have here (an
identical configuration) and tested there, with the same results. :-(
So if it's a hardware problem, it's in all of these boards. I also ran
dink's memory test on all the memory I can reach (from 90000 to the
end; dink itself resides below that) and it didn't find any errors.
(I ran all six or seven tests, 19 times in a row -- about 19-20 hours
of testing.) So unless I'm extremely unlucky (and the problems are in
the low range), the memory isn't the problem either.

I downloaded the compilers from CDK 1.2 and compiled a kernel with
them. It made no difference.

Here are some more crash dumps, FWIW. Something is definitely fishy
with regard to memory management.

This one is weird -- it complains about a swap entry, but I don't
have any swap.

> mount-t^H ^H^H^H

sh: mount-: coBad swap file entry 00000085
kernel BUG at swap_state.c:71!

NIP: C002F0D4 XER: 20000000 LR: C002F0D4 REGS: c0283ce0 TRAP: 0700
MSR: 00089032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK = c0282000[7] 'sh' Last syscall: 1
last math c0282000 last altivec 00000000
GPR00: C002F0D4 C0283D90 C0282000 0000001F 00001032 C0105734 C0140000 00000000
GPR08: C0140000 C0100000 C0110000 C0283CD0 44262824 100A09F8 00000000 100A6190
GPR16: 00000000 00000000 00000000 00000000 C01310E0 C0284100 1024D000 0FF2E000
GPR24: 00000000 00000000 0032E000 00000041 C03CDB4C C0284100 00000085 C01AF6B8
Call backtrace:
C002F0D4 C002F1EC C002F328 C00208B8 C002393C C0013F84 C0016FB0
C00171F4 C0004CC0 0FE79968 1002190C 10020DC8 1001D75C 1004D09C
10010B08 1000FBC4 0FE6F75C 00000000
Kernel panic: Exception in kernel pc c002f0d4 signal 4

backtrace:
0xc002f0d4 -- 0xc002f080 + 0x0054   __delete_from_swap_cache
0xc002f1ec -- 0xc002f14c + 0x00a0   delete_from_swap_cache_nolock
0xc002f328 -- 0xc002f280 + 0x00a8   free_page_and_swap_cache
0xc00208b8 -- 0xc0020730 + 0x0188   zap_page_range
0xc002393c -- 0xc0023838 + 0x0104   exit_mmap
0xc0013f84 -- 0xc0013f4c + 0x0038   mmput
0xc0016fb0 -- 0xc0016ed0 + 0x00e0   do_exit
0xc00171f4 -- 0xc00171f4 + 0x0000   sys_wait4
0xc0004cc0 -- 0xc0004cc0 + 0x0000   ret_from_syscall_1


And this one is crazy -- a string of 14,500,000 characters worked
fine, 15,000,000 gave me "Out of memory", and the midpoint between
them (14,750,000) gave me this:

bash-2.03# perl -e '$a="A"x14750000'
kmem_free: Bad obj addr (objp=c0177500, name=size-64)
kernel BUG at slab.c:1695!
NIP: C002CDD4 XER: 20000000 LR: C002CDD4 REGS: c0104cd0 TRAP: 0700
MSR: 00089032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK = c0103000[0] 'swapper' Last syscall: 36
last math 00000000 last altivec 00000000
GPR00: C002CDD4 C0104D80 C0103000 0000001B 00001032 C0105734 C0140000 00000000
GPR08: C0140000 C0100000 C0110000 C0104CC0 24462024 100A09F8 00000000 00000000
GPR16: 00000000 00000000 00000000 00000000 00000000 00104EB0 C02F7042 C02FFA40
GPR24: C0177500 00000014 0000001C C01023E0 C017755C C0177FE0 C0177500 C01A0160
Call backtrace:
C002CDD4 C009E1E4 C009EA68 C009DA98 C009DF70 C0093E4C C00189EC
C0004F60 00000000 C0006130 C0006144 C011678C 00003C60
Kernel panic: Exception in kernel pc c002cdd4 signal 4
In interrupt handler - not syncing
Rebooting in 180 seconds..

backtrace:
0xc002cdd4 -- 0xc002ca04 + 0x03d0   kfree
0xc009e1e4 -- 0xc009e0e4 + 0x0100   ip_free
0xc009ea68 -- 0xc009e740 + 0x0328   ip_defrag
0xc009da98 -- 0xc009da70 + 0x0028   ip_local_deliver
0xc009df70 -- 0xc009dc44 + 0x032c   ip_rcv
0xc0093e4c -- 0xc0093c48 + 0x0204   net_rx_action
0xc00189ec -- 0xc0018934 + 0x00b8   do_softirq
0xc0004f60 -- 0xc0004f60 + 0x0000   do_bottom_half_ret
0x00000000 -- unknown address
0xc0006130 -- 0xc00060c0 + 0x0070   idled
0xc0006144 -- 0xc0006134 + 0x0010   cpu_idle
0xc011678c -- 0xc0116644 + 0x0148   start_kernel
0x00003c60 -- unknown address
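
For anyone who wants to reproduce the annotated backtraces above: each
raw address is matched to the nearest symbol that starts at or below
it, and the offset is the difference. Here is a minimal Python sketch
of that lookup, assuming a System.map-style symbol table with
"address type name" lines. The function names (load_symbols, resolve,
format_entry) and the three-line SAMPLE_MAP are hypothetical; the
sample addresses come from the first backtrace, and the "T" type
letters are an assumption. This is not the tool that produced the
listings above.

```python
import bisect

def load_symbols(lines):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, addr):
    """Return (symbol_start, offset, name), or None if addr is below all symbols."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1  # last symbol starting <= addr
    if i < 0:
        return None
    start, name = symbols[i]
    return start, addr - start, name

def format_entry(symbols, addr):
    """Format one backtrace line in the same layout as the listings above."""
    hit = resolve(symbols, addr)
    if hit is None:
        return "0x%08x -- unknown address" % addr
    start, offset, name = hit
    return "0x%08x -- 0x%08x + 0x%04x   %s" % (addr, start, offset, name)

# Hypothetical sample map; addresses taken from the first backtrace,
# the "T" type letters are assumed.
SAMPLE_MAP = """\
c002f080 T __delete_from_swap_cache
c002f14c T delete_from_swap_cache_nolock
c002f280 T free_page_and_swap_cache
""".splitlines()

if __name__ == "__main__":
    syms = load_symbols(SAMPLE_MAP)
    for address in (0xC002F0D4, 0xC002F1EC, 0xC002F328, 0x0FE79968):
        print(format_entry(syms, address))
```

Note that the sketch only reports an address as unknown when it lies
below the first symbol; it does not check the upper end of the kernel
text, so a real symbol table and a bounds check would be needed for
anything beyond a quick look.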


--
Alex Shnitman                            | http://www.debian.org
alexsh@hectic.net, alexsh@linux.org.il   +-----------------------
http://alexsh.hectic.net    UIN 188956    PGP key on web page
       E1 F2 7B 6C A0 31 80 28  63 B8 02 BA 65 C7 8B BA

/real/ kernel hackers
    dd if=/dev/urandom of=/vmlinuz
and influence the Universal Randomosity Field.
	-- Gaal Yahas

** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/