Re: kernel access of bad area, sig: 11 ( mpc852t)

linuxppc-dev.lists.ozlabs.org archive mirror

* Re: kernel access of bad area, sig: 11  ( mpc852t)
  2006-04-11 14:06 Oops: " gautam borad
@ 2006-04-11 14:39 ` Mark Chambers
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Chambers @ 2006-04-11 14:39 UTC (permalink / raw)
  To: gautam borad, linuxppc-embedded

> Hi,
>    I'm having a problem porting Linux kernel 2.4.21 to our MPC852T custom
> board. The kernel panics randomly with sig 11.
> The board boots up fine and we also get to the prompt. When we open 3-4
> telnet sessions and try to run some command, the kernel panics. This is
> completely random. Sometimes it even panics before opening the telnet
> session.
>
> One of the oops dumps is:
>
> -----------------------------------------------------------------------------------------------------------------------------
> Oops: kernel access of bad area, sig: 11
> NIP: C0019FD0 XER: 00000000 LR: C001A06C SP: C1C33AD0 REGS: c1c33a20
> TRAP: 0300    Not tainted
> MSR: 00009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
> DAR: 725F6578, DSISR: 0000000B
> TASK = c1c32000[48] 'insmod' Last syscall: 5
> last math 00000000 last altivec 00000000
> GPR00: 7361726D C1C33AD0 C1C32000 00000000 C0113678 C0150000 C0150000 C014B210
> GPR08: C014B210 C012D060 00000000 725F6578 04000024 00000000 00000000 00000000
> GPR16: 00000000 00000000 00000000 00000000 00001032 01C33BA0 00000000 C0000000
> GPR24: C014BE38 C0150000 C0110000 C0110000 C0140000 00000001 00000000 00000001
> Call backtrace:
> C001A0C8 C0016174 C0015FEC C0015CC0 C0005E38 C0004668 C1C33D10
> C0004670 C004A380 C003FFD4 C00404D8 C0040AD4 C0040FF8 C00330D8
> C00334AC C000443C 1006EF48 1001FCF0 1002023C 10003A18 100036A0
> 300591AC 00000000
> -----------------------------------------------------------------------------------------------------------------------------
>
> The call backtrace is not much help because it is different on every
> oops.
> We are using U-Boot 1.1.3.
>
> Thanks in advance.
>
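
One detail in the dump above may be worth a look: the DAR (725F6578), which
also appears in GPR11, and GPR00 (7361726D) consist entirely of printable
ASCII bytes ("r_ex" and "sarm"). This is only a hint, not proof, but a
faulting address that reads as text can mean that a pointer was overwritten
with string data. A minimal sketch of that check in C follows; the helper
names looks_like_text and reg_to_ascii are made up for illustration and are
not part of any kernel tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if all four bytes of a 32-bit register
 * value are printable ASCII (0x20..0x7E), 0 otherwise. */
int looks_like_text(uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        uint32_t byte = (value >> (8 * i)) & 0xFFu;
        if (byte < 0x20u || byte > 0x7Eu)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: render a register value as four characters, most
 * significant byte first (PowerPC is big-endian). Unprintable bytes are
 * shown as '.'. The caller provides room for 5 chars. */
void reg_to_ascii(uint32_t value, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        uint32_t byte = (value >> (24 - 8 * i)) & 0xFFu;
        out[i] = (byte >= 0x20u && byte <= 0x7Eu) ? (char)byte : '.';
    }
    out[4] = '\0';
}
```

Kernel addresses such as the NIP (C0019FD0) fail looks_like_text because
their top byte is 0xC0, so the check separates plausible pointers from
text-like values in this dump.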

You almost certainly have SDRAM problems.  If you have thoroughly checked
out the complete address range statically, remember that burst accesses
will not occur until the cache is turned on, so your problem may be with
bursting.  But you can also have severe problems like a missing address
line and Linux will still run for a few seconds.

Mark Chambers 
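
For the "missing address line" case mentioned above, a common static check
is a power-of-two address walk: write a pattern at each power-of-two word
offset and look for aliasing. The following is a minimal sketch in C, not
code from this thread. The function name address_line_test and the return
convention are assumptions for illustration. On a real board, base would
point at the start of SDRAM, the code would run from flash, and the data
cache would need to be off so that every access reaches the bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address-line test over num_words 32-bit words (num_words
 * must be a power of two). Returns -1 if no fault was seen, otherwise the
 * word offset at which a stuck or shorted address bit was detected. */
long address_line_test(volatile uint32_t *base, size_t num_words)
{
    const uint32_t pattern = 0xAAAAAAAAu;
    const uint32_t antipattern = 0x55555555u;
    size_t offset, test_offset;

    assert(base != NULL);
    assert(num_words >= 2 && (num_words & (num_words - 1)) == 0);

    /* Write the default pattern at every power-of-two offset. */
    for (offset = 1; offset < num_words; offset <<= 1)
        base[offset] = pattern;

    /* Address bits stuck high: writing offset 0 must not disturb others. */
    base[0] = antipattern;
    for (offset = 1; offset < num_words; offset <<= 1)
        if (base[offset] != pattern)
            return (long)offset;
    base[0] = pattern;

    /* Address bits stuck low or shorted: writing one power-of-two offset
     * must not change offset 0 or any other power-of-two offset. */
    for (test_offset = 1; test_offset < num_words; test_offset <<= 1) {
        base[test_offset] = antipattern;
        if (base[0] != pattern)
            return (long)test_offset;
        for (offset = 1; offset < num_words; offset <<= 1)
            if (offset != test_offset && base[offset] != pattern)
                return (long)test_offset;
        base[test_offset] = pattern;
    }
    return -1;
}
```

Run against an ordinary host buffer this can only confirm that the logic
executes; it says something about a board only when pointed at the real
SDRAM window.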

* Re: kernel access of bad area, sig: 11 ( mpc852t)
@ 2006-04-17 11:35 Akshay Mishra
  0 siblings, 0 replies; 6+ messages in thread
From: Akshay Mishra @ 2006-04-17 11:35 UTC (permalink / raw)
  To: linuxppc-embedded

>> Hi,
>> I'm having a problem porting Linux kernel 2.4.21 to our MPC852T custom
>> board. The kernel panics randomly with sig 11.
>> The board boots up fine and we also get to the prompt. When we open 3-4
>> telnet sessions and try to run some command, the kernel panics. This is
>> completely random. Sometimes it even panics before opening the telnet
>> session.
>>
>> One of the oops dumps is:
>>
>> ------------------------------------------------------------------------
>> Oops: kernel access of bad area, sig: 11
>> NIP: C0019FD0 XER: 00000000 LR: C001A06C SP: C1C33AD0 REGS: c1c33a20
>> TRAP: 0300    Not tainted
>> MSR: 00009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
>> DAR: 725F6578, DSISR: 0000000B
>> TASK = c1c32000[48] 'insmod' Last syscall: 5
>> last math 00000000 last altivec 00000000
>> GPR00: 7361726D C1C33AD0 C1C32000 00000000 C0113678 C0150000 C0150000 C014B210
>> GPR08: C014B210 C012D060 00000000 725F6578 04000024 00000000 00000000 00000000
>> GPR16: 00000000 00000000 00000000 00000000 00001032 01C33BA0 00000000 C0000000
>> GPR24: C014BE38 C0150000 C0110000 C0110000 C0140000 00000001 00000000 00000001
>> Call backtrace:
>> C001A0C8 C0016174 C0015FEC C0015CC0 C0005E38 C0004668 C1C33D10
>> C0004670 C004A380 C003FFD4 C00404D8 C0040AD4 C0040FF8 C00330D8
>> C00334AC C000443C 1006EF48 1001FCF0 1002023C 10003A18 100036A0
>> 300591AC 00000000
>> ------------------------------------------------------------------------
>> The call backtrace is not much help because it is different on every
>> oops.
>> We are using U-Boot 1.1.3.
>>
>> Thanks in advance.
>>

>You almost certainly have SDRAM problems.  If you have thoroughly checked
>out the complete address range statically, remember that burst accesses
>will not occur until the cache is turned on, so your problem may be with
>bursting.  But you can also have severe problems like a missing address
>line and Linux will still run for a few seconds.
>
>Mark Chambers

We've checked the SDRAM. The timings (UPM) look fine. The problem,
however, is that Linux does not hang until after a few processes are
started.
If we boot to Linux and leave it alone, everything is fine and the
board keeps working. However, each time a few processes (4-5 telnet
sessions, for example) are started, the system either panics or hangs
(goes dead).

Thanks in advance,
Akshay

* kernel access of bad area, sig: 11 ( mpc852t)
@ 2006-04-19 12:58 Kenneth Poole
  2006-04-19 13:45 ` Mark Chambers
  0 siblings, 1 reply; 6+ messages in thread
From: Kenneth Poole @ 2006-04-19 12:58 UTC (permalink / raw)
  To: linuxppc-embedded

>>> Hi,
>>> I'm having a problem porting Linux kernel 2.4.21 to our MPC852T custom
>>> board. The kernel panics randomly with sig 11.
>>> The board boots up fine and we also get to the prompt. When we open 3-4
>>> telnet sessions and try to run some command, the kernel panics. This is
>>> completely random. Sometimes it even panics before opening the telnet
>>> session.
>>>
>>> <oops dump snipped>
>>>
>>You almost certainly have SDRAM problems.  If you have thoroughly checked
>>out the complete address range statically, remember that burst accesses
>>will not occur until the cache is turned on, so your problem may be with
>>bursting.  But you can also have severe problems like a missing address
>>line and Linux will still run for a few seconds.
>>
>>Mark Chambers
>
>We've checked the SDRAM. The timings (UPM) look fine. The problem,
>however, is that Linux does not hang until after a few processes are
>started.
>If we boot to Linux and leave it alone, everything is fine and the
>board keeps working. However, each time a few processes (4-5 telnet
>sessions, for example) are started, the system either panics or hangs
>(goes dead).
>
>Thanks in advance,
>Akshay

We have been experiencing this same issue with random boards in
production. The exact same version of software will run for months on
other instances of the exact same board design, but a few percent get
'random' trap 300s. When they do occur, it's only after Linux has booted
and address translation and caching are turned on. Examining the oops-es
and memory shows that some location in SDRAM has a bogus value, but I
don't have the tools to trace back how it got that way.

I have ported a rigorous moving-inversions memory test into our
firmware, and have run it extensively across the entire SDRAM address
space (the test code executes from flash). I have let this test run
continuously for hours and hours, but never found a memory problem.
Unfortunately, I do not have test software that enables the MMU address
translation or caching, so as Mark said, I can't test memory using
bursting. Our hardware engineers have reviewed the designs very
carefully and are quite confident that there is plenty of margin in the
memory timing. Signal quality has also been carefully checked.

Our manufacturing people have replaced the CPU on some of these boards,
and the problem went away.

If anyone else on the mailing list has experienced this issue, or has
developed a virtual address memory test, please let us know.

Ken Poole
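
For readers unfamiliar with the moving-inversions test described above, the
idea is: fill memory with a pattern, walk upward checking the pattern and
writing its complement, then walk downward checking the complement and
writing the pattern back. The following is a minimal sketch in C under those
assumptions. It is not the firmware test from this message, and the function
names are invented for illustration. Like that test, it only exercises
single-beat accesses unless the data cache is enabled on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single moving-inversions pass over num_words 32-bit words.
 * Returns the number of mismatching words observed. */
size_t moving_inversions(volatile uint32_t *base, size_t num_words,
                         uint32_t pattern)
{
    const uint32_t inverse = ~pattern;
    size_t errors = 0;
    size_t i;

    assert(base != NULL && num_words > 0);

    /* Fill the whole region with the pattern. */
    for (i = 0; i < num_words; i++)
        base[i] = pattern;

    /* Ascending: verify the pattern, then write its complement. */
    for (i = 0; i < num_words; i++) {
        if (base[i] != pattern)
            errors++;
        base[i] = inverse;
    }

    /* Descending: verify the complement, then restore the pattern. */
    for (i = num_words; i-- > 0; ) {
        if (base[i] != inverse)
            errors++;
        base[i] = pattern;
    }
    return errors;
}

/* Hypothetical driver: repeat the pass with a walking-one pattern so that
 * each data bit is toggled against its neighbours. */
size_t moving_inversions_walking(volatile uint32_t *base, size_t num_words)
{
    size_t errors = 0;
    for (int bit = 0; bit < 32; bit++)
        errors += moving_inversions(base, num_words, (uint32_t)1 << bit);
    return errors;
}
```

The region is left holding the last pattern written, which makes it easy to
inspect memory afterwards if a mismatch was counted.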

* Re: kernel access of bad area, sig: 11 ( mpc852t)
  2006-04-19 12:58 kernel access of bad area, sig: 11 ( mpc852t) Kenneth Poole
@ 2006-04-19 13:45 ` Mark Chambers
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Chambers @ 2006-04-19 13:45 UTC (permalink / raw)
  To: Kenneth Poole, linuxppc-embedded

>We have been experiencing this same issue with random boards in production.
>The exact same version of software will run for months on other instances
>of the exact same board design, but a few percent get 'random' trap 300s.
>When they do occur, it's only after Linux has booted and address
>translation and caching are turned on. Examining the oops-es and memory
>shows that some location in SDRAM has a bogus value, but I don't have the
>tools to trace back how it got that way.
>
>I have ported a rigorous moving-inversions memory test into our firmware,
>and have run it extensively across the entire SDRAM address space (the
>test code executes from flash). I have let this test run continuously for
>hours and hours, but never found a memory problem. Unfortunately, I do not
>have test software that enables the MMU address translation or caching, so
>as Mark said, I can't test memory using bursting. Our hardware engineers
>have reviewed the designs very carefully and are quite confident that there
>is plenty of margin in the memory timing. Signal quality has also been
>carefully checked.

Ouch!  Yeah, these are the tough ones, the intermittent ones.  You can, btw,
force a burst cycle using the RUN command in the MCR, similar to what you do
to generate a few refreshes when configuring the DRAM.  And you can easily
enable the cache for testing and then you'll get bursts (I don't think the
MMU will have any effect).  A burst is not so much different from other
cycles, so I don't think bursting per se is what causes problems when the
kernel starts.  I think it has more to do with the increased randomness of
accesses with multitasking and caching and all that.

>Our manufacturing people have replaced the CPU on some of these boards, and 
>the problem went away.

It also seems to me that the cache is the most delicate bit of logic in the
852.  So if you have ground noise or problems on the 1.8V rail it will
likely show up in the cache - I had hardware problems where I could track it
down to a mismatch between the cache line and memory (and the scope showed
the read burst to be fine).  Also look closely at the PLL circuit - it can
work both ways, the PLL can inject noise back into the unfiltered supply (I
use a ferrite instead of the inductor that Freescale recommends).

That's my $.02 :-)

Mark C.
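
One way to approximate the "increased randomness of accesses" described
above in a standalone test is to touch memory one cache line at a time in a
scrambled order, then verify in a different scrambled order. The sketch
below is an illustration in C, not code from this thread. It assumes a
16-byte (four-word) cache line, and the function name, the mixing function
and the generator constants are choices made for this example. The line
order comes from a linear congruential generator whose multiplier is 1 mod 4
and whose increment is odd, so it visits every line exactly once when the
line count is a power of two. On the target, real bursts would only occur
with the data cache enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 4u /* assumed 16-byte cache line */

/* Value expected at a given word; a cheap mix so neighbours differ. */
static uint32_t expected_word(uint32_t line, uint32_t word, uint32_t seed)
{
    uint32_t v = (line * WORDS_PER_LINE + word) ^ seed;
    v *= 2654435761u;
    return v ^ (v >> 15);
}

/* Hypothetical scattered line test over num_lines lines (power of two).
 * Returns the number of mismatching words. */
size_t scattered_line_test(volatile uint32_t *base, uint32_t num_lines,
                           uint32_t seed)
{
    const uint32_t mask = num_lines - 1u;
    size_t errors = 0;
    uint32_t line, n, w;

    assert(base != NULL);
    assert(num_lines >= 1u && (num_lines & mask) == 0u);

    /* Write phase: visit every line once in scrambled order. */
    line = 0;
    for (n = 0; n < num_lines; n++) {
        for (w = 0; w < WORDS_PER_LINE; w++)
            base[(size_t)line * WORDS_PER_LINE + w] =
                expected_word(line, w, seed);
        line = (line * 1664525u + 1013904223u) & mask;
    }

    /* Verify phase: a different full-period order over the same lines. */
    line = 0;
    for (n = 0; n < num_lines; n++) {
        for (w = 0; w < WORDS_PER_LINE; w++)
            if (base[(size_t)line * WORDS_PER_LINE + w] !=
                expected_word(line, w, seed))
                errors++;
        line = (line * 22695477u + 1u) & mask;
    }
    return errors;
}
```

Changing the seed between runs varies the data written, which helps when a
fault depends on the values on the bus as well as on the access order.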

* RE: kernel access of bad area, sig: 11 ( mpc852t)
@ 2006-04-19 14:33 Rune Torgersen
  0 siblings, 0 replies; 6+ messages in thread
From: Rune Torgersen @ 2006-04-19 14:33 UTC (permalink / raw)
  To: Kenneth Poole, linuxppc-embedded

When I was tracking down a timing problem on our SDRAM I found that
doing a native compile of glibc over NFS seems to be a very good memory
test.


* kernel access of bad area, sig: 11 ( mpc852t)
@ 2006-04-21 15:10 Akshay Mishra
  0 siblings, 0 replies; 6+ messages in thread
From: Akshay Mishra @ 2006-04-21 15:10 UTC (permalink / raw)
  To: linuxppc-embedded, kpoole

What processor frequency do you use? The EP board for the 852T uses a 10
MHz OSCM and CLKIN. We tried 66 MHz earlier and 25 MHz after
that, but never got any results. The hardware is clean AFAIK, and the
memory timing is more critical on the MPC8280 on the same board and it
works very well.

We tried changing the clock source to oscillator/crystal and slowing
the clock out to the SDRAM to verify whether memory accesses were the
problem, but no result there either.

The kernel we have is 2.4.21. Will migrating to 2.6 make life any better?

Best,
Akshay
