Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors

linuxppc-dev.lists.ozlabs.org archive mirror

* Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors
@ 2004-03-25 23:13 Marcelo Tosatti
  2004-03-26  6:51 ` Dan Malek
  2004-03-26  7:23 ` LC Geldenhuys
  0 siblings, 2 replies; 6+ messages in thread
From: Marcelo Tosatti @ 2004-03-25 23:13 UTC (permalink / raw)
  To: linuxppc-embedded

Hi,

We encountered a problem with our MPC855T based appliances under heavy
load. The crashes looked like this:

Oops: Kernel Mode Software FPU Emulation, sig: 8
NIP: 00001FFC XER: 20000000 LR: 00000590 SP: C0D99DC0 REGS: c0d99d10 TRAP:1000    Not tainted
MSR: 00001000 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 00
TASK = c0d98000[115] 'webs' Last syscall: 102
last math 00000000 last altivec 00000000
GPR00: 00000001 C0D99DC0 C0D98000 C0D99DD0 00000000 00000001 000005A8 00000000
GPR08: C59ED4F0 00000000 00000000 00000002 00000000 1007335C 00000000 00000000
GPR16: 00000000 00000000 00000000 00000000 00001032 00D99DC0 C00CFFFC 00009032
GPR24: C00027C0 10049C90 00000000 00000002 000005A8 C653D000 C59ED630 C653D0D8
Call backtrace:
C00E2470 C00CFFBC C00C4610 C00E25A4 C00A2A00 C00A3024 C000281C
00000001 100330A4 10033B1C 10041F40 10029418 1002E530 1003FA7C
1003F240 1003F140 1003485C 100346A4 1000234C 0FBE7FDC 00000000
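
The flag values printed on the MSR line of the oops can be cross-checked by hand. A minimal sketch, assuming the classic 32-bit PowerPC MSR bit positions (EE=0x8000, PR=0x4000, FP=0x2000, ME=0x1000, IR=0x20, DR=0x10); the helper name decode_msr is made up for illustration and is not part of the kernel:

```python
# Assumed MSR bit masks for classic 32-bit PowerPC cores.
MSR_BITS = {
    "EE": 0x8000,  # external interrupts enabled
    "PR": 0x4000,  # problem (user) state
    "FP": 0x2000,  # floating point available
    "ME": 0x1000,  # machine check enabled
    "IR": 0x0020,  # instruction address translation
    "DR": 0x0010,  # data address translation
}


def decode_msr(msr):
    """Return a dict mapping each flag name to 0 or 1 for the given MSR value."""
    return {name: int(bool(msr & mask)) for name, mask in MSR_BITS.items()}


if __name__ == "__main__":
    # MSR value taken from the oops above.
    print(decode_msr(0x00001000))
```

For the value in the oops (00001000) only ME comes out set, which agrees with the "EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 00" summary: kernel mode, interrupts off, address translation off.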

The kernel crashed trying to execute address "00001FFC". I have seen similar
reports in the Linux PPC list archives. The problem is that "bl transfer_to_handler"
(transfer_to_handler is at "2000") was jumping to "1FFC" instead, on some rare occasions
(only under heavy network/memory activity).
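
To make the one-word miss concrete, here is a rough sketch of how a relative "bl" encodes its target, assuming the standard PowerPC I-form branch layout (primary opcode 18, a 24-bit word displacement, then the AA and LK bits). The call-site address 0x58C is an inference from LR = 00000590 in the oops (bl stores the address of the following instruction in LR), not something stated in the report, and both helper names are hypothetical:

```python
def encode_bl(cia, target):
    """Encode a relative 'bl target' located at address cia (AA=0, LK=1)."""
    disp = target - cia
    if disp % 4 != 0:
        raise ValueError("branch displacement must be word aligned")
    if not -(1 << 25) <= disp < (1 << 25):
        raise ValueError("displacement out of range for an I-form branch")
    # opcode 18 in the top 6 bits, displacement in bits 2..25, LK in bit 0
    return (18 << 26) | (disp & 0x03FFFFFC) | 0x1


def branch_target(cia, insn):
    """Compute the target address of an I-form branch word located at cia."""
    if insn >> 26 != 18:
        raise ValueError("not an I-form branch")
    disp = insn & 0x03FFFFFC
    if disp & 0x02000000:          # sign-extend the 26-bit displacement
        disp -= 1 << 26
    base = 0 if insn & 0x2 else cia  # AA=1 means absolute addressing
    return (base + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    insn = encode_bl(0x58C, 0x2000)   # assumed call site, see lead-in
    print(hex(insn), hex(branch_target(0x58C, insn)))
```

A correctly fetched and executed branch word can only land on 0x2000; 0x1FFC is exactly one instruction (4 bytes) short of it. The sketch does not explain why the core ended up there, only how small the miss is.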

After thinking for a while and talking to Dan Malek, it seems "isync" instructions before
"bl transfer_to_handler" are required to avoid cache coherency problems.

I'm not exactly sure why we were jumping to "1FFC" instead of "2000",
but adding "isync" before "bl transfer_to_handler" in both DecrementTimer
and HardwareInterrupt fixed the problem for us.

In the following patch against 2.4.25 I also add "isync" to the FINISH_EXCEPTION define, for safety.

The performance impact of this is hardly noticeable.

It seems Dan is OK with including this in the linuxppc-2.4 repository. We also want
to add this to 2.6 when the m8xx support gets fixed.

Regards,

--- head_8xx.S.orig     2004-03-25 18:30:49.323575664 -0300
+++ head_8xx.S  2004-03-25 18:32:00.464760560 -0300
@@ -172,6 +172,7 @@
  */

 #define FINISH_EXCEPTION(func)                 \
+       isync;                                  \
        bl      transfer_to_handler;            \
        .long   func;                           \
        .long   ret_from_except
@@ -228,6 +229,7 @@
        addi    r3,r1,STACK_FRAME_OVERHEAD
        li      r20,MSR_KERNEL
        li      r4,0
+       isync
        bl      transfer_to_handler
        .globl  do_IRQ_intercept
 do_IRQ_intercept:
@@ -265,6 +267,7 @@
        EXCEPTION_PROLOG
        addi    r3,r1,STACK_FRAME_OVERHEAD
        li      r20,MSR_KERNEL
+       isync
        bl      transfer_to_handler
        .globl  timer_interrupt_intercept
 timer_interrupt_intercept:

----- End forwarded message -----

** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/


* Re: Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors
  2004-03-25 23:13 Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors Marcelo Tosatti
@ 2004-03-26  6:51 ` Dan Malek
  2004-03-26  9:04   ` Wolfgang Denk
  2004-03-26  7:23 ` LC Geldenhuys
  1 sibling, 1 reply; 6+ messages in thread
From: Dan Malek @ 2004-03-26  6:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linuxppc-embedded

Marcelo Tosatti wrote:

> We encountered a problem with our MPC855T based appliances under heavy
> load. The crashes looked like this:

> The kernel crashed trying to execute address "00001FFC". I have seen similar
> reports in the Linux PPC list archives. The problem is that "bl transfer_to_handler"
> (transfer_to_handler is at "2000") was jumping to "1FFC" instead, on some rare occasions
> (only under heavy network/memory activity).

Here is my standard answer to bad things happening under heavy network
activity.  Something is likely wrong with the SDRAM UPM Burst Mode programming.
The only way you can get back to back burst mode bus operations is with the
core very busy and the CPM or FEC performing DMA.  Neither one on their own
can generate this special case bus cycle.  I've seen this myself, and the
cause was always the same.  It's a PITA to debug, but I still suspect that is
the problem.

I don't remember the details of our IRC discussion, but one thing I would suggest
to test this is setting the Burst Inhibit (BI) flag in the memory controller
for the SDRAM chip select.

> After thinking for a while and talking to Dan Malek, it seems "isync" instructions before
> "bl transfer_to_handler" are required to avoid cache coherency problems.

I was actually thinking of a different interrupt controller problem.  I am
surprised this works.  This isn't a cache coherency problem.

> I'm not exactly sure why we were jumping to "1FFC" instead of "2000",
> but adding "isync" before "bl transfer_to_handler" in both DecrementTimer
> and HardwareInterrupt fixed the problem for us.

That's just too weird.  We need to understand why this happens.  Here is another
test.  At about line 652, change the:

	. = 0x2000

to:

	. = 0x1ffc
	nop

Let's see if it happens to jump to any other location or if this one is
special.
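
The effect of that change on the layout can be pictured with a toy model of the assembler's location counter. This is only a sketch: it assumes the transfer_to_handler label directly follows the directive being changed and that every instruction is 4 bytes; the place() helper and its directive tuples are hypothetical.

```python
def place(directives):
    """Return {address: name} for a list of ("org", addr) / ("insn", name) items."""
    layout = {}
    loc = 0
    for kind, value in directives:
        if kind == "org":
            loc = value            # behaves like ". = value" in the assembler
        elif kind == "insn":
            layout[loc] = value    # one 4-byte instruction at the current location
            loc += 4
        else:
            raise ValueError("unknown directive: %r" % (kind,))
    return layout


original = place([("org", 0x2000), ("insn", "transfer_to_handler")])
modified = place([("org", 0x1FFC), ("insn", "nop"),
                  ("insn", "transfer_to_handler")])

if __name__ == "__main__":
    for addr, name in sorted(modified.items()):
        print(hex(addr), name)
```

Under those assumptions transfer_to_handler stays at 0x2000 and 0x1FFC now holds a nop, so a stray branch to 0x1FFC would simply fall through into the handler. Any crash that still occurs would then have to come from a different bad target.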

> On the following patch against 2.4.25 I also add "isync" .....

Let's put a big comment around this.  Indicate it was a problem for
one person with an 855T.  I don't have any 855T parts, if anyone else has
some and can do some heavy network testing, I'd appreciate knowing the
results.  Like I keep saying, I've seen similar problems on the 860T parts,
but it was clearly my fault programming the UPM.  Once that was fixed,
problem solved.

Thanks.

	-- Dan


* Re: Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors
  2004-03-26  6:51 ` Dan Malek
@ 2004-03-26  9:04   ` Wolfgang Denk
  2004-03-26 13:44     ` Dan Malek
  0 siblings, 1 reply; 6+ messages in thread
From: Wolfgang Denk @ 2004-03-26  9:04 UTC (permalink / raw)
  To: Dan Malek; +Cc: Marcelo Tosatti, linuxppc-embedded


In message <4063D300.2060006@embeddededge.com> Dan Malek wrote:
>
> Let's put a big comment around this. Indicate it was a problem for one
> person with an 855T. I don't have any 855T parts, if anyone else has
> some and can do some heavy network testing, I'd appreciate knowing the

Can you define which sort of network load is suitable to sufficiently
stress the system? We have many kinds of systems of all types in the
lab, and I can easily run such a test.

But I actually doubt that we would see any problem, as we perform
such testing on a regular basis, and never had any such problems on
8xx systems.

> results. Like I keep saying, I've seen similar problems on the 860T
> parts, but it was clearly my fault programming the UPM. Once that was
> fixed, problem solved.

I agree 100%.

Best regards,

Wolfgang Denk

--
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
Quantum Mechanics is God's version of "Trust me."


* Re: Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors
  2004-03-26  9:04   ` Wolfgang Denk
@ 2004-03-26 13:44     ` Dan Malek
  0 siblings, 0 replies; 6+ messages in thread
From: Dan Malek @ 2004-03-26 13:44 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Marcelo Tosatti, linuxppc-embedded

Wolfgang Denk wrote:

> Can you define which sort of network load is suitable to sufficiently
> stress the system? We have many kinds of systems of all types in the
> lab, and I can easily run such a test.

Copying files on a root NFS usually works for me. :-)

> But I actually doubt that we would see any problem,

I just don't have an 855T handy, although I doubt this problem would
be unique to that part.  I just wanted to cover that base.

Thanks.

	-- Dan


* RE: Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors
  2004-03-25 23:13 Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors Marcelo Tosatti
  2004-03-26  6:51 ` Dan Malek
@ 2004-03-26  7:23 ` LC Geldenhuys
  2004-03-26  8:07   ` Dan Malek
  1 sibling, 1 reply; 6+ messages in thread
From: LC Geldenhuys @ 2004-03-26  7:23 UTC (permalink / raw)
  To: linuxppc-embedded


> -----Original Message-----
> From: owner-linuxppc-embedded@lists.linuxppc.org
> [mailto:owner-linuxppc-embedded@lists.linuxppc.org] On Behalf
> Of Marcelo Tosatti
> Sent: 26 March 2004 01:14 AM
> To: linuxppc-embedded@lists.linuxppc.org
> Subject: Kernel Mode Software Emulation NIP: 00001FFC - cache
> coherency problem on m8xx processors

<snip>

> I'm not exactly sure why we were jumping to "1FFC" instead of "2000",
> but adding "isync" before "bl transfer_to_handler" in both
> DecrementTimer
> and HardwareInterrupt fixed the problem for us.

This sounds distinctly similar to the CPU13 erratum. Do you have
instruction fetch show cycles enabled in ICTRL[IST_SER]? See:
http://e-www.motorola.com/files/32bit/doc/errata/MPC860CE.pdf

Cheers,
  Lourens



* Re: Kernel Mode Software Emulation NIP: 00001FFC - cache coherency problem on m8xx processors
  2004-03-26  7:23 ` LC Geldenhuys
@ 2004-03-26  8:07   ` Dan Malek
  0 siblings, 0 replies; 6+ messages in thread
From: Dan Malek @ 2004-03-26  8:07 UTC (permalink / raw)
  To: lourens; +Cc: linuxppc-embedded

LC Geldenhuys wrote:

> This sounds distinctly similar to the CPU13 erratum.

Not exactly.  This isn't the first instruction of the exception
handler.  It's the first instruction of the general handler
that all of them call after executing about 30 instructions of
context saving.

There are, however, several errata associated with the setting
of the ICTRL[IST_SER] and page boundary instruction execution.

Thanks for pointing out the need to ensure the ICTRL is set for
normal, "no show" operation.  It's worth checking that, too.  The
reset default is not what you want here; the boot ROM should
set this register to something suitable for normal operation.

	-- Dan

