Lockup problem with 8260

linuxppc-dev.lists.ozlabs.org archive mirror

* Lockup problem with 8260
@ 2001-12-29 18:14 David Ashley
From: David Ashley @ 2001-12-29 18:14 UTC
  To: linuxppc-embedded

I'm getting a lockup problem. Our kernel is based on Montevista's HHL 2.4.2.
The system also has cache coherency problems, described below. I don't
know if they are related. There is a pci bridge in the system that I've
managed to initialize properly, and the pci system seems to be working
fine.

There are two ethernet ports on the board, and we're using the fcc_enet.c
driver. There is a usb controller on the board, which is a pci device. It
uses the usb-ohci.c driver. The cache problems we're having relate to
external devices acting as a bus master writing into the 8260's dram. The
8260 doesn't detect the changes, instead it uses stale data from its cache.
This problem is endemic to our board and is very irritating. As a hack to
work around the problem, in the case of the usb driver, I had to move all the
data io buffers into a 4 megabyte local bus ram which is non-cacheable.
So if the usb system tells the usb-ohci driver to send a packet out to the
usb controller, I've first got to copy it to the local bus ram, initiate
the transfer from there, wait until the interrupt signals the data is done,
then copy the buffer from the local bus ram back to the dram buffer. This
hacking into the driver is really grotesque and we shouldn't have to do that.
The 8260 is supposed to have a coherent cache, but for whatever reason our
hardware isn't working properly. No one knows enough here to know how to
fix the problem, or if it is fixable. Is it a configuration issue? Is it
something with the pci bridge we're using (which is a Tundra PowerSpan)?
Also Wolfgang in another thread said he thinks that could be a problem with
the A.1 mask of the 8260 we're using, but I suspect it's not that simple,
this cache trouble is so pervasive it has to be something misconfigured.
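
The copy-out, transfer, copy-back workaround described above is a bounce
buffer. A minimal sketch in C follows, under stated assumptions: the names
bounce_ram, bounce_stage and bounce_complete are hypothetical and do not
come from usb-ohci.c, and an ordinary array stands in for the non-cacheable
local bus ram.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the non-cacheable local bus RAM window.
 * On real hardware this would be a mapping of the local bus RAM,
 * not an ordinary array. */
#define BOUNCE_SIZE 4096u
static uint8_t bounce_ram[BOUNCE_SIZE];

/* Stage 'len' bytes from a cacheable DRAM buffer into the bounce area
 * before handing the bounce address to the bus-mastering device.
 * Returns the bounce address, or NULL if the data does not fit. */
void *bounce_stage(const void *dram_buf, size_t len)
{
    if (len > BOUNCE_SIZE)
        return NULL;
    memcpy(bounce_ram, dram_buf, len);
    return bounce_ram;
}

/* After the completion interrupt, copy the bounce area back into the
 * original DRAM buffer. Returns 0 on success, -1 if 'len' is too large. */
int bounce_complete(void *dram_buf, size_t len)
{
    if (len > BOUNCE_SIZE)
        return -1;
    memcpy(dram_buf, bounce_ram, len);
    return 0;
}
```

The cost is two extra copies per transfer, which is why this is a hack
rather than a fix.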

ANYWAY, that cache stuff is just background. The lockup bug happens when 3
things happen:
1) Load the usb module with no usb devices plugged in
2) Have one of the ethernet ports plugged into our LAN
3) Do heavy file IO, by repeatedly doing cat /sbin/* > /dev/null

If the usb isn't initialized, the lockup doesn't happen. If I unplug the
ethernet plug, the lockup doesn't happen. If I don't do the heavy file io,
I think it locks up eventually but not as quickly.

The file io makes heavy use of the function __copy_tofrom_user, which is
essentially memcpy, but I think it flushes the cache as it does the copy.
With the BDI2000 connected I can examine the state of the machine after the
lockup. A lot of the time the cpu is frozen on the same instruction in the
function __copy_tofrom_user. The instruction is
 lwz r9,12(r4)
That instruction loads the 32 bit word at location (r4+12) and puts it into
r9. When it is frozen there, r4 is always on an address with the low 5 bits
containing 0x14. The 8260's cache line size is 32 bytes. Before the
fetching of the word 12(r4) it has done 4(r4) and 8(r4). So at this
instruction the address just crossed a cache line. I don't know if that
is relevant, but I think it is. The 8260 I believe will prefetch from memory
an entire cache line once it is accessed. That means 8 memory fetches.
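
The cache line arithmetic above can be checked with a small C sketch. The
32-byte line size and the 4, 8 and 12 byte displacements are taken from
the description; the function names are made up for illustration.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* 8260 L1 line size, as stated above */

/* Index of the cache line that contains 'addr'. */
uint32_t cache_line_index(uint32_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Nonzero if the load at base+disp touches a different cache line
 * than the preceding load at base+prev_disp. */
int load_enters_new_line(uint32_t base, uint32_t prev_disp, uint32_t disp)
{
    return cache_line_index(base + disp) != cache_line_index(base + prev_disp);
}

/* Number of 32-bit words in one cache line. */
uint32_t words_per_line(void)
{
    return CACHE_LINE_BYTES / 4u;
}
```

With the low 5 bits of r4 equal to 0x14, the loads at 4(r4) and 8(r4) hit
offsets 0x18 and 0x1c of one line, and 12(r4) lands on offset 0 of the
next line.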

When the usb is initialized, the usb driver tells the usb controller the
address of a 256 byte block of memory in the 8260's memory space that the
usb can use for communication. Once the usb is initialized, about 2000
times per second it is writing a counter into offset 0x80 of that buffer.
That is the usb frame number. It is incrementing at about 2000/second, so
I assume it is being written on every increment but I don't know for sure.
The usb is not generating any hardware interrupts at all until something
is plugged into a usb port. There is a software timer invoked by the
usb driver which happens a few times a second, where it polls the usb
controller to see if anything has been plugged in. I have tried disabling
this software timer, so I am certain that no code in the usb driver is
ever being executed. The lockup still occurs. I believe the usb's
contribution to the problem is just the writing of the word to that address
2000 times a second.
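
For reference, the 256 byte block is the OHCI host controller
communications area (HCCA), and offset 0x80 is its frame number field. The
sketch below lays out the HCCA as defined by the OHCI specification. The
struct and function names are illustrative and are not the ones used in
usb-ohci.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout of the OHCI HCCA. The controller requires the block to be
 * 256-byte aligned; the defined fields occupy the first 252 bytes. */
struct ohci_hcca_layout {
    uint32_t int_table[32];        /* 0x00: interrupt ED pointers */
    uint16_t frame_no;             /* 0x80: written by the controller */
    uint16_t pad1;                 /* 0x82 */
    uint32_t done_head;            /* 0x84 */
    uint8_t  reserved_for_hc[116]; /* 0x88 */
};

/* Byte offset of the frame number inside the HCCA. */
size_t hcca_frame_no_offset(void)
{
    return offsetof(struct ohci_hcca_layout, frame_no);
}

/* Read the frame number from a raw HCCA image. OHCI stores HCCA fields
 * little-endian, so a big-endian cpu like the 8260 has to swap bytes. */
uint16_t hcca_read_frame_no(const uint8_t *hcca_bytes)
{
    size_t off = hcca_frame_no_offset();
    return (uint16_t)(hcca_bytes[off] | (hcca_bytes[off + 1] << 8));
}
```

Because the controller writes this field by itself as a bus master, it
touches the 8260's memory even when no usb driver code is running.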

When the ethernet cable is plugged in, even with no noticeable net traffic,
there are packets coming across, mostly ARP requests. Those are broadcast to
all hosts on a network. The form is like "what is the MAC address associated
with this IP address". Those happen pretty frequently, perhaps a few times
a second. When a packet comes in it generates an
interrupt which is serviced by the ethernet driver. I believe those
interrupts are the ethernet's contribution to the problem.

When the machine locks I can't single step anymore, the BDI2000 reports
COP freeze. The machine is in some dead state where nothing can get it out
except a reset. The BDI2000 tech support guys think it is in the checkstop
state. In the docs on the 8260 the cpu goes into the checkstop state if
certain events happen and the machine check exception is disabled. The
events are address parity error, data parity error, or the machine check
external line being driven low by an external device. There is some other
cause which I forget. Now, we're not running with the machine check exception
disabled, it is enabled. The hardware maker says there is nothing driving that
machine check line low on the 8260. A fellow I spoke to thinks the cpu can go
into the checkstop state if you get a machine check exception, then while
servicing that you get another one. He didn't say it that way, he just
said you get one, then another one.
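
The "one, then another one" case can be modelled in a few lines of C. This
is a sketch of how 603e-class PowerPC cores are documented to behave, and
it assumes the 8260's core follows the same rule: taking a machine check
exception clears MSR[ME], so a second machine check that arrives before the
handler sets ME again causes a checkstop. Check this against the core
manual for the exact part. The enum and function names are made up.

```c
#include <stdint.h>

#define MSR_ME 0x00001000u  /* machine check enable bit of the MSR */

enum mc_result { MC_EXCEPTION_TAKEN, MC_CHECKSTOP };

/* Model one machine check event against the current MSR value.
 * If ME is set, the exception is taken and ME is cleared until the
 * handler sets it again. If ME is clear, the core checkstops. */
enum mc_result machine_check(uint32_t *msr)
{
    if ((*msr & MSR_ME) == 0)
        return MC_CHECKSTOP;
    *msr &= ~MSR_ME;
    return MC_EXCEPTION_TAKEN;
}
```

Under this model, a checkstop is possible with the machine check exception
enabled, as long as two machine checks arrive close enough together.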

Anyway I'm at my wit's end on this one. Maybe someone more experienced can
look at these facts and the answer will jump out. That's what I'm hoping
at least :^). Thanks for any advice.

-Dave

** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/


* RE: Lockup problem with 8260
@ 2002-01-07  9:10 Babic Stefano
From: Babic Stefano @ 2002-01-07  9:10 UTC
  To: linuxppc-embedded

> Also Wolfgang in another thread said he thinks that could be
> a problem with
> the A.1 mask of the 8260 we're using, but I suspect it's not
> that simple,
> this cache trouble is so pervasive it has to be something
> misconfigured.

I can confirm the problems with the A.1 mask. According to Motorola, bus snooping is not working. The fcc_enet driver turns on bus snooping, but in our extensive tests we found a lot of problems (system locked up, wrong frame was sent, ...).
For this reason we changed the driver to disable bus snooping and use non-cached buffers for DMA (sigh!). The A.1 mask has other big problems, too: the MCC controller simply does not work (confirmed by Motorola). We upgraded to the B.3 mask, but we are still using local buffers for DMA.

-Stefano


* RE: Lockup problem with 8260
@ 2002-01-09 15:47 David Ashley
From: David Ashley @ 2002-01-09 15:47 UTC
  To: linuxppc-embedded

I've found out a lot more about the problems we're having, and gotten
some workarounds in place.

Three things have to be true in order for there to be the lockups we were
seeing:

1) 8260 is accessing cacheable region of 60x bus
2) CPM is accessing cacheable region of 60x bus
3) External device (in this case a pci bridge) is accessing
   cacheable region of 60x bus

I believe #1 and #2 have to both be accessing the same area of
memory. #3 can be accessing a completely separate area. I don't
know if all the attempts to access the bus have to be close together
in time.

What happens when it fails is bogus addresses start coming out of the
CPM (we think the CPM is the source, not the 8260). A series of bus faults
follows, and eventually the check stop state is entered. Frequently the
system crashes in a way where the CP (of the CPM) appears to be dead but the
8260 itself is alive and well--until inside the linux kernel it has to
busy wait for the CPM, say for outputting a character to the serial console.

The cache problem we were having was because the ESE bit in the SIUMCR
register was always off. Set that bit to 1, and suddenly the L1 cache
becomes coherent. The lockups occurred whether that bit was 0 or 1.
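
Setting the bit is a read-modify-write of SIUMCR. The C sketch below works
on a plain 32-bit value. The mask 0x40000000 (the bit next to the MSB) is
an assumption about where ESE sits and should be checked against the 8260
manual. The function names are illustrative.

```c
#include <stdint.h>

/* Assumed position of SIUMCR[ESE]; verify against the 8260 manual. */
#define SIUMCR_ESE 0x40000000u

/* Return the SIUMCR value with ESE set, other bits untouched. */
uint32_t siumcr_enable_ese(uint32_t siumcr)
{
    return siumcr | SIUMCR_ESE;
}

/* Nonzero if ESE is set in the given SIUMCR value. */
int siumcr_ese_is_set(uint32_t siumcr)
{
    return (siumcr & SIUMCR_ESE) != 0;
}
```

On real hardware the value would be read from and written back to the
memory-mapped register through a volatile pointer, early in board setup.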

The CPM's parameter blocks have bits telling whether the BD's and
buffers themselves are on the 60x bus or the local bus. There is a
bit GBL which is supposed to inform snooping devices to snoop this
address. I believe in the case of the CPM and 8260 accessing the bus,
the 8260 will always snoop CPM's accesses even if GBL isn't asserted
and even if ESE is 0 (disabling snooping). I think those bits only
affect devices outside the 8260, such as the pci bridge, mastering the
bus.

The workaround that is effective (rock solid operation) is to use
the local bus for all CPM's operations, meaning BD's and buffers. The dual
port ram is also taboo; it is equivalent to the 60x bus. Then in the
FCRx fields the DTB and BDB bits have to be set to 1, to
tell the CPM the buffers and BD's are on the local bus. This keeps the
CPM off of the 60x bus and prevents the lockup from occurring.
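
In code, the FCR change amounts to OR-ing two bits into each function code
byte. The C sketch below assumes DTB is 0x02 and BDB is 0x01 within the
8-bit FCR; those masks should be checked against the manual's FCRx field
descriptions. The function names are made up.

```c
#include <stdint.h>

/* Assumed bit masks within an 8-bit FCR; verify against the manual. */
#define FCR_DTB 0x02u  /* data buffers are on the local bus */
#define FCR_BDB 0x01u  /* buffer descriptors are on the local bus */

/* Return the FCR value with both local-bus bits set. Other bits,
 * such as GBL and byte ordering, are left as they were. */
uint8_t fcr_use_local_bus(uint8_t fcr)
{
    return (uint8_t)(fcr | FCR_DTB | FCR_BDB);
}

/* Nonzero only if both DTB and BDB are set. */
int fcr_is_local_bus(uint8_t fcr)
{
    return (fcr & (FCR_DTB | FCR_BDB)) == (FCR_DTB | FCR_BDB);
}
```

The same change applies to both the receive and transmit function code
fields of every CPM channel in use, with the BD's and buffers actually
placed in local bus memory.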

If the local bus memory is used but the DTB/BDB bits aren't set the system
still operates, but the lockups still occur. GBL has always been irrelevant.
ESE in the SIUMCR has to be set to 1 for a coherent cache between the
8260 and the outside world, say a pci bus master accessing the 60x bus.
I'm really shocked that no one on this newsgroup ever mentioned the ESE
bit, that seems to be an obvious first thing to look at for the cache
incoherency problems we were having.

Our chip is using the A.1 mask. This seems to be working perfectly well
with the dcache enabled. We have only the L1 cache, no L2 cache.
We have a small amount of dram hung off the local bus. This local bus
ram is not cacheable.

Other solutions have been to reserve a region of the 60x bus's dram as
non-cacheable, and use that for CPM operations. We're not going that route.
The pci bus masters are accessing cacheable memory of the 60x bus and it
appears to be working perfectly.

Short answer: Keep the CPM off the 60x bus. And the dual port ram counts
as the 60x bus. I haven't tried using dual port ram for BD's and buffers
while keeping DTB and BDB set to 1; I would think that might not work.

-Dave
