From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 29 Dec 2001 10:14:13 -0800
From: David Ashley
Message-Id: <200112291814.fBTIED100179@dave.home>
To: linuxppc-embedded@lists.linuxppc.org
Subject: Lockup problem with 8260
Sender: owner-linuxppc-embedded@lists.linuxppc.org
List-Id:

I'm getting a lockup problem. Our kernel is based on MontaVista's HHL 2.4.2. The system has cache problems where the cache doesn't work as desired. I don't know if that is related.

There is a PCI bridge in the system that I've managed to initialize properly, and the PCI system seems to be working fine. There are two ethernet ports on the board, and we're using the fcc_enet.c driver. There is a USB controller on the board, which is a PCI device. It uses the usb-ohci.c driver.

The cache problems we're having relate to external devices acting as a bus master and writing into the 8260's DRAM. The 8260 doesn't detect the changes; instead it uses stale data from its cache. This problem is endemic to our board and is very irritating.

As a hack to work around the problem in the USB driver, I had to move all the data IO buffers into a 4 megabyte local bus RAM which is non-cacheable. So if the USB system tells the usb-ohci driver to send a packet out to the USB controller, I've first got to copy it to the local bus RAM, initiate the transfer from there, wait until the interrupt signals the transfer is done, then copy the buffer from the local bus RAM back to the DRAM buffer. This hacking into the driver is really grotesque and we shouldn't have to do it.

The 8260 is supposed to have a coherent cache, but for whatever reason our hardware isn't working properly. No one here knows enough to say how to fix the problem, or whether it is fixable. Is it a configuration issue? Is it something with the PCI bridge we're using (which is a Tundra PowerSpan)?
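
For anyone trying to picture the USB workaround described above, it boils down to a bounce-buffer pattern. This is only a sketch, not the actual usb-ohci.c change: `bounce_transfer`, `xfer_fn`, and `fake_device_xfer` are hypothetical names, and the stub merely stands in for the real controller, which would DMA into the non-cacheable local bus RAM and signal completion by interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: start the transfer on the bounce buffer and
 * return 0 once the completion interrupt has been seen. */
typedef int (*xfer_fn)(void *bounce, size_t len);

/* Stand-in for the hardware (assumption for illustration only):
 * the "device" inverts every byte of the buffer it is handed. */
static inline int fake_device_xfer(void *bounce, size_t len)
{
    unsigned char *p = bounce;
    for (size_t i = 0; i < len; i++)
        p[i] = (unsigned char)~p[i];
    return 0;
}

/* Stage the cacheable DRAM buffer in non-cacheable RAM, let the device
 * work only on that copy, then copy the result back to DRAM. */
static inline int bounce_transfer(void *dram_buf, size_t len,
                                  void *bounce, size_t bounce_len,
                                  xfer_fn do_xfer)
{
    assert(dram_buf && bounce && do_xfer);
    if (len > bounce_len)
        return -1;                      /* does not fit in the bounce area */
    memcpy(bounce, dram_buf, len);      /* copy out to non-cacheable RAM */
    int rc = do_xfer(bounce, len);      /* device never touches cached DRAM */
    if (rc == 0)
        memcpy(dram_buf, bounce, len);  /* copy the result back */
    return rc;
}
```

The cost is two extra copies per transfer, which is why it feels so grotesque inside a driver.
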
Also, Wolfgang said in another thread that he thinks this could be a problem with the A.1 mask of the 8260 we're using, but I suspect it's not that simple. This cache trouble is so pervasive it has to be something misconfigured.

ANYWAY, that cache stuff is just background. The lockup bug happens when 3 things happen:

1) Load the USB module with no USB devices plugged in
2) Have one of the ethernet ports plugged into our LAN
3) Do heavy file IO, by repeatedly doing cat /sbin/* > /dev/null

If the USB isn't initialized, the lockup doesn't happen. If I unplug the ethernet plug, the lockup doesn't happen. If I don't do the heavy file IO, I think it locks up eventually but not as quickly.

The file IO makes heavy use of the function __copy_tofrom_user, which is just an equivalent to memcpy but flushes the cache as it does the copy, I think.

With the BDI2000 connected I can examine the state of the machine after the lockup. A lot of the time the CPU is frozen on the same instruction in the function __copy_tofrom_user. The instruction is

    lwz r9,12(r4)

That instruction loads the 32 bit word at location (r4+12) and puts it into r9. When it is frozen there, r4 is always at an address with the low 5 bits containing 0x14. The 8260's cache line size is 32 bytes. Before fetching the word at 12(r4) it has done 4(r4) and 8(r4). So at this instruction the address has just crossed a cache line. I don't know if that is relevant, but I think it is. The 8260, I believe, will prefetch an entire cache line from memory once it is accessed. That means 8 memory fetches.

When the USB is initialized, the USB driver tells the USB controller the address of a 256 byte block of memory in the 8260's memory space that the controller can use for communication. Once the USB is initialized, about 2000 times per second it is writing a counter into offset 0x80 of that buffer. That is the USB frame number.
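
The address arithmetic in the last few paragraphs can be checked with a small sketch. It assumes the 32-byte line size quoted above. `struct hcca_sketch` is a hypothetical mirror of the 256-byte shared block: only the 0x80 frame number offset comes from the description above, the other fields are an assumption based on the OHCI HCCA layout, and all helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   /* 8260 data cache line size in bytes */

/* Hypothetical layout of the 256-byte block shared with the controller.
 * Field names are illustrative, not taken from usb-ohci.c. */
struct hcca_sketch {
    uint32_t int_table[32];   /* 0x00: interrupt list pointers */
    uint16_t frame_no;        /* 0x80: frame number written by the controller */
    uint16_t pad1;            /* 0x82 */
    uint32_t done_head;       /* 0x84 */
    uint8_t  reserved[120];   /* 0x88: fills the block out to 256 bytes */
};

/* Byte offset of an address within its cache line. */
static inline uint32_t line_offset(uint32_t addr)
{
    return addr & (CACHE_LINE_SIZE - 1u);
}

/* Which cache line an address falls in. */
static inline uint32_t line_index(uint32_t addr)
{
    return addr / CACHE_LINE_SIZE;
}

/* True if base+disp_a and base+disp_b land in different cache lines. */
static inline int crosses_line(uint32_t base, uint32_t disp_a, uint32_t disp_b)
{
    return line_index(base + disp_a) != line_index(base + disp_b);
}

/* Sanity check that the sketch matches the offsets described in the text. */
static inline void check_hcca_layout(void)
{
    assert(offsetof(struct hcca_sketch, frame_no) == 0x80);
    assert(sizeof(struct hcca_sketch) == 256);
}
```

With r4 ending in 0x14, `crosses_line(r4, 8, 12)` is true and `line_offset(r4 + 12)` is 0, so the stalled load is the first access to a new cache line. Offset 0x80 is also a multiple of 32, so if the block itself is line aligned, the frame number is the first word of its own cache line.
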
The frame number is incrementing at about 2000/second, so I assume it is being written on every increment, but I don't know for sure.

The USB is not generating any hardware interrupts at all until something is plugged into a USB port. There is a software timer invoked by the USB driver which fires a few times a second, where it polls the USB controller to see if anything has been plugged in. I have tried disabling this software timer, so I am certain that no code in the USB driver is ever being executed. The lockup still occurs. I believe the USB's contribution to the problem is just the writing of the word to that address 2000 times a second.

When the ethernet cable is plugged in, even with no noticeable net traffic, there are packets coming across. Mostly ARP requests. Those are broadcast to all hosts on a network. The form is like "what is the MAC address associated with this IP address" and vice-versa. Those happen pretty frequently, perhaps a few times a second. When a packet comes in, it generates an interrupt which is serviced by the ethernet driver. I believe those interrupts are the ethernet's contribution to the problem.

When the machine locks I can't single step anymore; the BDI2000 reports COP freeze. The machine is in some dead state where nothing can get it out except a reset. The BDI2000 tech support guys think it is in the checkstop state. According to the docs on the 8260, the CPU goes into the checkstop state if certain events happen while the machine check exception is disabled. The events are address parity error, data parity error, or the machine check external line being driven low by an external device. There is some other cause which I forget.

Now, we're not running with the machine check exception disabled; it is enabled. The hardware maker says there is nothing driving that machine check line low on the 8260. A fellow I spoke to thinks the CPU can go into the checkstop state if you get a machine check exception, then while servicing that you get another one.
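
The double machine check theory can be written down as a toy model. This is a sketch under an assumption taken from the general PowerPC exception model, not verified on this board: taking a machine check exception clears MSR[ME], so a second machine check that arrives before the handler sets ME again puts the core into checkstop. `machine_check` and `mc_outcome` are hypothetical names; `MSR_ME` uses the 0x1000 value from the Linux PowerPC headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_ME 0x00001000u   /* machine check enable bit in the MSR */

enum mc_outcome {
    MC_EXCEPTION,   /* handler runs, machine stays alive */
    MC_CHECKSTOP    /* core halts, only a reset recovers it */
};

/* Toy model of what the core does when a machine check condition
 * (parity error, external machine check line, ...) is signalled. */
static inline enum mc_outcome machine_check(uint32_t *msr)
{
    assert(msr != NULL);
    if (!(*msr & MSR_ME))
        return MC_CHECKSTOP;   /* ME clear: checkstop instead of exception */
    *msr &= ~MSR_ME;           /* assumption: exception entry clears ME */
    return MC_EXCEPTION;
}
```

Starting with ME set, the first call returns `MC_EXCEPTION` and leaves ME clear; a second call before ME is restored returns `MC_CHECKSTOP`, which would match a lockup even though machine checks are nominally enabled.
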
The fellow didn't put it that way; he just said you get one, then another one.

Anyway, I'm at my wit's end on this one. Maybe someone more experienced can look at these facts and the answer will jump out. That's what I'm hoping at least :^).

Thanks for any advice.

-Dave

** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/