From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chase Venters
Subject: Questions about ATA Command Timeouts and kernel lockups
Date: Sat, 12 Nov 2005 18:10:34 -0600
Message-ID: <200511121810.57220.chase.venters@clientec.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from relay00.pair.com ([209.68.5.9]:28430 "HELO relay00.pair.com") by vger.kernel.org with SMTP id S964890AbVKMALG (ORCPT ); Sat, 12 Nov 2005 19:11:06 -0500
Content-Disposition: inline
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: linux-ide@vger.kernel.org

Greetings,

I'm using an ICH6. I have four WD 7200RPM SATA 320GB hard drives hanging off of it, on top of which I'm running raid10 and reiserfs.

Over the last month and a half, I've started to have an excessive number of hardware problems. The first sign of trouble was when I got one or two lockups that seemed to be caused by overheating. Then I had an expensive GeForce die on me. I started a process of replacing hardware, while attempting to diagnose the problem. I switched from a quiet fan (on a Prescott, which runs way hot) to a bigger Thermaltake cooler. I swapped in a brand new 550 Watt PSU with 2 independent 12V rails. I replaced the GeForce, and replaced the motherboard with (another) ASUS P5GDC-V.

In this process, I've had to deal with data corruption severe enough to totally bork my system on two occasions. On one occasion, I kept having random reboots and crashes. After a series of many of these crashes, tree nodes on ReiserFS got so badly damaged that attempting to access certain parts of the disk resulted in kernel panics.

After swapping in the PSU, I rebuilt the system on a fresh ReiserFS partition. It seemed to be working quite well until I started getting lockups again within the last week. This time, I was equipped with a serial console and SysRQ to see what the hell was going on.
Long story short, I'm experiencing random lockups under IO load. This is all the kernel has to say before totally locking up:

ATA: abnormal status 0x80 on port 0x9807
ATA: abnormal status 0x80 on port 0x9807
ATA: abnormal status 0x80 on port 0x9807
ata2: command 0x25 timeout, stat 0x51 host_stat 0x61
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }

After getting this message, the kernel refuses to answer even SysRQ.

I posted about this once before and had a private conversation with another member about the possibility of heat or power issues. I no longer believe this to be the case: the 550 Watt supply is brand new, and I'm now in the process of rebuilding Gentoo on *another* fresh ReiserFS partition (the lockup pertaining to the messages above occurred last night and left 2 drives out of sync, while a third was already sitting out with a backup of my data from a month ago). The drives are cooled by a 120mm fan, and even under the load of an md resync and a compile, I can rest my hands on them for long periods of time without getting burned. They're a little warm, but nothing they shouldn't be able to handle.

I'm sending this to the list because the last option I feel I have is to buy a new motherboard other than the P5GDC-V. This process is getting expensive and I'd like to stop having crashes. I can accept that I am dealing with a hardware issue, but my big question is this:

Is it considered normal for the kernel to lock up, not responding to SysRQ over RS232, upon experiencing a command timeout as shown above?

I'll admit to not being really up to speed on the interaction between the kernel and the bus over DMA, but it seems to me that since this diagnostic information *does* get printed, followed by a total lockup, the kernel may be falling into a tight loop with interrupts disabled.
To this end, I'm building the kernel that will run on my system once it's rebuilt to include sleep-while-holding-spinlock checking. Is there anything else I can do, in case this crash hits me again?

Thanks,
Chase Venters
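
P.S. As a reading aid for the hex values in the log above, here is a minimal sketch that decodes them bit by bit. The bit names follow my understanding of the standard ATA status and error register layouts and the bus-master IDE status register. The table names and the decode() helper are made up for illustration and are not kernel interfaces.

```python
# Hypothetical helper for reading the register values in the log above.
# Bit assignments are my understanding of the ATA task-file registers and
# the bus-master IDE status register; treat them as assumptions.

STATUS_BITS = {
    0x80: "BSY",    # device busy
    0x40: "DRDY",   # drive ready
    0x20: "DF",     # device fault
    0x10: "DSC",    # seek complete
    0x08: "DRQ",    # data request
    0x04: "CORR",   # corrected data (obsolete)
    0x02: "IDX",    # index (obsolete)
    0x01: "ERR",    # error, details are in the error register
}

ERROR_BITS = {
    0x80: "ICRC/BBK",  # interface CRC error / bad block
    0x40: "UNC",       # uncorrectable data
    0x20: "MC",        # media changed
    0x10: "IDNF",      # ID not found
    0x08: "MCR",       # media change request
    0x04: "ABRT",      # command aborted
    0x02: "TK0NF",     # track 0 not found
    0x01: "AMNF",      # address mark not found
}

BMDMA_STATUS_BITS = {
    0x40: "DRV1_DMA_CAPABLE",
    0x20: "DRV0_DMA_CAPABLE",
    0x04: "INTR",    # interrupt raised
    0x02: "ERR",     # DMA transfer error
    0x01: "ACTIVE",  # DMA engine still running
}

# Only the opcode that appears in the log; 0x25 is READ DMA EXT.
COMMANDS = {0x25: "READ DMA EXT"}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in sorted(table.items(), reverse=True)
            if value & mask]


if __name__ == "__main__":
    print("abnormal status 0x80:", decode(0x80, STATUS_BITS))
    print("command 0x25:", COMMANDS[0x25])
    print("stat 0x51:", decode(0x51, STATUS_BITS))
    print("error 0x04:", decode(0x04, ERROR_BITS))
    print("host_stat 0x61:", decode(0x61, BMDMA_STATUS_BITS))
```

Read this way, 0x80 is a drive that still reports busy, 0x51 is ready and seek complete with the error bit set, and 0x04 is an aborted command, which the kernel prints as DriveStatusError. If the bus-master layout assumed here is right, host_stat 0x61 shows the DMA engine still marked active with neither the interrupt nor the error bit set.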