Questions about ATA Command Timeouts and kernel lockups

* Questions about ATA Command Timeouts and kernel lockups
@ 2005-11-13  0:10 Chase Venters
From: Chase Venters @ 2005-11-13  0:10 UTC
  To: linux-ide

Greetings,
	I'm using an ICH6. I have four WD 7200RPM SATA 320GB hard drives hanging off 
of it, on top of which I'm running raid10 and reiserfs. Over the last month 
and a half, I've started to have an excessive number of hardware problems.
	The first sign of trouble was when I got one or two lockups that seemed to be 
caused by overheating. Then I had an expensive GeForce die on me. I started a 
process of replacing hardware, while attempting to diagnose the problem. I 
switched from a quiet fan (on a Prescott, which runs way hot) to a bigger 
Thermaltake cooler. I swapped in a brand new 550 Watt PSU with 2 independent 
12V rails. I replaced the GeForce, and replaced the motherboard with 
(another) ASUS P5GDC-V. 
	In this process, I've had to deal with data corruption severe enough to 
totally bork my system on two occasions. On one occasion, I kept having 
random reboots / crashes. After a series of many of these crashes, tree 
nodes on ReiserFS got so badly damaged that attempting to access certain 
parts of the disk resulted in kernel panics.
	After swapping in the PSU, I rebuilt the system on a fresh ReiserFS 
partition. It seemed to be working quite well until I started getting lockups 
again within the last week. 
	This time, I was equipped with a serial console and SysRQ to see what the 
hell was going on. Long story short, I'm experiencing random lockups under IO 
load. This is all the kernel has to say before totally locking up:

ATA: abnormal status 0x80 on port 0x9807
ATA: abnormal status 0x80 on port 0x9807
ATA: abnormal status 0x80 on port 0x9807
ata2: command 0x25 timeout, stat 0x51 host_stat 0x61
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
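
For reference, the status and error bytes in the log above can be decoded bit by bit. The following is only a minimal sketch in Python, assuming the conventional ATA status and error register bit layout. The table and function names are illustrative and are not taken from the kernel source. The labels mirror the ones the kernel prints where the log shows them, and the remaining labels are descriptive guesses at naming.

```python
# Sketch: decode ATA status / error register bytes into flag names.
# Bit positions follow the conventional ATA register layout (assumed here);
# labels not seen in the log above are descriptive, not kernel-exact.

STATUS_BITS = [
    (0x80, "Busy"),            # BSY
    (0x40, "DriveReady"),      # DRDY
    (0x20, "DeviceFault"),     # DF
    (0x10, "SeekComplete"),    # DSC
    (0x08, "DataRequest"),     # DRQ
    (0x04, "CorrectedError"),  # CORR
    (0x02, "Index"),           # IDX
    (0x01, "Error"),           # ERR
]

ERROR_BITS = [
    (0x80, "BadCRC"),                # ICRC (formerly bad block)
    (0x40, "UncorrectableError"),    # UNC
    (0x20, "MediaChanged"),          # MC
    (0x10, "SectorIdNotFound"),      # IDNF
    (0x08, "MediaChangeRequested"),  # MCR
    (0x04, "DriveStatusError"),      # ABRT (command aborted)
    (0x02, "TrackZeroNotFound"),     # TK0NF
    (0x01, "AddrMarkNotFound"),      # AMNF
]


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    print("stat 0x80:", decode(0x80, STATUS_BITS))
    print("stat 0x51:", decode(0x51, STATUS_BITS))
    print("err  0x04:", decode(0x04, ERROR_BITS))
```

Read this way, the repeated "abnormal status 0x80" lines mean the port was reporting only the busy bit, and error 0x04 is the aborted-command bit.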

	After getting this message, the kernel refuses to answer even SysRQ. 
	I posted about this once before and had a private conversation with another 
member about the possibility of heat or power issues. I no longer believe 
this to be the case - the 550 Watt supply is brand new, and I'm now in the 
process of rebuilding Gentoo on *another* fresh ReiserFS partition (the 
lockup pertaining to the messages above occurred last night and left 2 drives 
out of sync, while a third was already sitting out with a backup of my data 
from a month ago). The drives are cooled by a 120mm fan, and even under the 
load of an md resync and a compile, I can rest my hands on them for long 
periods of time without getting burned. They're a little warm, but nothing 
they shouldn't be able to handle.
	The reason I'm sending this in to the list is because the last option I feel 
that I have is to buy a new motherboard other than the P5GDC-V. This process 
is getting expensive and I'd like to stop having crashes. I can accept that I 
am dealing with a hardware issue, but my big question is this:

	Is it considered normal for the kernel to lock up, not responding to SysRQ 
over RS232, upon experiencing a command timeout as shown above? 

	I'll admit to not being really up to speed on the interaction between the 
kernel and the bus over DMA, but it seems to me that since this diagnostic 
information *does* get printed, followed by a total lockup, that the kernel 
may be falling into a tight loop with interrupts disabled. To this end, I'm 
building the kernel that will run on my system once it's rebuilt to include 
sleep while holding spinlock checking.
	Is there anything else I can do, in case this crash hits me again?

Thanks,
Chase Venters


* RE: Questions about ATA Command Timeouts and kernel lockups
@ 2005-11-15 10:00 SMALL, Timothy
From: SMALL, Timothy @ 2005-11-15 10:00 UTC
  To: 'Chase Venters', linux-ide

-----Original Message-----
> From: Chase Venters [mailto:chase.venters@clientec.com]
> kernel and the bus over DMA, but it seems to me that since this diagnostic
> information *does* get printed, followed by a total lockup, that the kernel
> may be falling into a tight loop with interrupts disabled. To this end, I'm
> building the kernel that will run on my system once it's rebuilt to include
> sleep while holding spinlock checking.
> 	Is there anything else I can do, in case this crash hits me again?

Enable the NMI watchdog?
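
On x86 kernels of this era the NMI watchdog is normally switched on with a boot parameter, nmi_watchdog=1 (IO-APIC) or nmi_watchdog=2 (local APIC). Once it is active, the NMI line in /proc/interrupts should keep increasing, and a CPU stuck with interrupts disabled should produce a lockup report on the console. The following is a rough sketch for checking that the counter is actually ticking. It assumes the usual "NMI:" row format in /proc/interrupts, and the helper names are made up for illustration.

```python
# Sketch: check whether the NMI counter in /proc/interrupts is advancing.
# Assumes a row of the form "NMI:  <count per CPU> ... [description]".

def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for tok in fields[1:]:
                if not tok.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(tok))
            return counts
    return []  # no NMI row found


def watchdog_ticking(before, after):
    """True if every CPU's NMI count grew between the two samples."""
    return (
        bool(before)
        and len(before) == len(after)
        and all(b > a for a, b in zip(before, after))
    )
```

To use it, read /proc/interrupts twice a few seconds apart, pass each snapshot through nmi_counts(), and compare the results with watchdog_ticking(). If the counts stay at zero, the watchdog is probably not enabled, or the chosen mode does not work on this board.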
