Promise SATA oops

linux-ide.vger.kernel.org archive mirror

* Promise SATA oops
@ 2005-12-02  4:58 Aaron Lehmann
  2005-12-02  5:29 ` Jeff Garzik
  0 siblings, 1 reply; 7+ messages in thread
From: Aaron Lehmann @ 2005-12-02  4:58 UTC (permalink / raw)
  To: linux-ide, linux-kernel

I'm running 2.6.14.2 on an x86_64 (Athlon X2, i.e. SMP) with a Promise
TX4 SATAII 150 controller. The night I set up the machine, I got a
Promise-related oops (null pointer dereference IIRC), but was foolish
enough not to write it down.  Since then, the machine has been
unstable, and I've suspected the same thing is recurring, but since I
use X it's very difficult to actually get at the oops. I ended up
setting up a ramdisk with a static busybox that I could use to poke
around if anything interesting happened. Just now everything using the
filesystem went into D-state, so I checked dmesg and saw uncorrectable
errors being reported on /dev/sdd. The system froze completely within
a minute. When I rebooted, I got the oops at the end of this message.
I was only able to copy the portion that fit on the screen. A second
reboot was successful. My RAID5 arrays are resyncing now, and I expect
that to complete normally because I've had to go through a lot of
resyncs since I set this system up and they were all successful. Once
that's done, I guess I'll run badblocks on sdd and see if anything
turns up. It would be a shame if that drive is bad, considering that
my 4 hard drives are brand new ones to replace a failed array I had
lots of problems with.

Process scsi_eh_3 (pid: 25, threadinfo fff81001fbc0000, task ffff81001fbbcf40)
Stack: ffffffff80274291 ffff81001fc0f800 ffff81001fb2a340 ffff81001fe78000
       ffffffff8026d524 ffff81001fe78948 ffff81001fb2a340 ffff81001fe78428
       ffffffff80280006 ffff81001fe78428
Call Trace:<ffffffff80274291>{scsi_device_unbusy+33} <ffffffff8026d524>{scsi_finish_command+36}
       <ffffffff80280006>{ata_scsi_qc_complete+54} <ffffffff8027c32e>{ata_qc_complete+366}
       <ffffffff80281764>{pdc_eng_timeout+212} <ffffffff802716f0>{scsi_error_handler+0}
       <ffffffff8027fae5>{ata_scsi_error+21} <ffffffff80271790>{scsi_error_handler+160}
       <ffffffff80149430>{keventd_create_kthread+0} <ffffffff802716f0>{scsi_error_handler+0}
       <ffffffff80149430>{keventd_create_kthread+0} <ffffffff801496a9>{kthread+217}
       <ffffffff80130260>{schedule_tail+64} <ffffffff8010e746>{child_rip+8}
       <ffffffff80149430>{keventd_create_kthread+0} <ffffffff801495d0>{kthread+0}
       <ffffffff8010e73e>{child_rip+0}

Code: 80 3f 00 7e f9 e9 2e fe ff ff f3 90 80 3f 00 7e f9 e9 30 fe
console shuts up ...
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Promise SATA oops
  2005-12-02  4:58 Promise SATA oops Aaron Lehmann
@ 2005-12-02  5:29 ` Jeff Garzik
  2005-12-02 19:51   ` Aaron Lehmann
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Garzik @ 2005-12-02  5:29 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: linux-ide, linux-kernel

Aaron Lehmann wrote:
> I'm running 2.6.14.2 on an x86_64 (Athlon X2, i.e. SMP) with a Promise
> TX4 SATAII 150 controller. The night I set up the machine, I got a
> Promise-related oops (null pointer dereference IIRC), but was foolish
> enough not to write it down.  Since then, the machine has been
> unstable, and I've suspected the same thing is recurring, but since I
> use X it's very difficult to actually get at the oops. I ended up
> setting up a ramdisk with a static busybox that I could use to poke
> around if anything interesting happened. Just now everything using the
> filesystem went into D-state, so I checked dmesg and saw uncorrectable
> errors being reported on /dev/sdd. The system froze completely within
> a minute. When I rebooted, I got the oops at the end of this message.
> I was only able to copy the portion that fit on the screen. A second
> reboot was successful. My RAID5 arrays are resyncing now, and I expect
> that to complete normally because I've had to go through a lot of
> resyncs since I set this system up and they were all successful. Once
> that's done, I guess I'll run badblocks on sdd and see if anything
> turns up. It would be a shame if that drive is bad, considering that
> my 4 hard drives are brand new ones to replace a failed array I had
> lots of problems with.

This should be fixed in 2.6.15-rcX...

	Jeff

* Re: Promise SATA oops
  2005-12-02  5:29 ` Jeff Garzik
@ 2005-12-02 19:51   ` Aaron Lehmann
  2005-12-03 10:09     ` Erik Slagter
  2005-12-20 20:17     ` Aaron Lehmann
  0 siblings, 2 replies; 7+ messages in thread
From: Aaron Lehmann @ 2005-12-02 19:51 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide, linux-kernel

On Fri, Dec 02, 2005 at 12:29:01AM -0500, Jeff Garzik wrote:
> This should be fixed in 2.6.15-rcX...

Still isn't stable. It froze within hours after announcing in all
terminals that it was disabling a certain IRQ. Now the RAID is so
degraded that root can't even be mounted. Was the Promise controller a
bad choice for a reliable setup?

I may not have time to look at this further until late next week, but
I'll follow up with whatever I learn.

* Re: Promise SATA oops
  2005-12-02 19:51   ` Aaron Lehmann
@ 2005-12-03 10:09     ` Erik Slagter
  2005-12-20 20:17     ` Aaron Lehmann
  1 sibling, 0 replies; 7+ messages in thread
From: Erik Slagter @ 2005-12-03 10:09 UTC (permalink / raw)
  To: Aaron Lehmann; +Cc: Jeff Garzik, linux-ide, linux-kernel

On Fri, 2005-12-02 at 11:51 -0800, Aaron Lehmann wrote:
> On Fri, Dec 02, 2005 at 12:29:01AM -0500, Jeff Garzik wrote:
> > This should be fixed in 2.6.15-rcX...
> 
> Still isn't stable. It froze within hours after announcing in all
> terminals that it was disabling a certain IRQ. Now the RAID is so
> degraded that root can't even be mounted. Was the Promise controller a
> bad choice for a reliable setup?
> 
> I may not have time to look at this further until late next week, but
> I'll follow up with whatever I learn.

This looks very similar to the problem I had. It vanished completely when
I exchanged the power supply for a high-end one.

* Re: Promise SATA oops
  2005-12-02 19:51   ` Aaron Lehmann
  2005-12-03 10:09     ` Erik Slagter
@ 2005-12-20 20:17     ` Aaron Lehmann
  2005-12-27 23:51       ` Peter Smith
  2006-02-21  4:21       ` Aaron Lehmann
  1 sibling, 2 replies; 7+ messages in thread
From: Aaron Lehmann @ 2005-12-20 20:17 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide, linux-kernel

On Fri, Dec 02, 2005 at 11:51:09AM -0800, Aaron Lehmann wrote:
> Still isn't stable. It froze within hours after announcing in all
> terminals that it was disabling a certain IRQ. Now the RAID is so
> degraded that root can't even be mounted. Was the Promise controller a
> bad choice for a reliable setup?
> 
> I may not have time to look at this further until late next week, but
> I'll follow up with whatever I learn.

Argh, died again!! It had been stable for over 12 days. Same error
message, and the root md is degraded and dirty just like last time.
This is a very severe state with high risk of data loss. When things
went sour, terminals and most applications still kept working, but
anything that touched the filesystem froze up. I had a shell open in a
chroot on a ramdisk, but dmesg just hung for a few minutes and then
exited with a "Bus error". I had no other way of examining the kernel
log since the machine runs X.

This was running 2.6.15-rc4. Crashes seem to happen less frequently
with it than with 2.6.14.x, but when they happen they leave the RAID
in a severe state. I also don't think 2.6.14.2 said anything about
disabling the IRQ.

I'm very desperate now. About every week I experience a crash that
damages my RAID array to the point where it can't boot, as if the
instability wasn't bad enough. Do I need to buy a hardware RAID card?

* Re: Promise SATA oops
  2005-12-20 20:17     ` Aaron Lehmann
@ 2005-12-27 23:51       ` Peter Smith
  2006-02-21  4:21       ` Aaron Lehmann
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Smith @ 2005-12-27 23:51 UTC (permalink / raw)
  To: linux-ide

Aaron Lehmann wrote:

>Argh, died again!! It had been stable for over 12 days. Same error
>message, and the root md is degraded and dirty just like last time.
>This is a very severe state with high risk of data loss. When things
>went sour, terminals and most applications still kept working, but
>anything that touched the filesystem froze up. I had a shell open in a
>chroot on a ramdisk, but dmesg just hung for a few minutes and then
>exited with a "Bus error". I had no other way of examining the kernel
>log since the machine runs X.
>
>This was running 2.6.15-rc4. Crashes seem to happen less frequently
>with it than with 2.6.14.x, but when they happen they leave the RAID
>in a severe state. I also don't think 2.6.14.2 said anything about
>disabling the IRQ.
>
>I'm very desperate now. About every week I experience a crash that
>damages my RAID array to the point where it can't boot, as if the
>instability wasn't bad enough. Do I need to buy a hardware RAID card?
I personally wouldn't recommend a hardware RAID card. Are you still
experiencing difficulties? Have you *tried* an i386 build? Can you work
on getting more data out of the oopses? This would involve setting up
a serial console connection to another box and receiving dumps that way;
there are HOWTOs out there on this setup. Another possibility is the
Kernel Crash Dump project [1]. By the way, I have a (fairly simple) setup
here at my office, using Fedora Core 3, the 2.6.12-1.1381_FC3smp kernel, dual
P3-500MHz, 512MB RAM, two Promise Sata2-150 TX4s, and five Seagate 200GB
drives. I haven't had any problems with it since it was installed.
Granted, the hardware and software are a bit behind the curve, but it has
been sailing along quietly and steadily, so it is possible.

Peter

[1] http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.11google.com

* Re: Promise SATA oops
  2005-12-20 20:17     ` Aaron Lehmann
  2005-12-27 23:51       ` Peter Smith
@ 2006-02-21  4:21       ` Aaron Lehmann
  1 sibling, 0 replies; 7+ messages in thread
From: Aaron Lehmann @ 2006-02-21  4:21 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-ide, linux-kernel

This crash kept happening for months, across all versions of the
kernel I tried (up through early 2.6.16 git snapshots). I ended up
buying a different SATA card, and this seems to have fixed the
problem. At around the same frequency as I experienced the nasty
hanging, I'm seeing this error message:

ata1: command 0xea timeout, stat 0x51 host_stat 0x0
ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }

...but the system continues running fine. This leads me to believe
that there's a bug in the Promise SATA driver that prevents it from
gracefully handling this error condition, whatever it is. The hard
drives are model WDC WD3200JD-60K, and I couldn't find any bad blocks
on them.


On Tue, Dec 20, 2005 at 12:17:19PM -0800, Aaron Lehmann wrote:
> On Fri, Dec 02, 2005 at 11:51:09AM -0800, Aaron Lehmann wrote:
> > Still isn't stable. It froze within hours after announcing in all
> > terminals that it was disabling a certain IRQ. Now the RAID is so
> > degraded that root can't even be mounted. Was the Promise controller a
> > bad choice for a reliable setup?
> > 
> > I may not have time to look at this further until late next week, but
> > I'll follow up with whatever I learn.
> 
> Argh, died again!! It had been stable for over 12 days. Same error
> message, and the root md is degraded and dirty just like last time.
> This is a very severe state with high risk of data loss. When things
> went sour, terminals and most applications still kept working, but
> anything that touched the filesystem froze up. I had a shell open in a
> chroot on a ramdisk, but dmesg just hung for a few minutes and then
> exited with a "Bus error". I had no other way of examining the kernel
> log since the machine runs X.
> 
> This was running 2.6.15-rc4. Crashes seem to happen less frequently
> with it than with 2.6.14.x, but when they happen they leave the RAID
> in a severe state. I also don't think 2.6.14.2 said anything about
> disabling the IRQ.
> 
> I'm very desperate now. About every week I experience a crash that
> damages my RAID array to the point where it can't boot, as if the
> instability wasn't bad enough. Do I need to buy a hardware RAID card?
