* Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
@ 2008-08-13 22:27 Linda Walsh
2008-08-14 10:50 ` Alan Cox
0 siblings, 1 reply; 12+ messages in thread
From: Linda Walsh @ 2008-08-13 22:27 UTC (permalink / raw)
To: linux-ide
I can format and initialize it, but attempting to do a massive copy
from one of the PATA drives->the SATA drive, _eventually_ causes the
machine to lock up.
I don't know if this is the cause, but with closer monitoring
I was trying (again) to move files from PATA disks to my SATA
controller w/SATA disks a few at a time -- and caught the copy
process going into a 'blocked' state -- which eventually had about
8 processes blocked before the disk was 'disabled' by the kernel
and the processes were all released from the blocked state -- I
then was able to get the log files appended below:
The disk is a Western Digital-1TB (new disk).
The controller is a Promise SATA 300TX4 (4-port Serial 3G with
NCQ/TCQ Support).
Is this a _lemon_ controller or brand? It looked like it
was supported under Linux.
The boot messages from the controller:
...(saw this: does it apply to the SATA driver? - came out with
first sd drive (SCSI))
Driver 'sd' needs updating - please use bus_type methods
...
sata_promise 0000:00:0d.0: version 2.11
ACPI: PCI Interrupt 0000:00:0d.0[A] -> GSI 16 (level, low) -> IRQ 17
scsi1 : sata_promise
scsi2 : sata_promise
scsi3 : sata_promise
scsi4 : sata_promise
ata1: SATA max UDMA/133 mmio m4096@0xfe0e4000 port 0xfe0e4380 irq 17
ata2: SATA max UDMA/133 mmio m4096@0xfe0e4000 port 0xfe0e4280 irq 17
ata3: SATA max UDMA/133 mmio m4096@0xfe0e4000 port 0xfe0e4200 irq 17
ata4: SATA max UDMA/133 mmio m4096@0xfe0e4000 port 0xfe0e4300 irq 17
ata1: SATA link down (SStatus 0 SControl 300)
ata2: SATA link down (SStatus 0 SControl 300)
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: ATA-8: WDC WD1001FALS-00J7B0, 05.00K05, max UDMA/133
ata4.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 0/32)
ata4.00: configured for UDMA/133
scsi 4:0:0:0: Direct-Access ATA WDC WD1001FALS-0 05.0 PQ: 0 ANSI: 5
sd 4:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
sd 4:0:0:0: [sdb] Write Protect is off
sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sd 4:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
sd 4:0:0:0: [sdb] Write Protect is off
sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sdb: sdb1
sd 4:0:0:0: [sdb] Attached SCSI disk
sd 4:0:0:0: Attached scsi generic sg1 type 0
-------------------------------------------------------------------
warning and error logs follow ---
(trimmed some of the prefix material
(month, host, kernel->kern...))---------------------
13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
(ATA bus error)
13 10:12:20 kern: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x190002
action 0xa frozen
13 10:12:20 kern: ata4.00: hotplug_status 0x4
13 10:12:20 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar }
13 10:12:20 kern: ata4.00: cmd 35/00:00:bf:3c:af/00:04:2b:00:00/e0 tag 0
dma 524288 out
13 10:12:20 kern: ata4.00: status: { Busy }
13 10:12:20 kern: ata4.00: error: { ICRC UNC IDNF ABRT }
13 10:12:26 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
13 10:13:25 last message repeated 2 times
13 10:13:30 kern: ata4: COMRESET failed (errno=-16)
13 10:12:30 kern: ata4: device not ready (errno=-16), forcing hard reset
13 10:13:30 kern: ata4: reset failed, giving up
13 10:13:30 kern: ata4: exception Emask 0x10 SAct 0x0 SErr 0x4190002
action 0xa frozen t4
13 10:13:30 kern: ata4: hotplug_status 0x40
13 10:13:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
13 10:12:37 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
13 10:12:40 kern: ata4: COMRESET failed (errno=-16)
13 10:12:57 last message repeated 2 times
13 10:13:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
13 10:13:30 kern: ata4.00: disabled
13 10:13:37 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
13 10:13:57 last message repeated 2 times
13 10:14:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
13 10:14:30 kern: ata4: COMRESET failed (errno=-16)
13 10:14:30 kern: ata4: reset failed, giving up
13 10:14:30 kern: ata4: exception Emask 0x10 SAct 0x0 SErr 0x4190002
action 0xa frozen t3
13 10:14:30 kern: ata4: hotplug_status 0x40
13 10:14:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
13 10:14:37 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
13 10:14:40 kern: ata4: COMRESET failed (errno=-16)
13 10:14:57 last message repeated 2 times
13 10:15:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
13 10:15:25 last message repeated 2 times
13 10:15:30 kern: ata4: COMRESET failed (errno=-16)
13 10:15:30 kern: ata4: reset failed, giving up
13 10:15:30 kern: ata4: exception Emask 0x10 SAct 0x0 SErr 0x4190002
action 0xa frozen t2
13 10:15:30 kern: ata4: hotplug_status 0x40
13 10:15:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar
DevExch }
13 10:15:37 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
13 10:15:40 kern: ata4: COMRESET failed (errno=-16)
13 10:15:50 kern: ata4: COMRESET failed (errno=-16)
13 10:16:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
13 10:16:30 last message repeated 2 times
13 10:16:30 kern: ata4: reset failed, giving up
13 10:16:30 kern: ata4: exception Emask 0x10 SAct 0x0 SErr 0x4190002
action 0xa frozen t1
13 10:16:30 kern: ata4: hotplug_status 0x40
13 10:16:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar
DevExch }
13 10:16:37 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
13 10:16:40 kern: ata4: COMRESET failed (errno=-16)
13 10:16:50 kern: ata4: COMRESET failed (errno=-16)
13 10:16:57 last message repeated 2 times
13 10:17:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
13 10:17:30 kern: Descriptor sense data with sense descriptors (in hex):
13 10:17:30 kern: end_request: I/O error, dev sdb, sector 732904639
13 10:17:30 kern: end_request: I/O error, dev sdb, sector 732905663
13 10:17:30 kern: lost page write due to I/O error on sdb1
13 10:17:30 last message repeated 2 times
13 10:17:30 kern: ata4: reset failed, giving up
13 10:17:30 kern: ata4: EH pending after 5 tries, giving up
13 10:17:30 kern: sd 4:0:0:0: rejecting I/O to offline device
13 10:17:30 last message repeated 152 times
13 10:17:30 kern: Buffer I/O error on device sdb1, logical block 91613461
13 10:17:30 kern: Buffer I/O error on device sdb1, logical block 91613462
13 10:17:30 kern: Buffer I/O error on device sdb1, logical block 91613463
13 10:17:30 kern: Buffer I/O error on device sdb1, logical block 91613464
13 10:17:31 last message repeated 9 times
13 10:17:31 kern: sd 4:0:0:0: [sdb] START_STOP FAILED
13 10:17:31 kern: Buffer I/O error on device sdb1, logical block 91613465
13 10:17:31 kern: Buffer I/O error on device sdb1, logical block 91613466
13 10:17:31 kern: Buffer I/O error on device sdb1, logical block 91613467
13 10:17:31 kern: Buffer I/O error on device sdb1, logical block 91613468
13 10:17:31 kern: Buffer I/O error on device sdb1, logical block 91613469
13 10:17:31 kern: Buffer I/O error on device sdb1, logical block 91613470
13 12:35:13 kern: printk: 17932 messages suppressed.
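For anyone reading the log above, the failed command line can be decoded by hand. The following is a minimal Python sketch, assuming the usual libata field order for a 48-bit command (cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). The function name `decode_taskfile` is made up for illustration and is not part of any kernel tool.

```python
import re

# Assumed libata print order for a failed command:
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
TASKFILE_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def decode_taskfile(line):
    """Decode command, LBA and sector count from a libata 'cmd' log line.

    Hypothetical helper; the LBA layout is only valid for 48-bit commands
    such as 0x35 (WRITE DMA EXT).
    """
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no taskfile found in line")
    command = int(m.group(1), 16)
    _feat, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    _hfeat, hnsect, hlbal, hlbam, hlbah = (int(b, 16) for b in m.group(3).split(":"))
    # 48-bit LBA: the "hob" (high order byte) registers carry bits 24..47.
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    # A 16-bit sector count of 0 means 65536 sectors for 48-bit commands.
    sectors = ((hnsect << 8) | nsect) or 65536
    return {"command": command, "lba": lba, "sectors": sectors}


info = decode_taskfile("ata4.00: cmd 35/00:00:bf:3c:af/00:04:2b:00:00/e0 tag 0")
print(hex(info["command"]), info["lba"], info["sectors"], info["sectors"] * 512)
```

For the line in this log, the decoded LBA is 732904639, which matches the first "end_request: I/O error" sector, and 1024 sectors of 512 bytes give the 524288 bytes shown after "dma".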
----
please 'CC' me, as my subscription to linux-ide has "lapsed" again
(kernel.org seems to drop subscriptions randomly...; no, I don't
think it is bounced email...too specific to k.o)
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-13 22:27 Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7 Linda Walsh
@ 2008-08-14 10:50 ` Alan Cox
2008-08-20 7:39 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Alan Cox @ 2008-08-14 10:50 UTC (permalink / raw)
To: Linda Walsh; +Cc: linux-ide
> The disk is a Western Digital-1TB (new disk).
> The controller is a Promise SATA 300TX4 (4-port Serial 3G with
> NCQ/TCQ Support).
Your drive appears to fall off the bus.
> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
> (ATA bus error)
Splat
> 13 10:12:20 kern: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x190002
> action 0xa frozen
> 13 10:12:20 kern: ata4.00: hotplug_status 0x4
> 13 10:12:20 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar }
SATA link dies
> 13 10:13:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
We try 1.5GBit
> 13 10:14:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
> (Status 0xff)
First guess would be a dud drive but it could be power or cabling or
firmware or ...
In all the sane cases I would have expected it to recover, particularly
if it was cabling.
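
As a reading aid for the SErr masks in the log (for example 0x4190002), here is a small Python sketch that maps a mask to the flag names libata prints. The bit positions are the SATA SError register bits as I understand them; the table and the function name `decode_serror` are illustrative, not copied from the kernel.

```python
# SError bit -> short name as printed by libata (assumed mapping).
SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


# 0x190002 is the mask that corresponds to the four flags of the first SError line.
print(decode_serror(0x190002))
print(decode_serror(0x4190002))
```

The second mask differs from the first only by bit 26 (DevExch), which fits the picture of the device dropping off the link after the initial PHY errors.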
Alan
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-14 10:50 ` Alan Cox
@ 2008-08-20 7:39 ` Tejun Heo
2008-08-28 1:46 ` Linda Walsh
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2008-08-20 7:39 UTC (permalink / raw)
To: Linda Walsh; +Cc: Alan Cox, linux-ide
Alan Cox wrote:
>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
>> (ATA bus error)
>
> Splat
>
>> 13 10:12:20 kern: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x190002
>> action 0xa frozen
>> 13 10:12:20 kern: ata4.00: hotplug_status 0x4
>> 13 10:12:20 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar }
>
> SATA link dies
>
>> 13 10:13:25 kern: ata4: limiting SATA link speed to 1.5 Gbps
>
> We try 1.5GBit
>
>> 13 10:14:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
>> (Status 0xff)
>
> First guess would be a dud drive but it could be power or cabling or
> firmware or ...
>
> In all the sane cases I would have expected it to recover, particularly
> if it was cabling.
Hmm... this could be either the drive or the controller. After that
happens, can you please unplug the power cable from the drive, wait a few
tens of seconds, plug it in again and see whether the drive comes back?
Even if the drive was serving the root fs, you should still be able to see
whether libata can converse with the device. Just don't unplug and
replug while libata is still trying to recover the device. Unwritten
data in the disk buffer will be lost when you unplug the power, and
libata would think it was just a transmission glitch, so the fs would
continue as if nothing had happened, which could result in massive filesystem
corruption.
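
One rough way to tell from the log whether libata is still retrying is to look at the last recovery-related message for the port. The Python sketch below is only a heuristic built on the message texts seen in this thread; the function name and the choice of patterns are assumptions, and it is no substitute for actually watching the console.

```python
import re

# Final message libata printed in this thread once error handling stopped.
GIVE_UP_RE = re.compile(r"(ata\d+): EH pending after \d+ tries, giving up")
# Messages that indicate recovery attempts are (or were) in progress.
ACTIVITY_RE = re.compile(
    r"(ata\d+)(?:\.\d+)?: .*(?:COMRESET failed|port is slow to respond"
    r"|reset failed|limiting SATA link speed)"
)


def ports_done_recovering(lines):
    """Return ports whose latest recovery-related message is the final give-up.

    Heuristic only: a port with retry messages after its last give-up line
    is treated as still recovering.
    """
    last_state = {}
    for line in lines:
        m = GIVE_UP_RE.search(line)
        if m:
            last_state[m.group(1)] = "gave_up"
            continue
        m = ACTIVITY_RE.search(line)
        if m:
            last_state[m.group(1)] = "recovering"
    return sorted(port for port, state in last_state.items() if state == "gave_up")


sample = [
    "13 10:16:50 kern: ata4: COMRESET failed (errno=-16)",
    "13 10:17:30 kern: ata4: reset failed, giving up",
    "13 10:17:30 kern: ata4: EH pending after 5 tries, giving up",
]
print(ports_done_recovering(sample))
```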
--
tejun
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-20 7:39 ` Tejun Heo
@ 2008-08-28 1:46 ` Linda Walsh
2008-08-28 7:03 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Linda Walsh @ 2008-08-28 1:46 UTC (permalink / raw)
To: linux-ide; +Cc: Tejun Heo, Alan Cox
Tejun Heo wrote:
> Alan Cox wrote:
>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
>>> (ATA bus error)
>>> 13 10:14:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
>>> (Status 0xff)
>> First guess would be a dud drive but it could be power or cabling or
>> firmware or ...
> Hmm... this could be either the drive or the controller.
----
Just to confirm -- this particular problem was due to a faulty
brand-new SATA Western_Digital drive that died. It hung the system
several times under load, but shortly after the above errors,
the system would not boot with that drive attached.
Secondary error: My ACPI implementation is, /apparently/, flaky.
I used to not be able to use ACPI back in the 2.2 timeframe. But
sometime in the 2.4 timeframe, ACPI started working with this system
(a 440BX based motherboard). I thought ACPI support had improved.
Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
48 hours max). But after I thought ACPI was 'fixed', booting with ACPI
(or not) resulted in a stable system.
But -- two different error types. Starting with the 2.6.25 series,
I started observing hangs again (same in the 2.6.26 series). My last
stable was 2.6.24.1. BUT -- I also occasionally noticed some rare
sporadic disk error messages (while looking for the cause of the hang) --
they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
even get a 2.6.24.7 kernel to stay up for more than 2 days).
My upgrade strategy for disks has been to move to SATA disks as
I needed to replace older PATA's. Had a lot of problems last Feb when
I tried to use SATA; after a few weeks of making no progress discovering
the source of the hangs, I went back to a PATA drive and took out the SATA
controller -- and the system went back to stable. Ok...I'm tired of
debugging this...let's stay with PATA for now.
Six months later...need another disk. Back to trying SATA...
more hangs (and a bad disk drive). It seems that in addition to
ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
board also would cause an ACPI based boot to eventually hang (max
runtime ~30 hours). Using the kernel boot option "acpi=noirq" seems to be
the key to stability now.
So I don't know exactly what changed -- but ACPI, which was working
(pre-SATA) seemed to stop being reliable after 2.6.24.1.
Any way I cut it, acpi=noirq now seems to be a requirement for
system stability. My ACPI version string shows it as "1.0"...so I'm
guessing there might have been some kinks in the implementation.
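
To double-check which of these options a running kernel was actually booted with, something like the following Python sketch could be used. It assumes the standard /proc/cmdline file; the helper names and the sample command line are made up for illustration.

```python
def acpi_related_params(cmdline):
    """Pick out the boot parameters discussed in this thread from a command line."""
    wanted_prefixes = ("acpi=", "pci=", "irqpoll")
    return [tok for tok in cmdline.split() if tok.startswith(wanted_prefixes)]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, not taken from the machine in this thread.
print(acpi_related_params("root=/dev/sda1 ro acpi=noirq quiet"))
```

On a live system, `acpi_related_params(read_cmdline())` would report the options in effect for the current boot.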
So had 4 different problems all converge at roughly the same time:
1) new SATA Western_Digital-1TB disk failure,
2) ACPI-induced instability in 2.6.25 and above
3) ACPI-induced instability with addition of new SATA controller
(including a rebuilt-for-sata-support 2.6.24.1).
4) Auxiliary cooling fan failed and system would get 'warm' (don't know
exact temps, but some disks were nearing 50C; normal is mid 30's,
except for the 15K system SCSI, which has its own attached fan, so
it's usually a few degrees cooler when the case-fans are operating
correctly).
However, the disk temps are not indicative of the CPU temps -- they
are only an indirect sign that case-airflow is sub-optimal. The
CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
warnings (have only ever seen 1). Usually the system will
just 'hang' (not the most helpful indicator in any event).
Thanks much for feedback that led me to figuring out (*crossing
fingers*) the problems and fixes...
Linda Walsh
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-28 1:46 ` Linda Walsh
@ 2008-08-28 7:03 ` Tejun Heo
2008-08-28 12:36 ` Thomas Renninger
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2008-08-28 7:03 UTC (permalink / raw)
To: Linda Walsh; +Cc: linux-ide, Alan Cox, Thomas Renninger, linux acpi
(cc'ing Thomas and linux-acpi for ACPI reference)
Linda Walsh wrote:
> Tejun Heo wrote:
>> Alan Cox wrote:
>>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
>>>> (ATA bus error)
>>>> 13 10:14:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
>>>> (Status 0xff)
>>> First guess would be a dud drive but it could be power or cabling or
>>> firmware or ...
>> Hmm... this could be either the drive or the controller.
> ----
>
> Just to confirm -- this particular problem was due to a faulty
> brand-new SATA Western_Digital drive that died. It hung the system
> several times under load, but shortly after the above errors,
> the system would not boot with that drive attached.
>
> Secondary error: My ACPI implementation is, /apparently/, flaky.
> I used to not be able to use acpi back in the 2.2 timeframe. But
> sometime in the 2.4 timeframe, ACPI started working with this system
> (a 440BX based motherboard). I thought ACPI support had improved.
> Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
> 48 hours max). But after I thought ACPI was 'fixed', booting with ACPI
> (or not) resulted in stable system.
>
> But -- two different error types. Starting with the 2.6.25 series,
> I started observing hangs again (same in the 2.6.26 series). My last
> stable was 2.6.24.1. BUT -- I also occasionally noticed some rare
> sporadic disk error messages (while looking for the cause of the hang) --
> they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
> even get a 2.6.24.7 kernel to stay up for more than 2 days).
>
> My upgrade strategy for disks has been to move to SATA disks as
> I needed to replace older PATA's. Had a lot of problems last Feb when
> I tried to use SATA; after a few weeks of making no progress discovering
> the source of the hangs, I went back to a PATA drive and took out the SATA
> controller -- and the system went back to stable. Ok...I'm tired of
> debugging this...let's stay with PATA for now.
>
> Six months later...need another disk. Back to trying SATA...
> more hangs (and a bad disk drive). It seems that in addition to
> ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
> board also would cause an ACPI based boot to eventually hang (max
> runtime ~30 hours). Using the kernel load option "acpi=noirq", seems to be
> the key to stability now.
>
> So I don't know exactly what changed -- but ACPI, which was working
> (pre-SATA) seemed to stop being reliable after 2.6.24.1.
> Any way I cut it, acpi=noirq now seems to be a requirement for
> system stability. My ACPI version string shows it as "1.0"...so I'm
> guessing there might have been some kinks in the implementation.
>
> So had 4 different problems all converge at roughly the same time:
> 1) new SATA Western_Digital-1TB disk failure,
> 2) ACPI-induced instability in 2.6.25 and above
> 3) ACPI induced instability with addition of new SATA controller
> (including a rebuilt-for-sata-support 2.6.24.1).
> 4) Auxiliary cooling fan failed and system would get 'warm' (don't know
> exact temps, but some disks were nearing 50C (normal is mid 30's,
> except for the 15K system SCSI. It has its own attached fan, so
> it's usually a few degrees cooler when the case-fans are operating
> correctly.
> However, the disk temps are not indicative of the CPU temps -- they
> are only an indirect sign that case-airflow is sub-optimal. The
> CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
> warnings (have only ever seen 1). Usually the system will
> just 'hang' (not the most helpful indicator in any event).
>
> Thanks much for feedback that led me to figuring out (*crossing
> fingers*) the problems and fixes...
--
tejun
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-28 7:03 ` Tejun Heo
@ 2008-08-28 12:36 ` Thomas Renninger
2008-08-29 10:20 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Thomas Renninger @ 2008-08-28 12:36 UTC (permalink / raw)
To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
On Thursday 28 August 2008 09:03:15 Tejun Heo wrote:
> (cc'ing Thomas and linux-acpi for ACPI reference)
>
> Linda Walsh wrote:
> > Tejun Heo wrote:
> >> Alan Cox wrote:
> >>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
> >>>> (ATA bus error)
> >>>> 13 10:14:30 kern: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
> >>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
> >>>> (Status 0xff)
> >>>
> >>> First guess would be a dud drive but it could be power or cabling or
> >>> firmware or ...
> >>
> >> Hmm... this could be either the drive or the controller.
> >
> > ----
> >
> > Just to confirm -- this particular problem was due to a faulty
> > brand-new SATA Western_Digital drive that died. It hung the system
> > several times under load, but shortly after the above errors,
> > the system would not boot with that drive attached.
> >
> > Secondary error: My ACPI implementation is, /apparently/, flaky.
> > I used to not be able to use acpi back in the 2.2 timeframe. But
> > sometime in the 2.4 timeframe, ACPI started working with this system
> > (a 440BX based motherboard). I thought ACPI support had improved.
> > Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
> > 48 hours max). But after I thought ACPI was 'fixed', booting with ACPI
> > (or not) resulted in stable system.
> >
> > But -- two different error types. Starting with the 2.6.25 series,
> > I started observing hangs again (same in the 2.6.26 series). My last
> > stable was 2.6.24.1. BUT -- I also occasionally noticed some rare
> > sporadic disk error messages (while looking for the cause of the hang) --
> > they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
> > even get a 2.6.24.7 kernel to stay up for more than 2 days).
> >
> > My upgrade strategy for disks has been to move to SATA disks as
> > I needed to replace older PATA's. Had a lot of problems last Feb when
> > I tried to use SATA; after a few weeks of making no progress discovering
> > the source of the hangs, I went back to a PATA drive and took out the SATA
> > controller -- and the system went back to stable. Ok...I'm tired of
> > debugging this...let's stay with PATA for now.
> >
> > Six months later...need another disk. Back to trying SATA...
> > more hangs (and a bad disk drive). It seems that in addition to
> > ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
> > board also would cause an ACPI based boot to eventually hang (max
> > runtime ~30 hours). Using the kernel load option "acpi=noirq", seems to
> > be the key to stability now.
> >
> > So I don't know exactly what changed -- but ACPI, which was working
> > (pre-SATA) seemed to stop being reliable after 2.6.24.1.
> > Any way I cut it, acpi=noirq now seems to be a requirement for
> > system stability. My ACPI version string shows it as "1.0"...so I'm
> > guessing there might have been some kinks in the implementation.
There is a bug:
http://bugzilla.kernel.org/show_bug.cgi?id=11044
There it is exactly the other way around:
PATA is not working, but SATA is. But:
pci=noacpi (which should have the same effect as acpi=noirq)
Hmm, the machines are rather different? Could be totally unrelated.
Hmm, are there dmesg logs from working and non-working kernels?
Also the system is really old. Why don't you stick to pci=noacpi or
even acpi=off?
What advantage do you want to get with ACPI (SATA works?)?
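
When comparing dmesg output from a working and a non-working kernel, stripping the timestamps first makes the real differences stand out. A minimal Python sketch, assuming the bracketed "[ seconds.micros]" timestamp prefix; the function names and the two sample lines are hypothetical.

```python
import difflib
import re

# Matches a leading printk timestamp such as "[    0.123456] ".
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(lines):
    """Drop timestamps and trailing newlines so equal messages compare equal."""
    return [TIMESTAMP_RE.sub("", line.rstrip("\n")) for line in lines]


def diff_dmesg(working, failing):
    """Return a unified diff (no context lines) of two dmesg captures."""
    return list(difflib.unified_diff(
        normalize(working), normalize(failing),
        fromfile="working", tofile="failing", lineterm="", n=0,
    ))


# Hypothetical sample lines, only to show the shape of the output.
working = ["[    1.000000] scsi1 : sata_promise\n", "[    1.200000] IRQ routed to 17\n"]
failing = ["[    1.050000] scsi1 : sata_promise\n", "[    1.300000] IRQ routed to 11\n"]
for line in diff_dmesg(working, failing):
    print(line)
```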
Thomas
> >
> > So had 4 different problems all converge at roughly the same time:
> > 1) new SATA Western_Digital-1TB disk failure,
> > 2) ACPI-induced instability in 2.6.25 and above
> > 3) ACPI induced instability with addition of new SATA controller
> > (including a rebuilt-for-sata-support 2.6.24.1).
> > 4) Auxiliary cooling fan failed and system would get 'warm' (don't know
> > exact temps, but some disks were nearing 50C (normal is mid 30's,
> > except for the 15K system SCSI. It has its own attached fan, so
> > it's usually a few degrees cooler when the case-fans are operating
> > correctly.
> > However, the disk temps are not indicative of the CPU temps -- they
> > are only an indirect sign that case-airflow is sub-optimal. The
> > CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
> > warnings (have only ever seen 1). Usually the system will
> > just 'hang' (not the most helpful indicator in any event).
> >
> > Thanks much for feedback that led me to figuring out (*crossing
> > fingers*) the problems and fixes...
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-28 12:36 ` Thomas Renninger
@ 2008-08-29 10:20 ` Tejun Heo
2008-08-29 11:39 ` Thomas Renninger
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2008-08-29 10:20 UTC (permalink / raw)
To: Thomas Renninger; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
Thomas Renninger wrote:
> Also the system is really old. Why don't you stick to pci=noacpi or
> even acpi=off?
> What advantage do you want to get with ACPI (SATA works?)?
I think this is the second time I've seen ACPI IRQ routing not work on
old ACPI. Is it possible to detect this and turn it off automatically? Or
does that risk breaking even more machines?
Thanks.
--
tejun
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-29 10:20 ` Tejun Heo
@ 2008-08-29 11:39 ` Thomas Renninger
2008-08-29 12:02 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Thomas Renninger @ 2008-08-29 11:39 UTC (permalink / raw)
To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
On Friday 29 August 2008 12:20:14 Tejun Heo wrote:
> Thomas Renninger wrote:
> > Also the system is really old. Why don't you stick to pci=noacpi or
> > even acpi=off?
> > What advantage do you want to get with ACPI (SATA works?)?
>
> I think this is the second time I see ACPI IRQ routing doesn't work on
> old ACPI.
Hey, that's great. I expected you to have seen many more (you inflicted
more than two on me? :) ).
There is a cutoff (date) for when it makes sense to use ACPI and when not
to. Ideally one would choose whatever the machine was
tested and certified with, but you can only guess.
Thus a lot of old machines need acpi=force or acpi=off.
> Is it possible to detect this and turn off automatically? Or
> does that risk breaking even more machines?
I do not know the IRQ and PCI details very well, but I expect the
problem is that you cannot detect whether an interrupt is wrongly set up.
While APIC vs PIC is a real HW switch and once done there is no way back,
ACPI vs legacy IRQ setup should just be about different ways of parsing and
getting the info on how to set up the IRQ.
If you set up the IRQ using ACPI information and you detect that something
went wrong, it should be no problem (hmm, maybe a solvable design problem in
the PCI layer) to use PCI config or whatever legacy info to re-set up the
IRQ.
But as said, I expect it's not easy/possible to detect when the IRQ is
wrongly set up. Maybe you can add in the devices:
test_irq_activity(..)
If this fails you can try to set it up again..., no, I do not think you want
to do that.
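
Just to illustrate the idea from userspace (this is not the in-kernel check imagined above, and test_irq_activity() does not exist), one could sample /proc/interrupts twice and see whether the counter for a given IRQ moved. A rough Python sketch with made-up helper names and sample text:

```python
def parse_interrupts(text):
    """Map IRQ label -> total count over all CPUs from /proc/interrupts-style text."""
    counts = {}
    lines = text.splitlines()
    if not lines:
        return counts
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        per_cpu = rest.split()[:ncpus]
        counts[label.strip()] = sum(int(f) for f in per_cpu if f.isdigit())
    return counts


def irq_is_active(before, after, irq):
    """True if the interrupt count for irq increased between two samples."""
    return parse_interrupts(after).get(irq, 0) > parse_interrupts(before).get(irq, 0)


# Hypothetical samples; real data would come from reading /proc/interrupts twice.
before = "           CPU0       CPU1\n 17:       1200       3400   IO-APIC-fasteoi   sata_promise\n"
after = "           CPU0       CPU1\n 17:       1250       3400   IO-APIC-fasteoi   sata_promise\n"
print(irq_is_active(before, after, "17"))
```

Note that an unchanged counter proves nothing by itself, since the device may simply have been idle during the sampling window, which is much the same detection problem as described above.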
Also, apart from old machines which might need noacpi, acpi=off, pci=noacpi
and related boot params, we do rather well IMO.
I remember:
- legacy IDE problems, one boiled down to a BIOS bug
- No PCI domain support, that broke one HP machine which seemed to be the
only one using it. Maybe it's already supported, rather old.
- yeah and some older machines I do not really remember
where pci=noacpi helped. IMO not worth an automated detection.
Especially for those old machines..., people know which param to use, you will
produce more grief than good.
There were several acpipnp problems recently, but this is another topic and
that needs fixing anyway, Bjorn is doing a really good job here.
Thomas
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-29 11:39 ` Thomas Renninger
@ 2008-08-29 12:02 ` Tejun Heo
2008-08-29 13:11 ` Thomas Renninger
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2008-08-29 12:02 UTC (permalink / raw)
To: Thomas Renninger; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
Thomas Renninger wrote:
> On Friday 29 August 2008 12:20:14 Tejun Heo wrote:
>> Thomas Renninger wrote:
>>> Also the system is really old. Why don't you stick to pci=noacpi or
>>> even acpi=off?
>>> What advantage do you want to get with ACPI (SATA works?)?
>> I think this is the second time I see ACPI IRQ routing doesn't work on
>> old ACPI.
> Hey, that's great. I expected you have seen much more (you inflicted me
> on more than two? :) ).
Yeah, I tend to redirect all IRQ routing related problems to you. :-P
...
> where pci=noacpi helped. IMO not worth an automated detection.
> Especially for those old machines..., people know which param to use, you will
> produce more grief than any good.
>
> There were several acpipnp problems recently, but this is another topic and
> that needs fixing anyway, Bjorn is doing a real good job here.
Hmm... Maybe what's necessary is to detect IRQ misrouting and turn on
irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
cared' cases too. Misrouted or killed IRQs cause a lot of grief and
sporadic ones are kinda widespread. For example, I got hit by one
during resume for an IRQ shared by USB, 1394 and sound controller after
probably hundreds of successful suspend/resume cycles (not in one go) on
the same machine and that somehow caused the sound driver to get stuck,
leaving no way out other than rebooting.
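
To see whether a machine has already hit a 'nobody cared' case, the kernel log can be scanned for that message. A small Python sketch, assuming the usual wording of the kernel's spurious-IRQ message; the function name is made up and the sample line is illustrative rather than taken from this thread.

```python
import re

# The kernel reports an unhandled IRQ as: irq N: nobody cared (try booting with ...)
NOBODY_CARED_RE = re.compile(r"irq (\d+): nobody cared")


def irqs_nobody_cared(lines):
    """Return the IRQ numbers reported as 'nobody cared', in first-seen order."""
    found = []
    for line in lines:
        m = NOBODY_CARED_RE.search(line)
        if m and int(m.group(1)) not in found:
            found.append(int(m.group(1)))
    return found


sample_log = [
    'irq 17: nobody cared (try booting with the "irqpoll" option)',
    "Disabling IRQ #17",
    'irq 17: nobody cared (try booting with the "irqpoll" option)',
]
print(irqs_nobody_cared(sample_log))
```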
Thanks.
--
tejun
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-29 12:02 ` Tejun Heo
@ 2008-08-29 13:11 ` Thomas Renninger
2008-08-29 13:18 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Thomas Renninger @ 2008-08-29 13:11 UTC (permalink / raw)
To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
On Friday 29 August 2008 14:02:59 Tejun Heo wrote:
> Thomas Renninger wrote:
> > On Friday 29 August 2008 12:20:14 Tejun Heo wrote:
> >> Thomas Renninger wrote:
...
> > where pci=noacpi helped. IMO not worth an automated detection.
> > Especially for those old machines..., people know which param to use, you
> > will produce more grief than any good.
> >
> > There were several acpipnp problems recently, but this is another topic
> > and that needs fixing anyway, Bjorn is doing a real good job here.
>
> Hmm... Maybe what's necessary it to detect IRQ misrouting and turn on
> irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
> cared' cases too.
But you risk that things never get fixed correctly.
At least yell loudly at the user that things must get fixed.
IMO a message (you already see one appearing?) at the right place:
try irqpoll, try xyz param is enough.
The current behavior is not that bad and not that many machines
(at least new machines) are affected, but as said I am not
deeply involved in PCI/IRQ things.
Thomas
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-29 13:11 ` Thomas Renninger
@ 2008-08-29 13:18 ` Tejun Heo
2008-08-29 13:31 ` Thomas Renninger
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2008-08-29 13:18 UTC (permalink / raw)
To: Thomas Renninger; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
Thomas Renninger wrote:
>>> There were several acpipnp problems recently, but this is another topic
>>> and that needs fixing anyway, Bjorn is doing a real good job here.
>> Hmm... Maybe what's necessary it to detect IRQ misrouting and turn on
>> irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
>> cared' cases too.
> But you risk that things never get fixed correctly.
> At least yell loudly at the user that things must get fixed.
> IMO the a message (you already see appearing?) at the right place:
> try irqpoll, try xyz param is enough.
Yeah, the kernel should scream like hell but keep working after screaming.
> The current behavior is not that bad and not that much machines
> (at least new machines) are affected, but as said I am not
> deeply involved in PCI/IRQ things.
The thing is IRQ storms occasionally happen on otherwise working
machines, taking down the IRQ and all the devices running off the IRQ, so
it's not as cut and dried as boot or no boot.
Thanks.
--
tejun
* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
2008-08-29 13:18 ` Tejun Heo
@ 2008-08-29 13:31 ` Thomas Renninger
0 siblings, 0 replies; 12+ messages in thread
From: Thomas Renninger @ 2008-08-29 13:31 UTC (permalink / raw)
To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi
On Friday 29 August 2008 15:18:08 Tejun Heo wrote:
> Thomas Renninger wrote:
> >>> There were several acpipnp problems recently, but this is another topic
> >>> and that needs fixing anyway, Bjorn is doing a real good job here.
> >>
> >> Hmm... Maybe what's necessary it to detect IRQ misrouting and turn on
> >> irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
> >> cared' cases too.
> >
> > But you risk that things never get fixed correctly.
> > At least yell loudly at the user that things must get fixed.
> > IMO the a message (you already see appearing?) at the right place:
> > try irqpoll, try xyz param is enough.
>
> Yeah, the kernel should scream like hell but keep working after screaming.
>
> > The current behavior is not that bad and not that much machines
> > (at least new machines) are affected, but as said I am not
> > deeply involved in PCI/IRQ things.
>
> The thing is IRQ storms occassionally happen on otherwise working
> machines taking down the IRQ and all the devices running off the IRQ, so
> it's not as cut and dry as boot or no boot.
AFAIK legacy IRQs can be routed somewhere else (below 16 while the APIC IRQ is
somewhere above), thus the IRQ may happen twice. Sounds a bit like what you
explained above.
At least I remember such a very specific problem from the
Real Time people.
(could eventually be switched off by very chipset specific quirks)
Anyway, this starts to get off topic and I am really the wrong one
to answer such questions, others probably know much more about this
than I do.
Thomas