Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7

public inbox for linux-acpi@vger.kernel.org

* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
       [not found]     ` <48B60373.2050501@tlinx.org>
@ 2008-08-28  7:03       ` Tejun Heo
  2008-08-28 12:36         ` Thomas Renninger
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2008-08-28  7:03 UTC
  To: Linda Walsh; +Cc: linux-ide, Alan Cox, Thomas Renninger, linux acpi

(cc'ing Thomas and linux-acpi for ACPI reference)

Linda Walsh wrote:
> Tejun Heo wrote:
>> Alan Cox wrote:
>>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
>>>> (ATA bus error)
>>>> rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
>>>> (Status 0xff)
>>> First guess would be a dud drive but it could be power or cabling or
>>> firmware or ...
>> Hmm... this could be either the drive or the controller. 
> ----
> 
>    Just to confirm -- this particular problem was due to a faulty
> brand-new SATA Western_Digital drive that died.  It hung the system
> several times under load, but shortly after the above errors,
> the system would not boot with that drive attached.
> 
>    Secondary error:     My ACPI implementation is, /apparently/, flakey.
> I used to not be able to use acpi back in the 2.2 timeframe.  But
> sometime in the 2.4 timeframe, ACPI started working with this system
> (a 440BX based motherboard).  I thought ACPI support had improved.
> Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
> 48 hours max).  But after I thought ACPI was 'fixed', booting with ACPI
> (or not) resulted in stable system.
> 
>    But -- two different error types.  Starting with the 2.6.25 series,
> I started observing hangs again (same in the 2.6.26 series).  My last
> stable was 2.6.24.1.  BUT -- I also occasionally noticed some rare
> sporadic disk error messages (while looking for the cause of the hang) --
> they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
> even get a 2.6.24.7 kernel to stay up for more than 2 days).
> 
>    My upgrade strategy for disks has been to move to SATA disks as
> I needed to replace older PATA's.  Had a lot of problems last Feb when
> I tried to use SATA; after a few weeks of making no progress discovering
> the source of the hangs, I went back to a PATA drive and took out the SATA
> controller -- and system went back to stable.  Ok...I'm tired of
> debugging this...let's stay with PATA for now.
> 
>    Six months later...need another disk.  Back to trying SATA...
> more hangs (and a bad disk drive).  It seems that in addition to
> ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
> board also would cause an ACPI based boot to eventually hang (max
> runtime ~30 hours).  Using the kernel load option "acpi=noirq", seems to be
> the key to stability now.
> 
>    So I don't know exactly what changed -- but ACPI, which was working
> (pre-SATA) seemed to stop being reliable after 2.6.24.1.
>    Any way I cut it,  acpi=noirq   now seems to be a requirement for
> system stability.  My ACPI version string shows it as "1.0"...so I'm
> guessing there might have been some kinks in the implementation.
> 
>    So had 4 different problems all converge at roughly the same time:
> 1)  new SATA Western_Digital-1TB disk failure,
> 2)  ACPI-induced instability in 2.6.25 and above
> 3)  ACPI induced instability with addition of new SATA controller
>    (including a rebuilt-for-sata-support 2.6.24.1).
> 4)  Auxiliary cooling fan failed and system would get 'warm' (don't know
>    exact temps, but some disks were nearing 50C (normal is mid 30's,
>    except for the 15K system SCSI.  It has its own attached fan, so
>    it's usually a few degrees cooler when the case-fans are operating
>    correctly.
>    However, the disk temps are not indicative of the CPU temps -- they
>    are only an indirect sign that case-airflow is sub-optimal.  The
>    CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
>    warnings (have only ever seen 1).  Usually the system will
>    just 'hang' (not the most helpful indicator in any event).
> 
> Thanks much for feedback that led me to figuring out (*crossing
> fingers*) the problems and fixes...

-- 
tejun


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-28  7:03       ` Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7 Tejun Heo
@ 2008-08-28 12:36         ` Thomas Renninger
  2008-08-29 10:20           ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Renninger @ 2008-08-28 12:36 UTC
  To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

On Thursday 28 August 2008 09:03:15 Tejun Heo wrote:
> (cc'ing Thomas and linux-acpi for ACPI reference)
>
> Linda Walsh wrote:
> > Tejun Heo wrote:
> >> Alan Cox wrote:
> >>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
> >>>> (ATA bus error)
> >>>> rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
> >>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
> >>>> (Status 0xff)
> >>>
> >>> First guess would be a dud drive but it could be power or cabling or
> >>> firmware or ...
> >>
> >> Hmm... this could be either the drive or the controller.
> >
> > ----
> >
> >    Just to confirm -- this particular problem was due to a faulty
> > brand-new SATA Western_Digital drive that died.  It hung the system
> > several times under load, but shortly after the above errors,
> > the system would not boot with that drive attached.
> >
> >    Secondary error:     My ACPI implementation is, /apparently/, flakey.
> > I used to not be able to use acpi back in the 2.2 timeframe.  But
> > sometime in the 2.4 timeframe, ACPI started working with this system
> > (a 440BX based motherboard).  I thought ACPI support had improved.
> > Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
> > 48 hours max).  But after I thought ACPI was 'fixed', booting with ACPI
> > (or not) resulted in stable system.
> >
> >    But -- two different error types.  Starting with the 2.6.25 series,
> > I started observing hangs again (same in the 2.6.26 series).  My last
> > stable was 2.6.24.1.  BUT -- I also occasionally noticed some rare
> > sporadic disk error messages (while looking for the cause of the hang) --
> > they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
> > even get a 2.6.24.7 kernel to stay up for more than 2 days).
> >
> >    My upgrade strategy for disks has been to move to SATA disks as
> > I needed to replace older PATA's.  Had a lot of problems last Feb when
> > I tried to use SATA; after a few weeks of making no progress discovering
> > the source of the hangs, I went back to a PATA drive and took out the SATA
> > controller -- and system went back to stable.  Ok...I'm tired of
> > debugging this...let's stay with PATA for now.
> >
> >    Six months later...need another disk.  Back to trying SATA...
> > more hangs (and a bad disk drive).  It seems that in addition to
> > ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
> > board also would cause an ACPI based boot to eventually hang (max
> > runtime ~30 hours).  Using the kernel load option "acpi=noirq", seems to
> > be the key to stability now.
> >
> >    So I don't know exactly what changed -- but ACPI, which was working
> > (pre-SATA) seemed to stop being reliable after 2.6.24.1.
> >    Any way I cut it,  acpi=noirq   now seems to be a requirement for
> > system stability.  My ACPI version string shows it as "1.0"...so I'm
> > guessing there might have been some kinks in the implementation.
There is a bug:
http://bugzilla.kernel.org/show_bug.cgi?id=11044
There it is exactly the other way around:
PATA is not working, but SATA is. But:
pci=noacpi (which should have the same effect as acpi=noirq)
Hmm, the machines are rather different? Could be totally unrelated.

Hmm, are there dmesg outputs from working and non-working kernels?

Also the system is really old. Why don't you stick to pci=noacpi or
even acpi=off?
What advantage do you want to get with ACPI (SATA works?)?
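
For reference, a minimal sketch (in Python, not part of the original
mail) of how one might check which of the boot parameters discussed in
this thread are active on a running system. The function names and the
lookup table are hypothetical; the one-line descriptions paraphrase the
kernel parameter documentation as I understand it and should be checked
against the documentation for the kernel in use.

```python
from typing import List

# Assumption: only the parameters named in this thread are recognised.
KNOWN_OPTIONS = {
    "acpi=off": "disable ACPI",
    "acpi=force": "enable ACPI even if it would default to off",
    "acpi=noirq": "do not use ACPI for IRQ routing",
    "pci=noacpi": "do not use ACPI for IRQ routing or PCI scanning",
    "irqpoll": "search all handlers when an interrupt goes unhandled",
}


def acpi_irq_options(cmdline: str) -> List[str]:
    """Return the recognised ACPI/IRQ options found in a kernel command line."""
    found = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            # Bare flag such as "irqpoll".
            if token in KNOWN_OPTIONS:
                found.append(token)
            continue
        # Options like pci= accept a comma-separated list of values.
        for part in value.split(","):
            name = key + "=" + part
            if name in KNOWN_OPTIONS:
                found.append(name)
    return found


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a Linux machine, acpi_irq_options(read_cmdline()) would list whichever
of these options the running kernel was booted with.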

   Thomas

> >
> >    So had 4 different problems all converge at roughly the same time:
> > 1)  new SATA Western_Digital-1TB disk failure,
> > 2)  ACPI-induced instability in 2.6.25 and above
> > 3)  ACPI induced instability with addition of new SATA controller
> >    (including a rebuilt-for-sata-support 2.6.24.1).
> > 4)  Auxiliary cooling fan failed and system would get 'warm' (don't know
> >    exact temps, but some disks were nearing 50C (normal is mid 30's,
> >    except for the 15K system SCSI.  It has its own attached fan, so
> >    it's usually a few degrees cooler when the case-fans are operating
> >    correctly.
> >    However, the disk temps are not indicative of the CPU temps -- they
> >    are only an indirect sign that case-airflow is sub-optimal.  The
> >    CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
> >    warnings (have only ever seen 1).  Usually the system will
> >    just 'hang' (not the most helpful indicator in any event).
> >
> > Thanks much for feedback that led me to figuring out (*crossing
> > fingers*) the problems and fixes...


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-28 12:36         ` Thomas Renninger
@ 2008-08-29 10:20           ` Tejun Heo
  2008-08-29 11:39             ` Thomas Renninger
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2008-08-29 10:20 UTC
  To: Thomas Renninger; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

Thomas Renninger wrote:
> Also the system is really old. Why don't you stick to pci=noacpi or
> even acpi=off?
> What advantage do you want to get with ACPI (SATA works?)?

I think this is the second time I've seen ACPI IRQ routing fail on
old ACPI.  Is it possible to detect this and turn off automatically?  Or
does that risk breaking even more machines?

Thanks.

-- 
tejun


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-29 10:20           ` Tejun Heo
@ 2008-08-29 11:39             ` Thomas Renninger
  2008-08-29 12:02               ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Renninger @ 2008-08-29 11:39 UTC
  To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

On Friday 29 August 2008 12:20:14 Tejun Heo wrote:
> Thomas Renninger wrote:
> > Also the system is really old. Why don't you stick to pci=noacpi or
> > even acpi=off?
> > What advantage do you want to get with ACPI (SATA works?)?
>
> I think this is the second time I've seen ACPI IRQ routing fail on
> old ACPI.
Hey, that's great. I expected you have seen much more (you inflicted me
on more than two? :) ).
There is a cutoff date after which it makes sense to use ACPI and before
which it does not. Ideally one would choose based on what the machine was
tested and certified for, but you can only guess.
Thus a lot of old machines need acpi=force or acpi=off.
> Is it possible to detect this and turn off automatically?  Or 
> does that risk breaking even more machines?
I do not know the IRQ and PCI details very well, but I expect the
problem is that you cannot detect whether an interrupt is wrongly set up.

While apic vs pic is a real HW switch and once done there is no way back,
acpi vs legacy IRQ setup, should just be about different ways of parsing and 
getting info how to set up the irq.
If you set up the IRQ using ACPI information and you detect that something
went wrong, it should be no problem (hmm, maybe a solvable design problem in 
the pci layer) to use PCI config or whatever legacy info to re-set up the
IRQ.
But as said, I expect it's not easy/possible to detect when the IRQ is
wrongly set up. Maybe you can add in the devices:
test_irq_activity(..)
If this fails you can try to set it up again..., no I do not think you want
to do that.

Also, besides old machines which might need acpi=off, pci=noacpi
and related boot params, we do rather well IMO.
I remember:
    - legacy IDE problems, one boiled down to a BIOS Bug
    - No PCI domain support, that broke one HP machine which seemed to be the
      only one using it. Maybe it's already supported, rather old.
    - yeah and some older machines I do not really remember
where pci=noacpi helped. IMO not worth an automated detection.
Especially for those old machines..., people know which param to use, you will
produce more grief than any good.

There were several acpipnp problems recently, but this is another topic and 
that needs fixing anyway, Bjorn is doing a real good job here.

    Thomas


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-29 11:39             ` Thomas Renninger
@ 2008-08-29 12:02               ` Tejun Heo
  2008-08-29 13:11                 ` Thomas Renninger
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2008-08-29 12:02 UTC
  To: Thomas Renninger; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

Thomas Renninger wrote:
> On Friday 29 August 2008 12:20:14 Tejun Heo wrote:
>> Thomas Renninger wrote:
>>> Also the system is really old. Why don't you stick to pci=noacpi or
>>> even acpi=off?
>>> What advantage do you want to get with ACPI (SATA works?)?
>> I think this is the second time I've seen ACPI IRQ routing fail on
>> old ACPI.
> Hey, that's great. I expected you have seen much more (you inflicted me
> on more than two? :) ).

Yeah, I tend to redirect all IRQ routing related problems to you.  :-P

...
> where pci=noacpi helped. IMO not worth an automated detection.
> Especially for those old machines..., people know which param to use, you will
> produce more grief than any good.
> 
> There were several acpipnp problems recently, but this is another topic and 
> that needs fixing anyway, Bjorn is doing a real good job here.

Hmm... Maybe what's necessary is to detect IRQ misrouting and turn on
irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
cared' cases too.  Misrouted or killed IRQs cause a lot of grief and
sporadic ones are kinda widespread.  For example, I got hit by one
during resume for an IRQ shared by USB, 1394 and sound controller after
probably hundreds of successful suspend/resume cycles (not in one go) on
the same machine and that somehow caused the sound driver to get stuck
leaving no way out other than rebooting.
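
The idea above can be sketched as a toy model (in Python, not kernel
code): count unhandled interrupts per line and, when nearly all of them
in a window go unhandled, log a loud warning and switch only that line
to polling instead of killing it. The class name, the message text and
the default window and threshold are assumptions chosen for
illustration; they are loosely modelled on how the kernel's spurious
interrupt accounting is commonly described and are not taken from any
particular kernel version.

```python
class IrqLine:
    """Toy bookkeeping for one interrupt line (hypothetical model)."""

    def __init__(self, window=100000, threshold=99900):
        self.window = window        # interrupts per accounting window (assumed)
        self.threshold = threshold  # unhandled count that triggers fallback (assumed)
        self.count = 0
        self.unhandled = 0
        self.polled = False
        self.warnings = []

    def note_interrupt(self, handled):
        """Record one interrupt; fall back to polling if almost none are handled."""
        if self.polled:
            # Already in fallback mode; handlers are polled, nothing to count.
            return
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count < self.window:
            return
        if self.unhandled > self.threshold:
            # Scream, but keep working: poll this line rather than disable it.
            self.warnings.append("nobody cared: switching this IRQ to polling")
            self.polled = True
        # Start a fresh accounting window.
        self.count = 0
        self.unhandled = 0
```

The point of the sketch is the policy, not the numbers: the fallback is
scoped to one line, and the warning is still recorded so the underlying
misrouting is not silently hidden.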

Thanks.

-- 
tejun


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-29 12:02               ` Tejun Heo
@ 2008-08-29 13:11                 ` Thomas Renninger
  2008-08-29 13:18                   ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Renninger @ 2008-08-29 13:11 UTC
  To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

On Friday 29 August 2008 14:02:59 Tejun Heo wrote:
> Thomas Renninger wrote:
> > On Friday 29 August 2008 12:20:14 Tejun Heo wrote:
> >> Thomas Renninger wrote:
...
> > where pci=noacpi helped. IMO not worth an automated detection.
> > Especially for those old machines..., people know which param to use, you
> > will produce more grief than any good.
> >
> > There were several acpipnp problems recently, but this is another topic
> > and that needs fixing anyway, Bjorn is doing a real good job here.
>
> Hmm... Maybe what's necessary is to detect IRQ misrouting and turn on
> irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
> cared' cases too.
But you risk that things never get fixed correctly.
At least yell loudly at the user that things must get fixed.
IMO a message (which you already see appearing?) at the right place,
such as "try irqpoll, try xyz param", is enough.

The current behavior is not that bad and not that many machines
(at least new machines) are affected, but as said I am not
deeply involved in PCI/IRQ things.

      Thomas


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-29 13:11                 ` Thomas Renninger
@ 2008-08-29 13:18                   ` Tejun Heo
  2008-08-29 13:31                     ` Thomas Renninger
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2008-08-29 13:18 UTC
  To: Thomas Renninger; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

Thomas Renninger wrote:
>>> There were several acpipnp problems recently, but this is another topic
>>> and that needs fixing anyway, Bjorn is doing a real good job here.
>> Hmm... Maybe what's necessary is to detect IRQ misrouting and turn on
>> irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
>> cared' cases too.
> But you risk that things never get fixed correctly.
> At least yell loudly at the user that things must get fixed.
> IMO a message (you already see appearing?) at the right place:
> try irqpoll, try xyz param is enough.

Yeah, the kernel should scream like hell but keep working after screaming.

> The current behavior is not that bad and not that many machines
> (at least new machines) are affected, but as said I am not
> deeply involved in PCI/IRQ things.

The thing is IRQ storms occasionally happen on otherwise working
machines taking down the IRQ and all the devices running off the IRQ, so
it's not as cut and dry as boot or no boot.
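
As a rough aid for spotting such a storm from user space, here is a
hedged sketch (Python, not part of the original mail) that compares two
snapshots of /proc/interrupts taken some seconds apart and reports the
lines whose interrupt rate exceeds a chosen limit. The function names
and the rate limit are hypothetical, and the parser assumes the usual
layout of a CPU header row followed by "label: count count ... text"
rows.

```python
def parse_interrupts(text):
    """Map each IRQ label to its total count summed over all CPUs."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the description columns
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def suspect_irqs(before, after, seconds, min_rate):
    """Return (label, interrupts per second) pairs at or above min_rate."""
    rates = []
    for label, total in after.items():
        rate = (total - before.get(label, 0)) / seconds
        if rate >= min_rate:
            rates.append((label, rate))
    # Busiest lines first.
    return sorted(rates, key=lambda pair: -pair[1])
```

A high rate alone does not prove a storm (a busy disk or network card
can legitimately interrupt often), so the output is only a hint about
which shared line to look at first.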

Thanks.

-- 
tejun


* Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
  2008-08-29 13:18                   ` Tejun Heo
@ 2008-08-29 13:31                     ` Thomas Renninger
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Renninger @ 2008-08-29 13:31 UTC
  To: Tejun Heo; +Cc: Linda Walsh, linux-ide, Alan Cox, linux acpi

On Friday 29 August 2008 15:18:08 Tejun Heo wrote:
> Thomas Renninger wrote:
> >>> There were several acpipnp problems recently, but this is another topic
> >>> and that needs fixing anyway, Bjorn is doing a real good job here.
> >>
> >> Hmm... Maybe what's necessary is to detect IRQ misrouting and turn on
> >> irqpoll on the specific IRQ (or IRQ handler), which would help 'nobody
> >> cared' cases too.
> >
> > But you risk that things never get fixed correctly.
> > At least yell loudly at the user that things must get fixed.
> > IMO a message (you already see appearing?) at the right place:
> > try irqpoll, try xyz param is enough.
>
> Yeah, the kernel should scream like hell but keep working after screaming.
>
> > The current behavior is not that bad and not that many machines
> > (at least new machines) are affected, but as said I am not
> > deeply involved in PCI/IRQ things.
>
> The thing is IRQ storms occasionally happen on otherwise working
> machines taking down the IRQ and all the devices running off the IRQ, so
> it's not as cut and dry as boot or no boot.
AFAIK legacy IRQs can be routed somewhere else (below 16 while the APIC IRQ is 
somewhere above), thus the IRQ may happen twice. Sounds a bit like what you 
explained above.
At least I remember such a very specific problem from the
Real Time people.
(could possibly be switched off by very chipset specific quirks)

Anyway, this starts to get off topic and I am really the wrong one
to answer such questions, others probably know much more about this
than I do.

   Thomas

