linux-ide.vger.kernel.org archive mirror
From: Linda Walsh <lkml@tlinx.org>
To: linux-ide@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>, Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
Date: Wed, 27 Aug 2008 18:46:27 -0700
Message-ID: <48B60373.2050501@tlinx.org>
In-Reply-To: <48ABCA18.6060800@kernel.org>

Tejun Heo wrote:
> Alan Cox wrote:
>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12 
>>> (ATA bus error)
>>> rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient 
>>> (Status 0xff)
>> First guess would be a dud drive but it could be power or cabling or
>> firmware or ...
> Hmm... this could be either the drive or the controller. 
----

    Just to confirm -- this particular problem was due to a faulty
brand-new SATA Western_Digital drive that died.  It hung the system
several times under load, but shortly after the above errors,
the system would not boot with that drive attached.
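
For anyone scanning their own logs for the same signature, here is a
minimal sketch that pulls the port name and flag list out of an SError
line.  The pattern is an assumption inferred only from the single line
quoted above, not an official libata log format, and the names
SERROR_RE and parse_serror are hypothetical.

```python
import re

# Pattern inferred from the one quoted log line (an assumption), e.g.:
#   ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
SERROR_RE = re.compile(r"(ata\d+(?:\.\d+)?): SError: \{([^}]*)\}")


def parse_serror(line):
    """Return (port, [flags]) for an SError log line, or None if no match."""
    match = SERROR_RE.search(line)
    if match is None:
        return None
    # The flags are whitespace-separated inside the braces.
    return match.group(1), match.group(2).split()


# The (truncated) line exactly as it appears in the quoted text above.
sample = "rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }"
print(parse_serror(sample))
```

Flags such as PHYRdyChg and DevExch showing up together only say that
the link dropped and came back; they don't tell you whether the drive,
cable, power or controller is at fault.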

    Secondary error: my ACPI implementation is, /apparently/, flaky.
I could not use ACPI back in the 2.2 timeframe.  But sometime in the
2.4 timeframe, ACPI started working with this system (a 440BX based
motherboard).  I thought ACPI support had improved.  Symptom of an
ACPI based boot vs. a non-ACPI one: a random hang (after a few hours up
to maybe 48 hours max).  But once I thought ACPI was 'fixed', booting
with ACPI (or without) resulted in a stable system.

    But -- two different error types.  Starting with the 2.6.25 series,
I started observing hangs again (same in the 2.6.26 series).  My last
stable kernel was 2.6.24.1.  BUT -- I also occasionally noticed some rare
sporadic disk error messages (while looking for the cause of the hang) --
they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
even get a 2.6.24.7 kernel to stay up for more than 2 days).

    My upgrade strategy for disks has been to move to SATA disks as
I needed to replace older PATA drives.  Had a lot of problems last Feb
when I tried to use SATA; after a few weeks of making no progress
discovering the source of the hangs, I went back to a PATA drive and took
out the SATA controller -- and the system went back to stable.  Ok...I'm
tired of debugging this...let's stay with PATA for now.

    Six months later...need another disk.  Back to trying SATA...
more hangs (and a bad disk drive).  It seems that in addition to
ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
board also would cause an ACPI based boot to eventually hang (max
runtime ~30 hours).  Using the kernel boot option "acpi=noirq" seems to be
the key to stability now.
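
To double-check that the option actually took effect on a running
system, a small sketch like the following can help.  It assumes a Linux
system that exposes the boot command line at /proc/cmdline; the helper
names has_kernel_param and running_cmdline are hypothetical.

```python
def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole whitespace-separated token."""
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    try:
        cmdline = running_cmdline()
    except OSError:
        # Assumption: /proc/cmdline is only available on Linux with procfs.
        print("could not read /proc/cmdline")
    else:
        if has_kernel_param(cmdline, "acpi=noirq"):
            print("acpi=noirq is active")
        else:
            print("acpi=noirq is NOT on the kernel command line")
```

Matching whole tokens rather than substrings avoids confusing
acpi=noirq with a different option that merely contains that text.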

    So I don't know exactly what changed -- but ACPI, which was working
(pre-SATA) seemed to stop being reliable after 2.6.24.1. 

    Any way I cut it, acpi=noirq now seems to be a requirement for
system stability.  My ACPI version string shows it as "1.0"...so I'm
guessing there might have been some kinks in the implementation.

    So I had 4 different problems all converge at roughly the same time:
1)  new SATA Western_Digital-1TB disk failure,
2)  ACPI-induced instability in 2.6.25 and above,
3)  ACPI-induced instability with the addition of the new SATA controller
    (including a rebuilt-for-SATA-support 2.6.24.1),
4)  Auxiliary cooling fan failed and the system would get 'warm' (I don't
    know exact temps, but some disks were nearing 50C; normal is mid 30's,
    except for the 15K system SCSI disk, which has its own attached fan, so
    it's usually a few degrees cooler when the case fans are operating
    correctly).
    However, the disk temps are not indicative of the CPU temps -- they
    are only an indirect sign that case airflow is sub-optimal.  The
    CPUs (2 1GHz P-III's) in this baby don't give reliable thermal
    warnings (I have only ever seen 1).  Usually the system will
    just 'hang' (not the most helpful indicator in any event).

Thanks much for feedback that led me to figuring out (*crossing
fingers*) the problems and fixes...

Linda Walsh

