linux-ide.vger.kernel.org archive mirror
From: Tejun Heo <tj@kernel.org>
To: Linda Walsh <lkml@tlinx.org>
Cc: linux-ide@vger.kernel.org, Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Thomas Renninger <trenn@suse.de>,
	linux acpi <linux-acpi@vger.kernel.org>
Subject: Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
Date: Thu, 28 Aug 2008 09:03:15 +0200
Message-ID: <48B64DB3.6060906@kernel.org>
In-Reply-To: <48B60373.2050501@tlinx.org>

(cc'ing Thomas and linux-acpi for ACPI reference)

Linda Walsh wrote:
> Tejun Heo wrote:
>> Alan Cox wrote:
>>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
>>>> (ATA bus error)
>>>> rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
>>>> (Status 0xff)
>>> First guess would be a dud drive but it could be power or cabling or
>>> firmware or ...
>> Hmm... this could be either the drive or the controller. 
> ----
> 
>    Just to confirm -- this particular problem was due to a faulty
> brand-new SATA Western_Digital drive that died.  It hung the system
> several times under load, but shortly after the above errors,
> the system would not boot with that drive attached.
> 
>    Secondary error:     My ACPI implementation is, /apparently/, flakey.
> I used to not be able to use acpi back in the 2.2 timeframe.  But
> sometime in the 2.4 timeframe, ACPI started working with this system
> (a 440BX based motherboard).  I thought ACPI support had improved.
> Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
> 48 hours max).  But after I thought ACPI was 'fixed', booting with ACPI
> (or not) resulted in stable system.
> 
>    But -- two different error types.  Starting with the 2.6.25 series,
> I started observing hangs again (same in the 2.6.26 series).  My last
> stable was 2.6.24.1.  BUT -- I also occasionally noticed some rare
> sporadic disk error messages (while looking for the cause of the hang) --
> they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
> even get a 2.6.24.7 kernel to stay up for more than 2 days).
> 
>    My upgrade strategy for disks has been to move to SATA disks as
> I needed to replace older PATA's.  Had a lot of problems last Feb when
> I tried to use SATA; after a few weeks of making no progress discovering
> the source of the hangs, I went back to a PATA drive and took out the SATA
> controller -- and system went back to stable.  Ok...I'm tired of
> debugging this...let's stay with PATA for now.
> 
>    Six months later...need another disk.  Back to trying SATA...
> more hangs (and a bad disk drive).  It seems that in addition to
> ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
> board also would cause an ACPI based boot to eventually hang (max
> runtime ~30 hours).  Using the kernel load option "acpi=noirq" seems to be
> the key to stability now.
> 
>    So I don't know exactly what changed -- but ACPI, which was working
> (pre-SATA) seemed to stop being reliable after 2.6.24.1.
>    Any way I cut it,  acpi=noirq   now seems to be a requirement for
> system stability.  My ACPI version string shows it as "1.0"...so I'm
> guessing there might have been some kinks in the implementation.
> 
>    So had 4 different problems all converge at roughly the same time:
> 1)  new SATA Western_Digital-1TB disk failure,
> 2)  ACPI-induced instability in 2.6.25 and above
> 3)  ACPI induced instability with addition of new SATA controller
>    (including a rebuilt-for-sata-support 2.6.24.1).
> 4)  Auxiliary cooling fan failed and system would get 'warm' (don't know
>    exact temps, but some disks were nearing 50C; normal is mid 30's,
>    except for the 15K system SCSI.  It has its own attached fan, so
>    it's usually a few degrees cooler when the case-fans are operating
>    correctly).
>    However, the disk temps are not indicative of the CPU temps -- they
>    are only an indirect sign that case-airflow is sub-optimal.  The
>    CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
>    warnings (have only ever seen 1).  Usually the system will
>    just 'hang' (not the most helpful indicator in any event).
> 
> Thanks much for feedback that led me to figuring out (*crossing
> fingers*) the problems and fixes...

-- 
tejun
