From: Tejun Heo <tj@kernel.org>
To: Linda Walsh <lkml@tlinx.org>
Cc: linux-ide@vger.kernel.org, Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Thomas Renninger <trenn@suse.de>,
	linux acpi <linux-acpi@vger.kernel.org>
Subject: Re: Promise 300-TX 4-channel SATA disk going dead under load 2.6.24-7
Date: Thu, 28 Aug 2008 09:03:15 +0200
Message-ID: <48B64DB3.6060906@kernel.org>
In-Reply-To: <48B60373.2050501@tlinx.org>

(cc'ing Thomas and linux-acpi for ACPI reference)

Linda Walsh wrote:
> Tejun Heo wrote:
>> Alan Cox wrote:
>>>> 13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
>>>> (ATA bus error)
>>>> rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
>>>> 13 10:14:37 kern: ata4: port is slow to respond, please be patient
>>>> (Status 0xff)
>>> First guess would be a dud drive but it could be power or cabling or
>>> firmware or ...
>> Hmm... this could be either the drive or the controller. 
> ----
> 
>    Just to confirm -- this particular problem was due to a faulty
> brand-new SATA Western_Digital drive that died.  It hung the system
> several times under load, but shortly after the above errors,
> the system would not boot with that drive attached.
> 
>    Secondary error:     My ACPI implementation is, /apparently/, flaky.
> I used to not be able to use acpi back in the 2.2 timeframe.  But
> sometime in the 2.4 timeframe, ACPI started working with this system
> (a 440BX based motherboard).  I thought ACPI support had improved.
> Symptom of ACPI based boot vs. non: random hang (a few hours up to maybe
> 48 hours max).  But after I thought ACPI was 'fixed', booting with ACPI
> (or not) resulted in stable system.
> 
>    But -- two different error types.  Starting with the 2.6.25 series,
> I started observing hangs again (same in the 2.6.26 series).  My last
> stable was 2.6.24.1.  BUT -- I also occasionally noticed some rare
> sporadic disk error messages (while looking for the cause of the hang) --
> they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
> even get a 2.6.24.7 kernel to stay up for more than 2 days).
> 
>    My upgrade strategy for disks has been to move to SATA disks as
> I needed to replace older PATA's.  Had a lot of problems last Feb when
> I tried to use SATA; after a few weeks of making no progress discovering
> the source of the hangs, I went back to a PATA drive and took out the SATA
> controller -- and system went back to stable.  Ok...I'm tired of
> debugging this...let's stay with PATA for now.
> 
>    Six months later...need another disk.  Back to trying SATA...
> more hangs (and a bad disk drive).  It seems that in addition to
> ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
> board also would cause an ACPI based boot to eventually hang (max
> runtime ~30 hours).  Using the kernel boot option "acpi=noirq" seems to be
> the key to stability now.
> 
>    So I don't know exactly what changed -- but ACPI, which was working
> (pre-SATA) seemed to stop being reliable after 2.6.24.1.
>    Any way I cut it,  acpi=noirq   now seems to be a requirement for
> system stability.  My ACPI version string shows it as "1.0"...so I'm
> guessing there might have been some kinks in the implementation.
> 
>    So had 4 different problems all converge at roughly the same time:
> 1)  new SATA Western_Digital-1TB disk failure,
> 2)  ACPI-induced instability in 2.6.25 and above
> 3)  ACPI-induced instability with addition of new SATA controller
>    (including a rebuilt-for-sata-support 2.6.24.1).
> 4)  Auxiliary cooling fan failed and system would get 'warm' (don't know
>    exact temps, but some disks were nearing 50C; normal is mid 30's,
>    except for the 15K system SCSI, which has its own attached fan, so
>    it's usually a few degrees cooler when the case-fans are operating
>    correctly).
>    However, the disk temps are not indicative of the CPU temps -- they
>    are only an indirect sign that case-airflow is sub-optimal.  The
>    CPU's (2 1GHz P-III's) in this baby don't give reliable thermal
>    warnings (have only ever seen 1).  Usually the system will
>    just 'hang' (not the most helpful indicator in any event).
> 
> Thanks much for feedback that led me to figuring out (*crossing
> fingers*) the problems and fixes...

-- 
tejun
