From: "Håkon Løvdal" <hlovdal@gmail.com>
To: Robert Hancock <hancockrwd@gmail.com>
Cc: linux-ide@vger.kernel.org
Subject: Re: "raid5:md0: read error not correctable (sector 795463080 on sdf1)" error on controller with SIL 3114
Date: Wed, 17 Feb 2010 03:42:45 +0100
Message-ID: <a01a16b51002161842h3c2da6dbnbeb52ff1b6802bf@mail.gmail.com>
In-Reply-To: <4B70EEFD.1040603@gmail.com>

On 9 February 2010 06:13, Robert Hancock <hancockrwd@gmail.com> wrote:
> On 02/08/2010 05:11 AM, Håkon Løvdal wrote:
>> ---BEGIN log-4---
>> Feb  6 07:09:57 localhost kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> Feb  6 07:09:57 localhost kernel: ata8.00: BMDMA2 stat 0x6c0009
>> Feb  6 07:09:57 localhost kernel: ata8.00: cmd 25/00:80:cf:cd:69/00:00:2f:00:00/e0 tag 0 dma 65536 in
>> Feb  6 07:09:57 localhost kernel:         res 51/40:00:e4:cd:69/00:00:2f:00:00/e0 Emask 0x9 (media error)
>> Feb  6 07:09:57 localhost kernel: ata8.00: status: { DRDY ERR }
>> Feb  6 07:09:57 localhost kernel: ata8.00: error: { UNC }
>
> That's fairly definitive, uncorrected read error reported by the drive. You
> might want to check its SMART status. Could be a bad drive, or potentially
> other causes like excessive vibration, high temperature, power issues..
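
For reference, the failing LBA can be read straight out of the cmd/res
lines in the quoted log. A minimal sketch, assuming the usual libata layout
of those lines (command or status, then feature or error, sector count and
LBA low/mid/high bytes, then the corresponding high-order bytes, then the
device register) and a 48-bit command; the function name decode_taskfile is
made up for illustration:

```python
def decode_taskfile(fields: str) -> dict:
    """Decode the hex register dump from a libata 'cmd' or 'res' line.

    Assumes a 48-bit (LBA48) command such as 0x25, READ DMA EXT.
    """
    first, low, high, device = fields.split("/")
    # low  = feature-or-error : nsect : lbal : lbam : lbah
    # high = the high-order (HOB) bytes, in the same order
    _, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "command_or_status": int(first, 16),
        "sectors": hob_nsect << 8 | nsect,
        "lba": lba,
        "device": int(device, 16),
    }


cmd = decode_taskfile("25/00:80:cf:cd:69/00:00:2f:00:00/e0")
res = decode_taskfile("51/40:00:e4:cd:69/00:00:2f:00:00/e0")
print(cmd["lba"], cmd["sectors"], res["lba"], res["lba"] - cmd["lba"])
```

With these two strings the command is a 128-sector (65536-byte) read
starting at LBA 795463119, and the drive reports the error at LBA
795463140, 21 sectors into the request. These are LBAs on the whole disk,
not offsets within a partition.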

All of sdb, sdc, sdd, sde, sdf and sdg have had a normalized value of
100 for the whole lifetime of the disk (I have a cron job that captures
the output from smartctl nightly for reference, and I have now checked
those files) for all the critical attributes listed at
http://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes
  1 Raw_Read_Error_Rate
  5 Reallocated_Sector_Ct
 10 Spin_Retry_Count
184 Unknown_Attribute
188 Unknown_Attribute
196 Reallocated_Event_Count
197 Current_Pending_Sector
198 Offline_Uncorrectable
201 Soft_Read_Error_Rate
except for Soft_Read_Error_Rate, which switches between 100 and 253.
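
A check like this over the nightly captures can be scripted. A minimal
sketch, assuming each file holds `smartctl -A` output in its usual
ten-column attribute table; the SAMPLE text and the function names are made
up for illustration and are not real data from these disks:

```python
# Attribute IDs from the list above.
WATCHED = {1, 5, 10, 184, 188, 196, 197, 198, 201}

# Hypothetical excerpt of `smartctl -A` output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {id: (name, normalized_value)} for each attribute row."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have ten columns;
        # the header row starts with "ID#" and is skipped.
        if len(parts) >= 10 and parts[0].isdigit() and parts[3].isdigit():
            rows[int(parts[0])] = (parts[1], int(parts[3]))
    return rows


def off_baseline(text, watched=WATCHED, baseline=100):
    """Return the watched attributes whose normalized value is not baseline."""
    return {
        attr_id: row
        for attr_id, row in parse_attributes(text).items()
        if attr_id in watched and row[1] != baseline
    }


print(off_baseline(SAMPLE))
```

Run over every saved file, an empty result for a disk means none of the
watched attributes ever left 100 in that capture.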

The disks are now placed in an Image Shapetek EYE-981SC tower[1] with plenty
of space, mounted in 5.25" bays with rubber hard disk stabilizers[2] to
reduce vibration. There is therefore good airflow around all the disks, and
I keep one side of the tower case open, so temperature should not be a
problem (any longer).

In the previous case space was tighter. I see that last summer
hde and hdf had temperatures of around 45-55°C in June/July, which does not
sound too good[3]. They are still part of the raid, whereas hdc, which has
an excellent temperature profile of 35-45°C, and hdd (28-38°C) are the two
disks currently being kicked out of the raid.

There might be some issues with the PSU[4] (I am waiting for a new one). I
doubt there is any problem with the mains electricity, because the quality
is generally quite good here in Norway, and besides the machine is behind
a UPS.

smartctl -l selftest /dev/sde gives
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       90%       795         1465145815
    # 2  Conveyance offline  Completed: read failure       90%       794         1465145815
    # 3  Offline             Completed: read failure       00%       790         1465145815
    # 4  Short offline       Completed: read failure       20%       787         1465145815
None of the other disks report any selftest failures.
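
The same nightly files could also be scanned for failed self-tests. A
minimal sketch, assuming each log entry sits on one line as smartctl prints
it (not wrapped by a mail client); the regular expression and the function
name are my own guess at a workable parser, not anything smartctl provides:

```python
import re

# One self-test log entry: "# N  description  status  NN%  hours  LBA".
# Description and status are assumed to be separated by two or more spaces.
ENTRY = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)$"
)


def failed_selftests(text):
    """Return (num, test, lifetime_hours, lba) for entries reporting a failure."""
    failures = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m and "failure" in m.group("status"):
            failures.append(
                (int(m.group("num")), m.group("test"),
                 int(m.group("hours")), m.group("lba"))
            )
    return failures
```

Feeding it the sde log above should yield four entries, all with the same
LBA_of_first_error.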

So sde and sdf show some sign of trouble (temperature, selftest and ata8.00
exception above), but they are not kicked out of the raid. On the other hand
sdc and sdd are both kicked out and I cannot see any obvious signs of hardware
trouble here. Any suggestions?


BR Håkon Løvdal

[1]
http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.hardware.no%2Fartikler%2Fi_s_981_servertower%2F46558%2Futskrift&sl=no&tl=en

[2]
http://www.scythe-eu.com/en/products/pc-accessory/hard-disk-stabilizer-2.html

[3]
http://en.wikibooks.org/wiki/Minimizing_hard_disk_drive_failure_and_data_loss#Temperature_control

[4]
350W, Point of view, VP-3504
