Re: raid/device failure

From: Phil Turmel <philip@turmel.org>
To: thomas@fjellstrom.ca
Cc: linux-raid@vger.kernel.org
Subject: Re: raid/device failure
Date: Sun, 10 Feb 2013 21:09:51 -0500
Message-ID: <511852EF.9070201@turmel.org>
In-Reply-To: <201302101827.36116.thomas@fjellstrom.ca>

On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> I've re-configured my NAS box (still haven't put it into "production") to be a 
> raid5 over 7 2TB consumer seagate barracuda drives, and with some tweaking, 
> performance was looking stellar.
> 
> Unfortunately I started seeing some messages in dmesg that worried me:

[trim /]

The MD subsystem keeps a count of read errors on each device, corrected
or not, and kicks the drive out when the count reaches twenty (20).
Every hour, the accumulated count is cut in half to allow for general
URE "maintenance" in regular scrubs.  This behavior and the count are
hardcoded in the kernel source.
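
To make the arithmetic concrete, here is a rough Python model of the
counter as described above.  It is not the kernel code: the function
name, the per-hour granularity, and the order of decay vs. accumulation
within an hour are assumptions made for illustration only.

```python
def hours_until_kicked(errors_per_hour, limit=20):
    """Toy model of the read-error counter described in this message.

    errors_per_hour: read errors seen in each successive hour.
    limit: count at which the drive is kicked (20, per the text above).
    Returns the zero-based hour in which the limit is reached, or None.
    """
    count = 0
    for hour, new_errors in enumerate(errors_per_hour):
        count //= 2            # hourly decay: halve the accumulated count
        count += new_errors    # errors seen this hour, corrected or not
        if count >= limit:
            return hour        # the drive would be kicked here
    return None                # the limit was never reached


if __name__ == "__main__":
    print(hours_until_kicked([10] * 10))
    print(hours_until_kicked([11] * 10))
```

In this model a steady 10 errors per hour settles at a count of 19 and
never trips the limit, while 11 per hour reaches 20 in the fourth hour.
A burst of 20 or more within one hour trips it immediately.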

> I've run full S.M.A.R.T. tests (except the conveyance test, probably run that 
> tonight and see what happens) on all drives in the array, and there are no 
> obvious warnings or errors in the S.M.A.R.T. results at all. Including 
> reallocated (pending or not) sectors.

MD fixed most of these errors, so I wouldn't expect to see them in SMART
unless the fix triggered a relocation.  But some weren't corrected--so I
would be concerned that MD and SMART don't agree.

Have these drives ever been scrubbed?  (I vaguely recall you mentioning
new drives...)  If they are new and already had a URE, I'd be concerned
about mishandling during shipping.  If they aren't new, I'd
destructively exercise them and retest.
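
For reference, a scrub can be requested by writing "check" to the
array's sync_action file in sysfs.  A minimal Python sketch, assuming
the array is named md0 (a placeholder, substitute your own) and that it
runs as root; the helper names and the sysfs_root parameter are mine,
added only so the functions can be tried against a scratch directory.

```python
from pathlib import Path


def _md_dir(md_name, sysfs_root):
    # e.g. /sys/block/md0/md
    return Path(sysfs_root) / md_name / "md"


def start_scrub(md_name="md0", sysfs_root="/sys/block"):
    """Ask MD to read-check the whole array (a "check" scrub)."""
    (_md_dir(md_name, sysfs_root) / "sync_action").write_text("check\n")


def current_action(md_name="md0", sysfs_root="/sys/block"):
    """Return the current sync action, e.g. "idle" or "check"."""
    return (_md_dir(md_name, sysfs_root) / "sync_action").read_text().strip()
```

Progress of a running check shows up in /proc/mdstat.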

> I've seen references while searching for possible causes, where people had 
> this error occur with faulty cables, or SAS backplanes. Is this a likely 
> scenario? The cables are brand new, but anything is possible.
> 
> The card is an IBM M1015 8 port HBA flashed with the LSI 9211-8i IT firmware, 
> and no BIOS.

It might not hurt to recheck your power supply rating vs. load.  If you
can't find anything else, a data-logging voltmeter with min/max capture
would be my tool of choice.

http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058

Phil
