Re: raid/device failure - Thomas Fjellstrom

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Thomas Fjellstrom <thomas@fjellstrom.ca>
To: Phil Turmel <philip@turmel.org>
Cc: linux-raid@vger.kernel.org
Subject: Re: raid/device failure
Date: Sun, 10 Feb 2013 19:55:18 -0700	[thread overview]
Message-ID: <201302101955.19121.thomas@fjellstrom.ca> (raw)
In-Reply-To: <511852EF.9070201@turmel.org>

On February 10, 2013, Phil Turmel wrote:
> On 02/10/2013 08:27 PM, Thomas Fjellstrom wrote:
> > I've re-configured my NAS box (still haven't put it into "production") to
> > be a raid5 over 7 2TB consumer seagate barracuda drives, and with some
> > tweaking, performance was looking stellar.
> 
> > Unfortunately I started seeing some messages in dmesg that worried me:
> [trim /]
> 
> The MD subsystem keeps a count of read errors on each device, corrected
> or not, and kicks the drive out when the count reaches twenty (20).
> Every hour, the accumulated count is cut in half to allow for general
> URE "maintenenance" in regular scrubs.  This behavior and the count are
> hardcoded in the kernel source.
> 

Interesting. Thats good to know.

> > I've run full S.M.A.R.T. tests (except the conveyance test, probably run
> > that tonight and see what happens) on all drives in the array, and there
> > are no obvious warnings or errors in the S.M.A.R.T. restults at all.
> > Including reallocated (pending or not) sectors.
> 
> MD fixed most of these errors, so I wouldn't expect to see them in SMART
> unless the fix triggered a relocation.  But some weren't corrected--so I
> would be concerned that MD and SMART don't agree.

That is what I was wondering. I tought an uncorrected read error meant it 
wrote the data back out, and then a read of that data again was wrong.

> Have these drives ever been scrubbed?  (I vaguely recall you mentioning
> new drives...)  If they are new and already had a URE, I'd be concerned
> about mishandling during shipping.  If they aren't new, I'd
> destructively exercise them and retest.

They are new in that they haven't been used very much at all yet, and I 
haven't done a full scrub over every sector. I have run some lenghy tests 
using iozone over 32GB or more space (individually, and as part of a raid6), 
but as a bunch of parameters have changed from my last setup (raid5 vs raid6, 
xfs inode32 vs inode64), and xfs/md may or may not have alloated the test 
files from different areas of the device, so I can't be sure that the same 
general area of the disks were being accessed.

I did think that a full destructive write test may be in order, just to make 
sure. I've seen a drive throw errors at me, refuse to reallocate a sector 
untill it was written over manually, and then work fine afterwards.

> > I've seen references while searching for possible causes, where people
> > had this error occur with faulty cables, or SAS backplanes. Is this a
> > likely senario? The cables are brand new, but anything is possible.
> > 
> > The card is a IBM M1015 8 port HBA flashed with the LSI 9211-8i IT
> > firmware, and no BIOS.
> 
> It might not hurt to recheck your power supply rating vs. load.  If you
> can't find anything else, a data-logging voltmeter with min/max capture
> would be my tool of choice.
> 
> http://www.fluke.com/fluke/usen/digital-multimeters/fluke-287.htm?PID=56058

The PSU is overspeced if anything. But that doesn't mean it's not faulty in 
some way. It's a Seasonic G series 450W 80+ gold PSU. The system at full load 
should come in at just over half of that (core i3 2120, intel s1200kp m-itx 
board, hba, 7 hdds, 2 ssds, 2 x 8GB ddr3 1333mhz ECC ram).

I have an Agilent U1253B ( http://goo.gl/kl1aC ) which should be adequate to 
test with.

The NAS is on a 1000VA (600W?) UPS, so incomming power should be decently 
clean and even (assuming the UPS isn't bad).

> Phil
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

next prev parent reply	other threads:[~2013-02-11  2:55 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-11  1:27 raid/device failure Thomas Fjellstrom
2013-02-11  2:09 ` Phil Turmel
2013-02-11  2:52   ` EJ Vincent
2013-02-11  3:44     ` Phil Turmel
2013-02-11 20:28       ` EJ Vincent
2013-02-11  2:55   ` Thomas Fjellstrom [this message]
2013-02-11  3:22 ` Brad Campbell
2013-02-11  7:55   ` Thomas Fjellstrom
2013-02-11  8:29 ` Roy Sigurd Karlsbakk
2013-02-11  9:13   ` Thomas Fjellstrom
2013-02-12 22:31 ` Thomas Fjellstrom

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201302101955.19121.thomas@fjellstrom.ca \
    --to=thomas@fjellstrom.ca \
    --cc=linux-raid@vger.kernel.org \
    --cc=philip@turmel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.