Re: Recent drive errors

From: Phil Turmel <philip@turmel.org>
To: thomas@fjellstrom.ca
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Recent drive errors
Date: Tue, 19 May 2015 09:23:20 -0400
Message-ID: <555B3948.1030602@turmel.org>
In-Reply-To: <84264713.v03zHsT0Cj@balsa>

On 05/19/2015 08:50 AM, Thomas Fjellstrom wrote:
> On Tue 19 May 2015 08:34:55 AM Phil Turmel wrote:

>> Based on the smart report, this drive is perfectly healthy.  A small
>> number of uncorrectable read errors is normal in the life of any drive.
> 
> Is it perfectly normal for the same sector to be reported uncorrectable 5 
> times in a row like it did?

Yes, if you keep trying to read it.  Unreadable sectors stay unreadable,
generally, until they are re-written.  That's the first opportunity the
drive has to decide if a relocation is necessary.

> How many UREs are considered "ok"? Tens, hundreds, thousands, tens of 
> thousands?

Depends.  In a properly functioning array that gets scrubbed
occasionally, or that sees heavy enough use to read the entire contents
occasionally, MD rewrites the affected sectors right away.  Any UREs
then only show up once.

In a desktop environment, or non-raid, or improperly configured raid,
the UREs will build up, and get reported on every read attempt.

Most consumer-grade drives claim a URE average below 1 per 1E14 bits
read.  So by the end of their warranty period, getting one for roughly
every 12TB read (1E14 bits is 12.5TB) wouldn't be unusual.  This sort
of thing follows a Poisson distribution:

http://marc.info/?l=linux-raid&m=135863964624202&w=2
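
A minimal sketch of that arithmetic, assuming the spec-sheet figure of
1 URE per 1E14 bits is taken as the actual average rate (drives only
claim to be *below* it) and that errors are independent.  The function
names are illustrative, not from any library:

```python
import math

# Assumption: treat the spec-sheet ceiling as the real average rate.
URE_RATE_PER_BIT = 1e-14  # 1 URE per 1E14 bits read


def expected_ures(bytes_read, rate_per_bit=URE_RATE_PER_BIT):
    """Poisson mean: expected number of UREs for a given volume read."""
    return bytes_read * 8 * rate_per_bit


def prob_exactly(k, bytes_read, rate_per_bit=URE_RATE_PER_BIT):
    """Poisson probability of seeing exactly k UREs."""
    lam = expected_ures(bytes_read, rate_per_bit)
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_at_least_one(bytes_read, rate_per_bit=URE_RATE_PER_BIT):
    """Probability of hitting one or more UREs."""
    return 1.0 - prob_exactly(0, bytes_read, rate_per_bit)


if __name__ == "__main__":
    TB = 1e12  # decimal terabyte, as drive vendors count
    for size_tb in (3, 12.5):
        p = prob_at_least_one(size_tb * TB)
        print(f"{size_tb} TB read: P(at least one URE) = {p:.3f}")
```

Under those assumptions, reading 12.5TB gives an expected count of one
URE and about a 63% chance of seeing at least one; a single full read
of a 3TB drive gives about 21%.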

> These drives have been barely used. Most of their life, they were either off, 
> or not actually being used. (it took a while to collect enough 3TB drives, and 
> then find time to build the array, and set it up as a regular backup of my 
> 11TB nas).

While being off may lengthen their life somewhat, the magnetic domains
on these things are so small that some degradation will happen just
sitting there.  Diffusion in the p- and n-doped regions of the
semiconductors is also happening while sitting unused, degrading the
electronics.

>>  It has no relocations, and no pending sectors.  The latency spikes are
>> likely due to slow degradation of some sectors that the drive is having
>> to internally retry to read successfully.  Again, normal.
> 
> The latency spikes are /very/ regular and there's quite a lot of them.
> See: http://i.imgur.com/QjTl6o3.png

Interesting.  I suspect that if you wipe that disk with noise, read it
all back, and wipe it again, you'll have a handful of relocations.

Your latency test will show different numbers then, as the head will
have to seek to the spare sector and back whenever you read through one
of those spots.

Or the rewrites will fix them all, and you'll have no further problems.
Hard to tell.  Bottom line is that drives can't fix any problems they
have unless they are *written* in previously identified problem areas.

>> I own some "DM001" drives -- they are unsuited to raid duty as they
>> don't support ERC.  So, out of the box, they are time bombs for any
>> array you put them in.  That's almost certainly why they were ejected
>> from your array.
>>
>> If you absolutely must use them, you *must* set the *driver* timeout to
>> 120 seconds or more.
> 
> I've been planning on looking into the ERC stuff. I now actually have some 
> drives that do support ERC, so it'll be interesting to make sure everything is 
> set up properly.

You have it backwards.  If you have WD Reds, they are correct out of the
box.  It's when you *don't* have ERC support, or you only have desktop
ERC, that you need to take special action.

If you have consumer grade drives in a raid array, and you don't have
boot scripts or udev rules to deal with timeout mismatch, your *ss is
hanging in the wind.  The links in my last msg should help you out.
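
As a rough sketch of what such a boot script could do, the following
raises the kernel's per-device command timeout to the 120 seconds
mentioned above by writing to the sysfs "timeout" attribute.  The
function name and the sysfs_root parameter are illustrative; the
parameter only exists so the code can be tried against a scratch
directory.  It assumes the disk shows up under /sys/block (for example
"sda"), and it needs root on a real system:

```python
import os


def set_driver_timeout(device, seconds=120, sysfs_root="/sys/block"):
    """Set the kernel command timeout (in seconds) for one disk.

    device is the kernel name without /dev/, e.g. "sda".
    Returns the sysfs path that was written.
    """
    path = os.path.join(sysfs_root, device, "device", "timeout")
    with open(path, "w") as attr:
        attr.write(str(seconds))
    return path


if __name__ == "__main__":
    # Hypothetical member list; substitute the non-ERC drives in your array.
    for dev in ("sda", "sdb"):
        try:
            print("set", set_driver_timeout(dev))
        except OSError as err:
            print(f"{dev}: could not set timeout ({err})")
```

The setting does not survive a reboot, which is why it belongs in a
boot script or udev rule rather than being run once by hand.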

Also, I noticed that you used "smartctl -a" to post a complete report of
your drive's status.  It's not complete.  You should get in the habit of
using "smartctl -x" instead, so you see the ERC status, too.
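
A small sketch for pulling the ERC status out of a saved "smartctl -x"
report.  It assumes the report contains a section headed "SCT Error
Recovery Control:" followed by "Read:" and "Write:" lines giving either
a value in tenths of a second or "Disabled", or else a line saying the
command is not supported.  That layout is an assumption about smartctl's
text output and may differ between versions; parse_erc is a
hypothetical helper name:

```python
import re

# Assumed layout of the ERC section in "smartctl -x" text output.
_ERC_SECTION = re.compile(
    r"SCT Error Recovery Control:\s*\n"
    r"\s*Read:\s*(\S+).*\n"
    r"\s*Write:\s*(\S+)"
)


def parse_erc(report):
    """Return ERC support and read/write limits in seconds (None if disabled)."""
    result = {"supported": False, "read": None, "write": None}
    if "SCT Error Recovery Control command not supported" in report:
        return result
    match = _ERC_SECTION.search(report)
    if match is None:
        return result
    result["supported"] = True
    for key, value in (("read", match.group(1)), ("write", match.group(2))):
        # Values are reported in units of 100 ms; "Disabled" stays None.
        result[key] = int(value) / 10.0 if value.isdigit() else None
    return result


if __name__ == "__main__":
    sample = (
        "SCT Error Recovery Control:\n"
        "           Read:     70 (7.0 seconds)\n"
        "          Write:     70 (7.0 seconds)\n"
    )
    print(parse_erc(sample))
```

A drive reporting "supported" with both limits set needs no special
action; one that is unsupported, or supported but disabled, is the case
where the timeout mismatch has to be dealt with.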

Phil
