Re: Recent drive errors - Thomas Fjellstrom

From: Thomas Fjellstrom <thomas@fjellstrom.ca>
To: Phil Turmel <philip@turmel.org>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Recent drive errors
Date: Tue, 19 May 2015 08:32:21 -0600
Message-ID: <2278721.7NtVspC26F@balsa>
In-Reply-To: <555B3948.1030602@turmel.org>

On Tue 19 May 2015 09:23:20 AM Phil Turmel wrote:
> On 05/19/2015 08:50 AM, Thomas Fjellstrom wrote:
> > On Tue 19 May 2015 08:34:55 AM Phil Turmel wrote:
> >> Based on the smart report, this drive is perfectly healthy.  A small
> >> number of uncorrectable read errors is normal in the life of any drive.
> > 
> > Is it perfectly normal for the same sector to be reported uncorrectable 5
> > times in a row like it did?
> 
> Yes, if you keep trying to read it.  Unreadable sectors stay unreadable,
> generally, until they are re-written.  That's the first opportunity the
> drive has to decide if a relocation is necessary.
> 
> > How many UREs are considered "ok"? Tens, hundreds, thousands, tens of
> > thousands?
> 
> Depends.  In a properly functioning array that gets scrubbed
> occasionally, or sufficiently heavy use to read the entire contents
> occasionally, the UREs get rewritten by MD right away.  Any UREs then
> only show up once.

I have made sure that it's doing regular scrubs, and regular SMART scans. This 
time...
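
For reference, this is roughly how a scrub can be kicked off by hand through
the md sysfs interface. It is only a sketch: the helper names and the array
name "md0" are made up for illustration, and writing to the real /sys/block
needs root.

```python
from pathlib import Path

def start_scrub(md_name: str, sys_block: Path = Path("/sys/block")) -> None:
    """Ask md to run a read-only consistency check ("scrub") on one array."""
    # Writing "check" to sync_action is how md is asked to start a scrub.
    (sys_block / md_name / "md" / "sync_action").write_text("check\n")

def scrub_state(md_name: str, sys_block: Path = Path("/sys/block")) -> str:
    """Return the current sync_action value, e.g. "idle" or "check"."""
    return (sys_block / md_name / "md" / "sync_action").read_text().strip()
```

As root, start_scrub("md0") followed by scrub_state("md0") would show whether
the check is running. Many distros already ship a cron job or timer that does
the same thing on a schedule.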

> In a desktop environment, or non-raid, or improperly configured raid,
> the UREs will build up, and get reported on every read attempt.
> 
> Most consumer-grade drives claim a URE average below 1 per 1E14 bits
> read.  So by the end of their warranty period, getting one every 12TB
> read wouldn't be unusual.  This sort of thing follows a Poisson
> distribution:
> 
> http://marc.info/?l=linux-raid&m=135863964624202&w=2
> 
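
To get a feel for those numbers, here is a rough sketch of the Poisson
estimate as I understand it. The function name is made up, and it assumes
UREs arrive independently at exactly the spec-sheet rate of 1 per 1E14 bits,
which real drives need not follow.

```python
import math

def ure_probability(bytes_read: float, bits_per_ure: float = 1e14) -> float:
    """Chance of at least one URE while reading bytes_read bytes.

    Assumes a Poisson process at the quoted spec rate (an assumption).
    """
    expected = bytes_read * 8 / bits_per_ure  # mean number of UREs
    return -math.expm1(-expected)             # same as 1 - exp(-expected)

# One full pass over a 3TB drive, the 11TB nas, and exactly 1E14 bits.
for tb in (3, 11, 12.5):
    p = ure_probability(tb * 1e12)
    print(f"{tb:>5} TB read: P(at least one URE) = {p:.2f}")
```

Reading exactly 1E14 bits (12.5TB) gives 1 - 1/e, so about a 63% chance of
hitting at least one URE under these assumptions.
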
> > These drives have been barely used. Most of their life, they were either
> > off, or not actually being used. (it took a while to collect enough 3TB
> > drives, and then find time to build the array, and set it up as a regular
> > backup of my 11TB nas).
> 
> While being off may lengthen their life somewhat, the magnetic domains
> on these things are so small that some degradation will happen just
> sitting there.  Diffusion in the p- and n-doped regions of the
> semiconductors is also happening while sitting unused, degrading the
> electronics.
> 
> >> It has no relocations, and no pending sectors.  The latency spikes are
> >> likely due to slow degradation of some sectors that the drive is having
> >> to internally retry to read successfully.  Again, normal.
> > 
> > The latency spikes are /very/ regular and there are quite a lot of them.
> > See: http://i.imgur.com/QjTl6o3.png
> 
> Interesting.  I suspect that if you wipe that disk with noise, read it
> all back, and wipe it again, you'll have a handful of relocations.

It looks like each one of the blocks in that display is 128KiB, which I think
means those red blocks aren't very far apart, maybe 80MiB. Would it
reallocate all of those? That'd be a lot of reallocated sectors.
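
Back-of-the-envelope version of that worry, as a sketch. It assumes the slow
spots repeat every 80MiB across the whole drive and that "3TB" means 3 * 10^12
bytes; both are guesses from eyeballing the graph, not measurements.

```python
DRIVE_BYTES = 3 * 10**12     # "3TB" taken as decimal terabytes (assumption)
SPACING_BYTES = 80 * 2**20   # eyeballed ~80MiB between red blocks
BLOCK_BYTES = 128 * 2**10    # one block in the latency display

def slow_spots(drive_bytes: int = DRIVE_BYTES,
               spacing_bytes: int = SPACING_BYTES) -> int:
    """Number of slow spots if one occurs every spacing_bytes."""
    return drive_bytes // spacing_bytes

print("display blocks between red ones:", SPACING_BYTES // BLOCK_BYTES)
print("slow spots across the drive:", slow_spots())
```

That works out to a bit over 35,000 spots, so if every one of them turned
into a relocation it really would be a lot.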

> Your latency test will show different numbers then, as the head will
> have to seek to the spare sector and back whenever you read through one
> of those spots.
> 
> Or the rewrites will fix them all, and you'll have no further problems.
> Hard to tell.  Bottom line is that drives can't fix any problems they
> have unless they are *written* in previously identified problem areas.
> 
> >> I own some "DM001" drives -- they are unsuited to raid duty as they
> >> don't support ERC.  So, out of the box, they are time bombs for any
> >> array you put them in.  That's almost certainly why they were ejected
> >> from your array.
> >> 
> >> If you absolutely must use them, you *must* set the *driver* timeout to
> >> 120 seconds or more.
> > 
> > I've been planning on looking into the ERC stuff. I now actually have some
> > drives that do support ERC, so it'll be interesting to make sure
> > everything is set up properly.
> 
> You have it backwards.  If you have WD Reds, they are correct out of the
> box.  It's when you *don't* have ERC support, or you only have desktop
> ERC, that you need to take special action.

I was under the impression you still had to enable ERC on boot. And I
/thought/ I read that you still want to adjust the timeouts, though not to
the same values as for consumer drives.

> If you have consumer grade drives in a raid array, and you don't have
> boot scripts or udev rules to deal with timeout mismatch, your *ss is
> hanging in the wind.  The links in my last msg should help you out.

There was some talk of ERC/TLER and md. I'll still have to find or write a 
script to properly set up timeouts and enable TLER on drives capable of it 
(that don't come with it enabled by default).
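
Something along these lines is what I have in mind, as a first sketch rather
than a tested boot script. It tries to set SCT ERC with
"smartctl -l scterc,70,70" and, if smartctl reports failure, raises the driver
timeout through /sys/block/<dev>/device/timeout instead. The function names,
the 7.0 second ERC value and the 180 second fallback are my own assumptions
(the advice above was 120 seconds or more).

```python
import subprocess
from pathlib import Path

ERC_DECISECONDS = 70      # 7.0 s read/write ERC; assumed value
FALLBACK_TIMEOUT_S = 180  # assumed; the thread says 120 s or more

def try_enable_erc(device: str) -> bool:
    """Try to set SCT ERC via smartctl; True if it exits with status 0."""
    cmd = ["smartctl", "-l",
           f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}", device]
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        return False  # smartctl not installed: treat as "no ERC"
    return result.returncode == 0

def configure_drive(name: str, enable_erc=try_enable_erc,
                    sys_block: Path = Path("/sys/block")) -> str:
    """Enable ERC if possible, else lengthen the kernel driver timeout."""
    if enable_erc(f"/dev/{name}"):
        return "erc"
    timeout_file = sys_block / name / "device" / "timeout"
    timeout_file.write_text(f"{FALLBACK_TIMEOUT_S}\n")
    return "timeout"

def main() -> None:
    """Apply the policy to every sd* block device (needs root)."""
    for entry in sorted(Path("/sys/block").glob("sd*")):
        print(entry.name, configure_drive(entry.name))
```

Calling main() from a boot script or udev rule would cover every sd* device.
The enable_erc and sys_block parameters are only there so the logic can be
tried against a scratch directory without touching real drives.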

> Also, I noticed that you used "smartctl -a" to post a complete report of
> your drive's status.  It's not complete.  You should get in the habit of
> using "smartctl -x" instead, so you see the ERC status, too.

Good to know. Thanks.

> Phil

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca
