Re: Recent drive errors - Thomas Fjellstrom

From: Thomas Fjellstrom <thomas@fjellstrom.ca>
To: Phil Turmel <philip@turmel.org>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Recent drive errors
Date: Tue, 19 May 2015 08:32:21 -0600
Message-ID: <2278721.7NtVspC26F@balsa>
In-Reply-To: <555B3948.1030602@turmel.org>

On Tue 19 May 2015 09:23:20 AM Phil Turmel wrote:
> On 05/19/2015 08:50 AM, Thomas Fjellstrom wrote:
> > On Tue 19 May 2015 08:34:55 AM Phil Turmel wrote:
> >> Based on the smart report, this drive is perfectly healthy.  A small
> >> number of uncorrectable read errors is normal in the life of any drive.
> > 
> > Is it perfectly normal for the same sector to be reported uncorrectable 5
> > times in a row like it did?
> 
> Yes, if you keep trying to read it.  Unreadable sectors stay unreadable,
> generally, until they are re-written.  That's the first opportunity the
> drive has to decide if a relocation is necessary.
> 
> > How many UREs are considered "ok"? Tens, hundreds, thousands, tens of
> > thousands?
> 
> Depends.  In a properly functioning array that gets scrubbed
> occasionally, or sufficiently heavy use to read the entire contents
> occasionally, the UREs get rewritten by MD right away.  Any UREs then
> only show up once.

I have made sure that it's doing regular scrubs, and regular SMART scans. This 
time...

> In a desktop environment, or non-raid, or improperly configured raid,
> the UREs will build up, and get reported on every read attempt.
> 
> Most consumer-grade drives claim a URE average below 1 per 1E14 bits
> read.  So by the end of their warranty period, getting one every 12TB
> read wouldn't be unusual.  This sort of thing follows a Poisson
> distribution:
> 
> http://marc.info/?l=linux-raid&m=135863964624202&w=2
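
To put rough numbers on that, here is a minimal sketch of the Poisson
estimate. It assumes the spec-sheet rate of 1 URE per 1E14 bits read is
the true average (the quoted figure is only an upper bound), that errors
are independent, and that sizes are decimal terabytes. The function names
are made up for illustration.

```python
import math

# Spec-sheet figure quoted above: fewer than 1 URE per 1E14 bits read.
URE_RATE_BITS = 1e14

def expected_ures(bytes_read, bits_per_ure=URE_RATE_BITS):
    """Mean number of UREs for a given amount of data read."""
    return bytes_read * 8 / bits_per_ure

def prob_at_least_one_ure(bytes_read, bits_per_ure=URE_RATE_BITS):
    """Poisson probability of one or more UREs: 1 - P(zero events)."""
    return 1.0 - math.exp(-expected_ures(bytes_read, bits_per_ure))

TB = 10**12  # decimal terabyte, as drive vendors count it
for label, size in [("one 3TB drive", 3 * TB), ("11TB nas", 11 * TB)]:
    print(f"{label}: mean {expected_ures(size):.2f} UREs, "
          f"P(at least one) = {prob_at_least_one_ure(size):.0%}")
```

Under those assumptions, 1E14 bits works out to 12.5TB, and a single full
read of a 3TB drive has roughly a one-in-five chance of hitting at least
one URE.
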
> 
> > These drives have been barely used. Most of their life, they were either
> > off, or not actually being used. (it took a while to collect enough 3TB
> > drives, and then find time to build the array, and set it up as a regular
> > backup of my 11TB nas).
> 
> While being off may lengthen their life somewhat, the magnetic domains
> on these things are so small that some degradation will happen just
> sitting there.  Diffusion in the p- and n-doped regions of the
> semiconductors is also happening while sitting unused, degrading the
> electronics.
> 
> >> It has no relocations, and no pending sectors.  The latency spikes are
> >> likely due to slow degradation of some sectors that the drive is having
> >> to internally retry to read successfully.  Again, normal.
> > 
> > The latency spikes are /very/ regular and there's quite a lot of them.
> > See: http://i.imgur.com/QjTl6o3.png
> 
> Interesting.  I suspect that if you wipe that disk with noise, read it
> all back, and wipe it again, you'll have a handful of relocations.

It looks like each one of the blocks in that display is 128KiB, which I think
means those red blocks aren't very far apart. Maybe 80MiB apart? Would it
reallocate all of those? That'd be a lot of reallocated sectors.
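
A back-of-the-envelope sketch of what "a lot" could mean. The 128KiB block
size and the 80MiB spacing are the guesses from the paragraph above, and
the assumption that the pattern repeats evenly across the whole 3TB drive
is untested.

```python
KIB = 1024
MIB = 1024 * KIB

drive_bytes = 3 * 10**12   # 3TB drive, decimal terabytes as marketed
block_bytes = 128 * KIB    # one block in the latency plot (estimate)
spacing_bytes = 80 * MIB   # guessed distance between red blocks

# How many plot blocks sit between two slow spots.
blocks_between = spacing_bytes // block_bytes

# How many slow spots there would be if the spacing held drive-wide.
slow_spots = drive_bytes // spacing_bytes

print(f"{blocks_between} plot blocks between slow spots")
print(f"about {slow_spots} slow spots across the drive")
```

That is an upper bound on the number of affected regions, not a count of
sectors that would actually be reallocated; a rewrite may simply fix most
of them in place.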

> Your latency test will show different numbers then, as the head will
> have to seek to the spare sector and back whenever you read through one
> of those spots.
> 
> Or the rewrites will fix them all, and you'll have no further problems.
>  Hard to tell.  Bottom line is that drives can't fix any problems they
> have unless they are *written* in previously identified problem areas.
> 
> >> I own some "DM001" drives -- they are unsuited to raid duty as they
> >> don't support ERC.  So, out of the box, they are time bombs for any
> >> array you put them in.  That's almost certainly why they were ejected
> >> from your array.
> >> 
> >> If you absolutely must use them, you *must* set the *driver* timeout to
> >> 120 seconds or more.
> > 
> > I've been planning on looking into the ERC stuff. I now actually have some
> > drives that do support ERC, so it'll be interesting to make sure
> > everything is set up properly.
> 
> You have it backwards.  If you have WD Reds, they are correct out of the
> box.  It's when you *don't* have ERC support, or you only have desktop
> ERC, that you need to take special action.

I was under the impression you still had to enable ERC on boot. And I
/thought/ I read that you still want to adjust the timeouts, though not to
the same values as for consumer drives.

> If you have consumer grade drives in a raid array, and you don't have
> boot scripts or udev rules to deal with timeout mismatch, your *ss is
> hanging in the wind.  The links in my last msg should help you out.

There was some talk of ERC/TLER and md. I'll still have to find or write a 
script to properly set up timeouts and enable TLER on drives capable of it 
(that don't come with it enabled by default).
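
A minimal sketch of what such a script could look like, not a tested
implementation. It assumes that "smartctl -l scterc,70,70" (7.0 seconds
for read and write) exits non-zero on drives without ERC support, and that
the driver timeout lives in /sys/block/<dev>/device/timeout. The 180
second fallback is my pick; the advice quoted earlier only says 120
seconds or more. All function names are hypothetical.

```python
import subprocess
from pathlib import Path

ERC_DECISECONDS = 70       # 7.0 s; scterc values are in tenths of a second
FALLBACK_TIMEOUT_S = 180   # assumed value, anything >= 120 s per the thread

def try_set_erc(dev):
    """Ask the drive to cap error recovery at 7 s; True if smartctl exits 0.

    Assumption: a non-zero exit status means ERC could not be set.
    """
    result = subprocess.run(
        ["smartctl", "-l",
         f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}", dev],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def plan_timeouts(devices, set_erc=try_set_erc):
    """Map each device to a driver timeout to write, or None if ERC is on."""
    plan = {}
    for dev in devices:
        plan[dev] = None if set_erc(dev) else FALLBACK_TIMEOUT_S
    return plan

def apply_plan(plan, sysfs_root=Path("/sys/block")):
    """Write the long driver timeout for drives that could not enable ERC."""
    for dev, timeout in plan.items():
        if timeout is not None:
            name = Path(dev).name  # "/dev/sdb" -> "sdb"
            path = sysfs_root / name / "device" / "timeout"
            path.write_text(f"{timeout}\n")
```

It would need to run as root on every boot (ERC settings and sysfs
timeouts don't survive a power cycle), for example from a boot script
calling apply_plan(plan_timeouts(["/dev/sda", "/dev/sdb"])). The set_erc
parameter is only there so the decision logic can be tried without real
drives.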

> Also, I noticed that you used "smartctl -a" to post a complete report of
> your drive's status.  It's not complete.  You should get in the habit of
> using "smartctl -x" instead, so you see the ERC status, too.

Good to know. Thanks.

> Phil
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca
