Re: proactive disk replacement

From: Wols Lists <antlists@youngman.org.uk>
To: David Brown <david.brown@hesbynett.no>,
	Reindl Harald <h.reindl@thelounge.net>,
	Adam Goryachev <mailinglists@websitemanagers.com.au>,
	Jeff Allison <jeff.allison@allygray.2y.net>
Cc: linux-raid@vger.kernel.org
Subject: Re: proactive disk replacement
Date: Tue, 21 Mar 2017 15:25:45 +0000
Message-ID: <58D145F9.1080405@youngman.org.uk>
In-Reply-To: <58D13598.50403@hesbynett.no>

On 21/03/17 14:15, David Brown wrote:
>> for most arrays the disks have a similar age and usage pattern, so when
>> the first one fails it becomes likely that it don't take too long for
>> another one and so load and recovery time matters

> False.  There is no reason to suspect that - certainly not to within the
> hours or day it takes to rebuild your array.  Disk failure pattern shows
> a peak within the first month or so (failures due to manufacturing or
> handling), then a very low error rate for a few years, then a gradually
> increasing rate after that.  There is not a very significant correlation
> between drive failures within the same system, nor is there a very
> significant correlation between usage and failures.

Except your argument and the claim don't match. You're right - disk
failures follow the pattern you describe. BUT.

If the array was created from completely new disks, then the usage
patterns will be very similar, so failures within the array will be
statistically correlated in a way that failures across the population as
a whole are not. (A bit like the way the chance of a false DNA match is
much higher in an inbred town than in a cosmopolitan city of immigrants.)

EVEN WORSE. The probability of all the drives coming off the same batch,
and sharing the same systematic defects, is much, much higher than for
drives picked at random. One only has to look at the Seagate 3TB
Barracuda mess to see a perfect example.

In other words, IFF your array is built of a bunch of identical drives
all bought at the same time, the risk of multiple failure is
significantly higher. How significant that is I don't know, but it is a
very valid reason for replacing your drives at semi-random intervals.
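
A rough way to see the shape of that effect, as a sketch only: suppose a
batch is either "good" or "bad", and disks from a bad batch are more
likely to fail during the rebuild window. The two-state model, the
function names and every number below are made-up assumptions for
illustration, not measured failure rates, and reusing the same per-window
probability in the Bayes step is a deliberate simplification.

```python
def second_failure_prob(n_disks, p_window):
    """Chance that at least one of the remaining n_disks - 1 disks fails
    during the rebuild window, if each fails independently with p_window."""
    return 1.0 - (1.0 - p_window) ** (n_disks - 1)


def posterior_bad_batch(prior_bad, p_good, p_bad):
    """Bayes' rule: probability the batch is bad, given one disk has failed.
    Assumes p_good > 0 or prior_bad * p_bad > 0 (otherwise division by zero)."""
    bad = prior_bad * p_bad
    good = (1.0 - prior_bad) * p_good
    return bad / (bad + good)


def second_failure_prob_same_batch(n_disks, prior_bad, p_good, p_bad):
    """Same question as second_failure_prob, but all disks share one batch
    whose quality is unknown; the first failure updates our belief."""
    post = posterior_bad_batch(prior_bad, p_good, p_bad)
    return (post * second_failure_prob(n_disks, p_bad)
            + (1.0 - post) * second_failure_prob(n_disks, p_good))


if __name__ == "__main__":
    # Hypothetical inputs: 6-disk array, 2% of batches bad, per-window
    # failure chance 0.1% for a good batch and 5% for a bad one.
    n, prior, p_good, p_bad = 6, 0.02, 0.001, 0.05
    print("independent disks:", second_failure_prob(n, p_good))
    print("same batch       :", second_failure_prob_same_batch(n, prior, p_good, p_bad))
```

The point is only qualitative: once one disk from a shared batch has
died, the odds that the batch is a bad one go up, and with them the odds
of a second failure during the rebuild. With disks from mixed batches
that update does not carry over to the survivors.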

(Completely off topic :-) but a real-world demonstrable example is
couples' initials. "Like chooses like" and if you compare a couple's
first initials against what you would expect from a random sample, there
is a VERY significant spike in couples that share the same initial.)

To put it bluntly, if your array consists of disks with near-identical
characteristics (including manufacturing batch), then your chances of
random multiple failure are noticeably increased. Is it worth worrying
about? If you can do something about it, of course!

Cheers,
Wol
