md dropping disks too early (was: Use RAID-6!)


From: Ben Bucksch <linux.news@bucksch.org>
To: Linux RAID <linux-raid@vger.kernel.org>
Subject: md dropping disks too early (was: Use RAID-6!)
Date: Wed, 17 Apr 2013 01:42:09 +0200
Message-ID: <516DE1D1.1050704@bucksch.org>
In-Reply-To: <15345091.8.1366130671716.JavaMail.root@zimbra>

The purpose of my RAID system is 1) to protect against hardware disk
failures, i.e. a hard drive that is entirely broken and won't read at
all anymore. I know that this *will* happen at some point, but it's
still a fairly rare event. The chance that 2 out of 8 drives go bad *in
the same week* (!) is very small.
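
To put a rough number on "very small", here is a back-of-the-envelope
sketch. It assumes that drives fail independently of each other and
uses a purely hypothetical 5% annual failure rate per drive; neither
assumption is measured on my hardware, and the function name is made up
for this illustration.

```python
from math import comb


def prob_at_least_k_fail(n: int, k: int, p: float) -> float:
    """Probability that at least k of n drives fail, assuming each drive
    fails independently with probability p in the period considered."""
    # Sum the binomial probabilities for fewer than k failures, then
    # take the complement.
    fewer = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - fewer


# ASSUMPTION: hypothetical 5% annual failure rate per drive.
ANNUAL_FAILURE_RATE = 0.05

# Convert the annual rate into a per-week failure probability, assuming
# a constant failure rate over the year.
p_week = 1.0 - (1.0 - ANNUAL_FAILURE_RATE) ** (1.0 / 52.0)

# Chance that 2 or more of 8 drives fail within one given week.
p_double = prob_at_least_k_fail(8, 2, p_week)

if __name__ == "__main__":
    print(f"per-drive weekly failure probability: {p_week:.6f}")
    print(f"P(>= 2 of 8 drives fail in one week): {p_double:.2e}")
```

The independence assumption is the weak point: anything that stresses
or affects all drives at once makes simultaneous failures far more
likely than this estimate suggests.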

I am also concerned about 2) bit errors and silently broken sectors, and 
want my RAID to detect and fix those. I am not sure that Linux md does that.

There is also a good chance that a controller or some wiring goes bad
and many disks appear to fail at the same time. Neither RAID5 nor RAID6
will protect against that, but a re-cabling should fix it without data
loss, as the data on the disks is not affected.

Given that this RAID array is for my personal use, the number of disk
slots in a machine is limited, and drives need 24/7 power, too, RAID5
is the right choice for me.

---

BUT - and this is the main purpose of my post - Linux md causes problems 
by itself:

In my case, and from what I read in other posts in forums and on this
mailing list, many people have the problem that Linux md simply drops a
disk from the RAID5, even though there was NOT an unrecoverable hardware
failure. There are many situations where this happens:

 1. Upgrade (my case)
 2. Disk temporarily not accessible
 3. Disk has bad sectors (but the other content can still be read)

None of these should be fatal. But it seems that md marks the disk as 
faulty and requires a resync. There does not seem to be any way to get a 
disk that was once marked spare or faulty back into the array, unless I 
do a resync. (If somebody knows a way, please show me, see thread 'Disk 
wrongly marked "spare", need to force re-add it'.) Now, the resync needs 
to read all data from all disks and can be the event that uncovers a 
problem with one of the other disks. That disk is then dropped as well, 
again with no way to re-add, and the array is entirely lost. However, 
that is completely unnecessary, given that there are often only a few 
bad sectors, and these - while bad - are no reason to say goodbye to 
several TB of data.

Essentially, by being overly cautious with the data, dropping disks too
early and being too hasty about it, md actually achieves the opposite
of what it was made for. It was intended to protect my data against
disk problems, but md actually turns minor or even temporary problems
into total data loss.

I'm not overstating, because that's the exact situation I am in right
now. I have only 1 disk that's actually failing, and a RAID5, so in
theory I am fine. But I see no way to safely get at my data anymore. My
array is offline and I have no idea how to get it online again without
risking the loss of all data.

And worst of all: the whole situation was triggered by md dropping a
disk from the array that wasn't even failing, just because I
upgraded. :-(

Ben
