From: Maarten <maarten@ultratux.net>
To: Peter Grandi <pg_lxra@lxra.for.sabi.co.UK>
Cc: Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: Raid6 array crashed-- 4-disk failure...(?)
Date: Mon, 15 Sep 2008 18:57:53 +0200
Message-ID: <48CE9411.4060201@ultratux.net>
In-Reply-To: <18638.16613.435533.269946@tree.ty.sabi.co.uk>
Peter Grandi wrote:
>> This weekend I promoted my new 6-disk raid6 array to
>> production use and was busy copying data to it overnight. The
>> next morning the machine had crashed, and the array is down
>> with an (apparent?) 4-disk failure, [ ... ]
>
> Multiple drive failures are far more common than people expect,
> and the problem lies in people's expectations, because they don't
> do common mode analysis (what's what? many will think).
It IS more common indeed. I'm on my seventh or eighth raid-5 array now;
the first was a 4-disk raid5 of 40 GB drives (120 GB total). I've had 4 or 5
two-disk failures happen to me over the years, invariably during a rebuild.
This is why I'm switching over to raid-6, by the way.
I did not, at any point, lose the array with the two-disk failures
though. I intelligently cloned bad drives with dd_rescue and reassembled
those degraded arrays using the new disks and thus got my data back.
But still, such events tend to keep me busy for a whole weekend, which
is not too pleasant.
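For what it's worth, those recoveries went roughly along the following lines
(the device names here are only examples, not the actual ones from back then):

  # Clone the failing member onto the fresh disk; -A writes zeroed blocks
  # where reads fail so the output stays aligned, -v is verbose.
  dd_rescue -A -v /dev/sdX /dev/sdY

  # Re-assemble the degraded array with the clone standing in for the
  # failing disk, forcing members back in despite stale event counts.
  mdadm --assemble --force /dev/md0 /dev/sdY1 /dev/sdb1 /dev/sdc1 /dev/sdd1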
> They typically happen all at once at power up, or in short
> succession (e.g. 2nd drive fails while syncing to recover from
> 1st failure).
>
> The typical RAID has N drives from the same manufacturer, of the
> same model, with nearly contiguous serial numbers, from the same
> shipping carton, in an enclosure where they all are started and
> stopped at the same time, run on the same power circuit, at the
> same temperature, on much the same load, attached to the same
> host adapter or N of the same type. Expecting as many do to have
> uncorrelated failures is rather comical.
This is true. However, since I'm aware of this, I take care not to make the
setup too vulnerable: the system is very well cooled (eight 80mm fans for the
16(!) disks), and I buy disks in batches of two, from different brands and
vendors. It does have just one PSU, but I chose a good one; I think it's a
Tagan 550 Watt unit.
In fact (this is my home system), since I cannot afford a DLT drive for this
much data I have practically no backup, so I really spend a lot of effort
making sure the array stays OK. Yes, I know, this is not a good idea, but how
do I economically back up 3 TB?
In practice I have older disks and/or decommissioned arrays with "backups",
but these are of course not up to date at all.
> 1) Is my analysis correct so far ?
>
> Not so sure :-). Consider this interesting discrepancy:
>
> /dev/sda1:
> [ ... ]
> Raid Devices : 7
> Total Devices : 6
> [ ... ]
> Active Devices : 5
> Working Devices : 5
>
> /dev/sdb1:
> [ ... ]
> Raid Devices : 7
> Total Devices : 6
> [ ... ]
> Active Devices : 6
> Working Devices : 6
>
> Also note that member 0, 'sdk1' is listed as "removed", but not
> faulty, in some member statuses. However you have been able to
> actually get the status out of all members, including 'sdk1',
> which reports itself as 'active', like all other drives as of
> 5:16. Then only 2 drives report themselves as 'active' as of
> 5:17, and those think that the array has 5 'active'/'working'
> devices at that time. What happened between 5:16 and 5:17?
Don't know, I was asleep ;-)
Seriously, the system experienced a hard crash. Not even the keyboard
responded to the Caps Lock key/LED anymore. Logs are empty.
> You should look at your system log to figure out what really
> happened to your drives and then assess what the cause of the
> failure was and its impact.
Syslogs are empty. Not a single line, nor even a hint, around that time.
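So the only forensic data left is the superblocks themselves. I'll dump the
update times and event counters of all members once more and compare them,
something along these lines (the exact member devices may differ from this
glob; mdadm will just complain about any non-member partitions):

  mdadm --examine /dev/sd[a-l]1 | egrep '^/dev|Update Time|Events'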
> 3) Should I say farewell to my ~2400 GB of data ? :-(
>
> Surely not -- you have a backup of those 2400GB, as obvious from
> "busy copying data to it". RAID is not backup anyhow :-).
Yes I have most of the data. What I'd lose is ~20 GB, which is less than
one percent ;-). But still, it's a lot of bytes...
> 4) If it was only a one-drive failure, why did it kill the array ?
>
> The MD subsystem marked as bad more than one drive. Anyhow doing
> a 5+2 RAID6 and then loading it with data with a checksum drive
> missing, and at the same time as it is syncing, seems a bit too clever
> to me. Right now the array is running in effect in RAID0 mode, so
> I would not trust it even if you are able to restart it.
I just bought a seventh disk as a replacement... but if the array is lost,
that is of little use. I'll try to reassemble later tonight...
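Roughly the plan, assuming the drives themselves still respond (device and
array names are again only examples; of course I'll use the real member list):

  # Stop whatever half-assembled state md may have left behind
  mdadm --stop /dev/md0

  # Force-assemble from the existing superblocks, accepting members with
  # slightly stale event counts, and start the array even if degraded
  mdadm --assemble --force --run /dev/md0 /dev/sd[abcdek]1

  # Look before trusting it: check the status and do a read-only fsck
  cat /proc/mdstat
  fsck -n /dev/md0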
Thanks,
Maarten