Re: making raid5 more robust after a crash?

From: Martin Cracauer <cracauer@cons.org>
To: Neil Brown <neilb@suse.de>
Cc: Chris Allen <chris@cjx.com>, linux-raid@vger.kernel.org
Subject: Re: making raid5 more robust after a crash?
Date: Mon, 20 Mar 2006 12:41:59 -0500
Message-ID: <20060320124159.A50675@cons.org>
In-Reply-To: <17435.9868.249413.113639@cse.unsw.edu.au>; from neilb@suse.de on Sat, Mar 18, 2006 at 08:13:48AM +1100

Neil Brown wrote on Sat, Mar 18, 2006 at 08:13:48AM +1100: 
> On Friday March 17, chris@cjx.com wrote:
> > Dear All,
> > 
> > We have a number of machines running 4TB raid5 arrays.
> > Occasionally one of these machines will lock up solid and
> > will need power cycling. Often when this happens, the
> > array will refuse to restart with 'cannot start dirty
> > degraded array'. Usually  mdadm --assemble --force will
> > get the thing going again - although it will then do
> > a complete resync.

First of all you need to make sure you can see the kernel messages
from this.  If /var/log/messages lives on the array affected you won't
see messages explaining what happens even if the kernel printed them.
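
A minimal sketch of how one could check this, in Python.  It assumes
mount data in the format of /proc/mounts, and the function names are
made up for illustration.  It only recognizes filesystems mounted
directly from a /dev/md* device, so LVM or any other layer on top of
md would be missed.

```python
def device_for_path(path, mounts_text):
    """Return (device, mount_point) for the mount that holds `path`.

    `mounts_text` is expected to look like the contents of /proc/mounts:
    one mount per line, device first, mount point second.
    """
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mount_point = fields[0], fields[1]
        # Compare with a trailing slash so /var does not match /variable.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Longest mount point wins; on a tie the later entry wins.
            if best is None or len(mount_point) >= len(best[1]):
                best = (device, mount_point)
    return best


def log_is_on_md(path, mounts_text):
    """True if `path` sits on a filesystem mounted straight from /dev/md*."""
    found = device_for_path(path, mounts_text)
    return found is not None and found[0].startswith("/dev/md")
```

On a live system you would pass in the text read from /proc/mounts
and the path of the log file, e.g. /var/log/messages.  If the answer
is True, the log goes away together with the array.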

What you see here is probably similar to a problem I just had: with
software RAID you are subject to errors below the RAID level that are
not disk errors.  In my case a BIOS problem on my board made the SATA
driver run out of space on requests for two of the disks in my RAID-5
simultaneously.  The driver had to report an error upstream, and the
RAID software on top of it cannot tell such a non-disk error from a
disk error.  It treats everything as a disk error and drops the
affected disks out of the array, because it has seen errors on
requests for two disks.

I have more info on my accident here:
http://forums.2cpu.com/showthread.php?t=73705

As I said, you need to have a logfile on a disk not in the array, or
(better) you need to be able to watch kernel messages on the console
when this happens.

It sounds to me like you have a problem similar to mine: a software
error above the disks but below the RAID level.

> > 
> > 
> > My question is: Is there any way I can make the array
> > more robust? I don't mind it losing a single drive and
> > having to resync when we get a lockup - but having to
> > do a forced assemble always makes me nervous, and means
> > that this sort of crash has to be escalated to a senior
> > engineer.

The re-sync is a big problem in itself, because physically losing a
drive during the re-sync will kill your array (unless it is the
re-syncing disk).
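
To illustrate why, here is a toy sketch of RAID-5 style XOR parity in
Python.  This is not the md driver's code; the function names and the
stripe layout (a plain list of equal-sized blocks, parity included)
are made up for illustration.  With one block missing from a stripe,
the XOR of the survivors gives it back; with two missing there is
nothing left to rebuild from.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, missing):
    """Rebuild the single missing block of a stripe from the survivors.

    `stripe` holds the data blocks plus the parity block; `missing` is
    the set of indexes that cannot be read.  Raises ValueError unless
    exactly one block is missing, since single parity cannot cover more.
    """
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one block")
    survivors = [blk for i, blk in enumerate(stripe) if i not in missing]
    return xor_blocks(survivors)
```

A disk that is still being re-synced counts as missing for every
stripe not yet rebuilt, so losing any other disk at that point puts
those stripes in the two-missing case.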

Martin
-- 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cracauer@cons.org>   http://www.cons.org/cracauer/
FreeBSD - where you want to go, today.      http://www.freebsd.org/
