From: Neil Brown <neilb@suse.de>
To: CoolCold <coolthecold@gmail.com>
Cc: Norman White <nwhite@stern.nyu.edu>, linux-raid@vger.kernel.org
Subject: Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
Date: Sat, 11 Sep 2010 07:24:46 +1000
Message-ID: <20100911072446.50833b79@notabene>
In-Reply-To: <AANLkTiknvajTtJuCBO927tGcNLB+4R+pXev68snrU4kr@mail.gmail.com>

On Fri, 10 Sep 2010 23:39:08 +0400
CoolCold <coolthecold@gmail.com> wrote:

> Neil, can you share that decision-making algorithm?

Nothing special really. I just confirmed that the metadata contents agreed
with what Norman said had happened. Sometimes people do other things to
their arrays first, don't realise the consequences, and don't report them.
So if I based my "just use --force, that should work" on what they tell me,
I would sometimes be wrong, because what they tell me is flawed. So I am
now more cautious and try to base it only on what mdadm tells me.

And I didn't really need to see the mdadm output. As I said before,
mdadm -Af should either do the right thing or nothing. But I know people
can be very protective of their multi-gigabyte data sets, so if I can be
a bit more definite, it helps them.
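
Spelled out with the device names from this thread, that is:

  mdadm -Af /dev/md0 /dev/sd[b-p]

where -Af is just shorthand for --assemble --force.
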
Also, I like to see the mdadm -E output from odd failures, as it might show
me something that needs fixing. Largely as a consequence of watching how
people interact with their failed arrays, the next release of mdadm will be
a lot more cautious about letting you add a device to an array, particularly
a failed array. People often seem to do this thinking it means 'add this
device back in', but it really means 'make this a spare and attach it to
the array'.
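
To illustrate the difference (an illustrative device name, not a recipe for
this array; whether --re-add can restore the old slot depends on what the
metadata still records):

  mdadm /dev/md0 --add /dev/sdq1     # attach sdq1 as a spare; a degraded
                                     # array will then rebuild onto it
  mdadm /dev/md0 --re-add /dev/sdq1  # try to return sdq1 to the slot it
                                     # previously occupied
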
In this particular case, I checked that what each device thought of its own
role in the array was compatible with what the newest device thought, where
'compatible' means that either they agreed, or the newer one reported a
clean failure. If any device had thought it was a spare when it shouldn't
have, or two devices had both thought they filled the same role, that would
have been a concern.
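
For anyone wanting to make the same comparison, something along these lines
is a reasonable sketch (assuming 0.90 metadata, where each superblock's own
role appears on the 'this' line; 1.x metadata labels it 'Device Role'
instead):

  for d in /dev/sd[b-p]; do
      echo "== $d"
      mdadm -E "$d" | egrep 'Events|State|^this'
  done

Then check that the Events counts and per-device states tell one consistent
story.
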
NeilBrown

> We have servers with "lucky" aic9410 & LSI 1068E controllers which
> sometimes hang the system, after which drive(s) are dropped.
> In simple cases, when one drive has a different Events count, it's
> enough to force assembly. But in other cases, when, say, 8 drives are
> dropped, as in Kyler's case
> ( http://marc.info/?l=linux-raid&m=127534131202696&w=2 ), and examine
> shows different info for drives from the same array, e.g.
> http://lairds.us/temp/ucmeng_md/20100526/examine_sdj1 versus
> http://lairds.us/temp/ucmeng_md/20100526/examine_sda1 , and given that
> drives can come up in a different order on every boot, it's not so
> straightforward to work out the "right" options.
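
Incidentally, one way to rank the members by Events count regardless of
which letters the drives came up with on a given boot would be something
like this untested sketch:

  for d in /dev/sd?1; do
      echo "$(mdadm -E "$d" | awk '/Events/ {print $NF}') $d"
  done | sort -n

The device with the highest count holds the freshest metadata, whatever
name it has today.
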
>
> I promise to put that knowledge on the wiki.
>
> On Thu, Sep 9, 2010 at 1:35 AM, Neil Brown <neilb@suse.de> wrote:
> > On Wed, 08 Sep 2010 13:22:30 -0400
> > Norman White <nwhite@stern.nyu.edu> wrote:
> >
> >> We have a 15 drive Addonics array with three 5-port SATA port
> >> multiplexors. One of the SAS cables to one of the port multiplexors
> >> was knocked out, and now (after fixing the cabling problem) mdadm
> >> sees 9 drives, a spare, and 5 failed, removed drives.
> >>
> >> An mdadm -E on each of the drives shows the 5 drives that were
> >> uncabled still seeing the original configuration of 14 drives plus a
> >> spare, while the other 10 drives report 9 drives, a spare, and 5
> >> failed, removed drives.
> >>
> >> We are very confident that there was no I/O going on at the time,
> >> but we are not sure how to proceed.
> >>
> >> One obvious thing to do would be to just run:
> >>
> >> mdadm --assemble --force --assume-clean /dev/md0 sd[b,c, ... , p]
> >> but we are getting different advice about what force will do in this
> >> situation. The last thing we want to do is wipe the array.
> >
> > What sort of different advice? From whom?
> >
> > This should either do exactly what you want, or nothing at all. I suspect
> > the former. To be more confident I would need to see the output of
> > mdadm -E /dev/sd[b-p]
> >
> > NeilBrown
> >
> >
> >>
> >> Another option would be to fiddle with the superblocks using mddump,
> >> so that they all see the same 15 drives in the same configuration,
> >> and then assemble it.
> >>
> >> Yet another suggestion was to recreate the array configuration and hope
> >> that the data wouldn't be touched.
> >>
> >> Still another suggestion was to create the array with one drive
> >> missing (so that it is degraded and won't start rebuilding).
> >>
> >> Any pointers on how to proceed would be helpful. Restoring 30TB
> >> takes a long time.
> >>
> >> Best,
> >> Norman White