Re: RAID6 dead on the water after Controller failure

From: Phil Turmel <philip@turmel.org>
To: Florian Lampel <florian.lampel@gmail.com>, linux-raid@vger.kernel.org
Subject: Re: RAID6 dead on the water after Controller failure
Date: Fri, 14 Feb 2014 15:35:57 -0500
Message-ID: <52FE7E2D.8020308@turmel.org>
In-Reply-To: <7A417EAE-106E-4541-941F-1002696F8735@gmail.com>

Hi Florian,

On 02/14/2014 11:19 AM, Florian Lampel wrote:
> Greetings,
> 
> The title says it all: two days ago my RAID6 lost an HDD (sdh). Not a problem, I thought, just let it reassemble and be done with it.
> 
> Unfortunately, my mainboard controller didn't seem to like that, and about 2 hours into the rebuild it showed me that the array was missing 5 drives (4 from the MB controller and the one that went south before).
> Being an admin for quite a while, I did not panic and have not yet issued a single command that writes to the RAID in any form.
> 
> Having read the wiki page about broken RAID arrays and some messages on the list, it became obvious that I should check with you guys before I do anything. The server is still running, but I intend to restart it after unplugging a SATA cable that I assume to be faulty.
> 
> Here are the relevant logs and outputs of mdadm as requested on the Wiki:
> 
> h__p://pastebin.com/1xweaLYG

Good report.  It even includes the mapping of serial numbers to devices!

To consolidate some critical parts:

sda1: WD-WMC300595645 probably device 8
sdb1: WD-WMC300314217 probably device 9
sdc1: WD-WMC300595957 probably device 10
sdd1: WD-WMC300313432 probably device 11
sde1: WD-WMC300595440 Active device 4
sdf1: WD-WMC300595880 Active device 5
sdg1: WD-WMC1T1521826 Active device 6
sdh1: WD-WMC300314126 spare, incomplete device 7
sdj1: WD-WMC300312702 Active device 0
sdk1: WD-WMC300248734 Active device 1
sdl1: WD-WMC300314248 Active device 2
sdm1: WD-WMC300585843 Active device 3
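
Since the sdX names can change once controllers or cables are swapped, it
may help to rebuild this mapping after every reboot.  A rough sketch,
assuming smartmontools is installed and the drives report an ATA-style
"Serial Number:" line in "smartctl -i" output; both function names are
made up for illustration:

```shell
#!/bin/sh
# Sketch only: print "device serial" pairs so array members can be
# tracked by serial number instead of by sdX name.

# serial_from_info: read `smartctl -i` output on stdin and print the
# value of the "Serial Number:" line (ATA-style output is assumed).
serial_from_info() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# list_serials: hypothetical helper; needs root and only reads from
# the drives.  Devices that do not exist are skipped.
list_serials() {
    for dev in /dev/sd[a-z]; do
        [ -b "$dev" ] || continue
        printf '%s %s\n' "$dev" "$(smartctl -i "$dev" | serial_from_info)"
    done
}

# Usage (as root):  list_serials
```

Nothing here writes to the disks, so it should be safe to run before any
assembly attempt.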

> sda, sdb, sdc and sdd can't be reached anymore by any means. I believe a restart might fix this, but I am not sure.
> 
> 2) I assume that I should do the following, in this order: 
> 
> 2.1) restart the machine and check all the cables etc.
> ---> and hope that /dev/sda, sdb, sdc and sdd will talk to me again.

Keep replacing controllers, cables, power supplies (anything except the
drives) until you can communicate with all of them.

Except /dev/sdh.  Its rebuild never finished, so it is no help.

Figure out what went wrong with the hardware.  After you get them all
talking, show us the missing mdadm --examine data and an exhaustive
smartctl report:

mdadm -E /dev/sd[abcd]1 >pastebin.txt

for x in /dev/sd[a-z] ; do echo "$x :" ; smartctl -x "$x" ; done >>pastebin.txt

> 2.2) mdadm --assemble --scan 
> ---> and hope for the best. I don't think it will work.

Don't bother.  It certainly won't work now that four drives will have
different event counts.  "--scan" is less than useful in these cases, too.
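
To see how far apart the members are, the event counts can be pulled out
of the --examine output.  A minimal sketch, assuming v1.x metadata where
"mdadm -E" prints a line of the form "Events : N"; the function names are
hypothetical:

```shell
#!/bin/sh
# Sketch only: compare the event counters of the array members.

# events_from_examine: read `mdadm -E` output for ONE device on stdin
# and print the value of its "Events" line.
events_from_examine() {
    awk -F' : ' '$1 ~ /^ *Events$/ { print $2; exit }'
}

# show_events: hypothetical helper; prints "partition events" for each
# partition given as an argument.  Read-only, but needs root.
show_events() {
    for part in "$@"; do
        printf '%s %s\n' "$part" "$(mdadm -E "$part" | events_from_examine)"
    done
}

# Usage (as root):  show_events /dev/sd[abcdefgjklm]1
```

Members that dropped out together should report the same count; a lower
count marks a member that stopped receiving updates earlier.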

> 2.3) mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 (since the Event count is the same) /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1
> --> I don't believe this one will work, either. When using --force, is the sequence of the HDDs in the command important?

This is the right tool.  Order doesn't matter, as the metadata carries
the member ID.  Leave out /dev/sdh1 (or wherever WD-WMC300314126 ends up).

mdadm -Afv /dev/md0 /dev/sd[abcdefgjklm]1

If it fails, show us the output.
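
If the device names have moved around after the hardware changes, the
member list can be derived from the serial numbers instead of typed by
hand.  A sketch under stated assumptions: the input is "device serial"
pairs, one per line, from whatever tool you trust (something like
"lsblk -dnpo NAME,SERIAL" may work, but check that its serials match the
smartctl ones), the array always lives on partition 1, and the function
name is made up:

```shell
#!/bin/sh
# Sketch only: turn "device serial" pairs into the list of member
# partitions to pass to mdadm.

# Serials of the eleven usable members, copied from the list above.
# WD-WMC300314126 (the incomplete spare) is deliberately left out.
MEMBER_SERIALS="WD-WMC300595645 WD-WMC300314217 WD-WMC300595957 \
WD-WMC300313432 WD-WMC300595440 WD-WMC300595880 WD-WMC1T1521826 \
WD-WMC300312702 WD-WMC300248734 WD-WMC300314248 WD-WMC300585843"

# resolve_members: read "device serial" pairs on stdin and print the
# first partition of every device whose serial is in MEMBER_SERIALS.
# Anything else (spare, system disk, unknown drives) is dropped.
resolve_members() {
    awk -v want="$MEMBER_SERIALS" '
        BEGIN { n = split(want, s, " "); for (i = 1; i <= n; i++) ok[s[i]] = 1 }
        ($2 in ok) { print $1 "1" }'
}

# Usage: review the list first, then assemble only if it shows
# exactly eleven partitions:
#   lsblk -dnpo NAME,SERIAL | resolve_members
#   mdadm -Afv /dev/md0 $(lsblk -dnpo NAME,SERIAL | resolve_members)
```

Using an allow-list of known serials, rather than excluding one, keeps
unrelated disks out of the command line.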

> 2.4) mdadm --create --assume-clean --chunk=512 --metadata=1.0 --level 6 --raid-devices=12 --size=1953512960 /dev/md0 /dev/sdj1 /dev/sdk1 /dev/sdl1 etc. (using the sequence numbers of the /proc/mdstat pasted above)

Do *not* do this!  You have metadata.  You have enough drives to run the
array.  Re-creating the array is *madness*.

HTH,

Phil
