linux-raid.vger.kernel.org archive mirror
* recovering from a controller failure
@ 2010-05-29 19:07 Kyler Laird
  2010-05-29 19:46 ` Berkey B Walker
  2010-05-29 21:18 ` Richard
  0 siblings, 2 replies; 23+ messages in thread
From: Kyler Laird @ 2010-05-29 19:07 UTC (permalink / raw)
  To: linux-raid

Recently a drive failed on one of our file servers.  The machine has
three RAID6 arrays (15 1TB each plus spares).  I let the spare rebuild
and then started the process of replacing the drive.

Unfortunately I'd misplaced the list of drive IDs so I generated a new
list in order to identify the failed drive.  I used "smartctl" and made
a quick script to scan all 48 drives and generate pretty output.  That
was a mistake.  After running it a couple of times, one of the
controllers failed and several disks in the first array were marked as
failed.
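
The scan script itself isn't shown in the thread; a minimal sketch of
the kind of smartctl loop described above (the device range, output
format, and serial-number focus are all assumptions, not the author's
actual script) might look like:

```shell
# Hypothetical reconstruction of a drive-ID scan -- NOT the author's
# script.  "smartctl -i" prints the identify page, which includes the
# drive's serial number.
for dev in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
    [ -b "$dev" ] || continue   # skip glob patterns that matched nothing
    serial=$(smartctl -i "$dev" | awk -F: '/Serial Number/ { gsub(/ /, "", $2); print $2 }')
    printf '%s\t%s\n' "$dev" "$serial"
done
```

As the thread shows, even read-only SMART queries like this can be
enough to upset a marginal controller.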

I worked on the machine for a while.  (It has an NFS root.)  I got some
information from it before it rebooted (via watchdog).  I've dumped all
of the information here.
	http://lairds.us/temp/ucmeng_md/

In mdstat_0 you can see the status of the arrays right after the
controller failure.  mdstat_1 shows the status after reboot.

sys_block shows a listing of the block devices so you can see that the
problem drives are on controller 1.
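
That kind of drive-to-controller mapping can be reproduced locally; one
sketch, assuming the usual sysfs layout of a 2.6-era kernel:

```shell
# Show which controller each disk hangs off: resolve every sd* entry
# in sysfs to its physical (PCI) device path.  Disks behind the same
# controller share a common path prefix.
for dev in /sys/block/sd*; do
    [ -e "$dev" ] || continue   # no SCSI disks: glob stayed literal
    printf '%s -> %s\n' "${dev##*/}" "$(readlink -f "$dev/device")"
done
```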

The examine_sd?1 files show -E output from each drive in md0.  Note that
the Events count is different for the drives on the problem controller.
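
A quick way to spot that split is to pull just the event counts out of
the -E output (member device range assumed from the -E dumps above):

```shell
# Print the superblock event count for each member of md0.  Members
# whose count lags behind the rest were kicked out of the array first
# and have stale metadata.
for dev in /dev/sd[a-o]1; do
    [ -b "$dev" ] || continue
    printf '%s: ' "$dev"
    mdadm -E "$dev" | awk '/Events/ { print $3; exit }'
done
```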

I'd like to know if this is something I can recover.  I do have backups
but it's a huge pain to recover this much data.

Thank you.

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: recovering from a controller failure
@ 2010-05-31 18:27 Kyler Laird
  2010-06-01 15:49 ` Kyler Laird
  0 siblings, 1 reply; 23+ messages in thread
From: Kyler Laird @ 2010-05-31 18:27 UTC (permalink / raw)
  To: linux-raid

I appreciate the help that everyone here has been providing with this
frustrating problem.  It looks like there's agreement that I need to
use "--force" to assemble the array with the disk devices specified. 
Here's my first cut at a command to try:
	mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1
	http://lairds.us/temp/ucmeng_md/suggested_recovery
I'm sure I'm missing something.  Corrections are welcome.

(It would be comforting if mdadm had a "--dry-run" option.)
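
A close approximation available in stock mdadm, if your version
supports --readonly on assemble, is to bring the array up read-only
first so nothing is written to the members until the result has been
inspected.  A sketch (same device list as above, written as a glob; run
as root):

```shell
# Assemble forced but read-only: superblocks are read and the array
# appears, but no resync or writes are started.
mdadm --assemble --force --readonly /dev/md0 /dev/sd[a-o]1

# Inspect before committing to anything.
cat /proc/mdstat
mdadm --detail /dev/md0

# Only if the array looks sane, switch it to read-write.
mdadm --readwrite /dev/md0
```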

Thank you, all!

--kyler


end of thread, other threads:[~2010-06-01 19:15 UTC | newest]

Thread overview: 23+ messages
-- links below jump to the message on this page --
2010-05-29 19:07 recovering from a controller failure Kyler Laird
2010-05-29 19:46 ` Berkey B Walker
2010-05-29 20:44   ` Kyler Laird
2010-05-29 21:18 ` Richard
2010-05-29 21:36   ` Kyler Laird
2010-05-29 21:38     ` Richard
2010-05-29 21:45       ` Kyler Laird
2010-05-29 21:50         ` Richard
2010-05-30  0:15           ` Kyler Laird
2010-05-30  0:28             ` Richard
2010-05-30  0:54               ` Richard
2010-05-30  3:33             ` Leslie Rhorer
2010-05-30 13:17               ` CoolCold
2010-05-30 22:38                 ` Leslie Rhorer
2010-05-31  8:33                   ` CoolCold
2010-05-31  8:50                     ` Leslie Rhorer
2010-05-30 18:55               ` Richard Scobie
2010-05-30 22:23                 ` Leslie Rhorer
2010-05-29 21:59         ` Richard
2010-05-29 21:43   ` Berkey B Walker
  -- strict thread matches above, loose matches on Subject: below --
2010-05-31 18:27 Kyler Laird
2010-06-01 15:49 ` Kyler Laird
2010-06-01 19:15   ` Richard Scobie
