* the dreaded double disk failure
From: Mike Hardy @ 2005-01-13  7:14 UTC
  To: linux-raid


Alas, I've been bitten.

Worse, it came right after I used raidreconf to try to extend the array 
holding my backup, and it trashed the array instead. I know raidreconf 
is a use-at-your-own-risk tool, but since it was only the backup, I 
didn't mind.

Until I got this (partial mdadm -E output):

       Number   Major   Minor   RaidDevice State
this     7      91        1        7      active sync   /dev/hds1
    0     0      33        1        0      active sync   /dev/hde1
    1     1      34        1        1      active sync   /dev/hdg1
    2     2      56        1        2      active sync   /dev/hdi1
    3     3      57        1        3      faulty   /dev/hdk1
    4     4      88        1        4      active sync   /dev/hdm1
    5     5      89        1        5      faulty   /dev/hdo1
    6     6      90        1        6      active sync   /dev/hdq1
    7     7      91        1        7      active sync   /dev/hds1

/dev/hdk1 has at least one unreadable block around LBA 3,600,000 or so, 
and /dev/hdo1 has at least one unreadable block around LBA 8,000,000 or so.

Further, the array was resyncing when the first bad block hit (a power 
failure due to construction - yes, it's been one of those days - though 
it was actually in sync). But I know that all the data I care about was 
static at the time, so barring some fsck cleanup, all the important 
blocks should have correct parity.
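
(For reference, raid5 parity is plain XOR: any single missing chunk - 
data or parity alike - is the XOR of the same-numbered chunks on the 
other seven components. So as long as parity was in sync for a given 
stripe, anything unreadable in it should be recoverable.)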

Which is to say that I think my data exists, it's just a bit far away 
at the moment.

The first question is, would you agree?

Assuming it's there, my general plan to get my data out is this 
(commands sketched after the list):

1) resurrect the backup array
2) add one faulty drive to the array, with bad blocks there
    (an mdadm assemble with 7 of the 8, forced?)
3) start the backup, fully anticipating the read error and disk ejection
4) add the other faulty drive in, with bad blocks there
    (mdadm assemble with 7 of the 8, forced again?)
5) finish the backup
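
To make steps 2 and 4 concrete, the forced assembles might look 
something like this - a sketch only, assuming the array is /dev/md0 and 
using the component names from the -E output above. The first pass 
leaves out /dev/hdk1 (its bad spot at ~3.6M gets reconstructed from 
parity); the second leaves out /dev/hdo1 once it's been ejected:

    # pass 1: every component except /dev/hdk1
    mdadm --assemble --force --run /dev/md0 /dev/hde1 /dev/hdg1 \
        /dev/hdi1 /dev/hdm1 /dev/hdo1 /dev/hdq1 /dev/hds1

    # pass 2: stop, then every component except /dev/hdo1
    mdadm --stop /dev/md0
    mdadm --assemble --force --run /dev/md0 /dev/hde1 /dev/hdg1 \
        /dev/hdi1 /dev/hdk1 /dev/hdm1 /dev/hdq1 /dev/hds1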

The second question is, does that sound sane? Or is there a better way?

Finally, to get the main array healthy, I'm going to take note of which 
files kicked out which drives, and clobber them with the backed-up 
versions.
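
If it helps with that bookkeeping, and assuming an ext2/3 filesystem, 
debugfs can map a block back to a file - the numbers here are 
hypothetical, and translating a component LBA into a filesystem block 
on the md device still takes the raid chunk/layout arithmetic:

    debugfs -R "icheck 450000" /dev/md0    # fs block -> inode
    debugfs -R "ncheck 123456" /dev/md0    # inode -> pathname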

Alternately, how hard would it be to write a utility that inspected the 
array, took the LBA(s) of the bad blocks on one component, and 
reconstructed them for rewrite via parity? A very smart dd, in a way. 
Is that possible?
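
It seems like it should be. Here's a minimal sketch of the idea in 
Python - all assumptions mine: an 0.90 superblock (which lives at the 
end of each component, so array data starts at offset 0), sector-sized 
repairs, and every surviving component readable at that offset. The 
nice property is that the sector at a given offset on the missing disk 
is just the XOR of the sectors at the same offset on all the others, 
whether it held data or parity:

    #!/usr/bin/env python3
    # sketch of a "smart dd" for raid5: rebuild one unreadable sector
    # of a component by XORing the same sector on the other members.

    SECTOR = 512

    def rebuild_sector(survivors, lba):
        """XOR sector `lba` of every surviving component together."""
        buf = bytearray(SECTOR)
        for dev in survivors:
            with open(dev, 'rb') as f:
                f.seek(lba * SECTOR)
                chunk = f.read(SECTOR)
            for i in range(SECTOR):
                buf[i] ^= chunk[i]
        return bytes(buf)

    if __name__ == '__main__':
        # every member of the array except the one being repaired;
        # /dev/hdo1 is usable here since its bad spot is ~8M, not ~3.6M
        survivors = ['/dev/hde1', '/dev/hdg1', '/dev/hdi1', '/dev/hdm1',
                     '/dev/hdo1', '/dev/hdq1', '/dev/hds1']
        lba = 3600000           # bad sector on /dev/hdk1, partition-relative
        data = rebuild_sector(survivors, lba)
        # rewriting the sector should make the drive remap it
        with open('/dev/hdk1', 'r+b') as f:
            f.seek(lba * SECTOR)
            f.write(data)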

Lastly, I heard mention (from Peter Breuer, I think) of a raid5 patch 
that tolerates sector read errors and rewrites them automagically. Any 
info on that would be interesting.

Thanks for your time
-Mike
