linux-raid.vger.kernel.org archive mirror
* Lost mdadm RAID6 array and tried everything, hoping for any other suggestions you may have
@ 2011-09-22 22:07 Eduard Rozenberg
  2011-09-23  4:30 ` NeilBrown
  0 siblings, 1 reply; 3+ messages in thread
From: Eduard Rozenberg @ 2011-09-22 22:07 UTC (permalink / raw)
  To: linux-raid

Hello,

I've had very good luck with mdadm RAID1 over the years and it's really 
helped out.
More recently I got a bit more adventurous and tried out RAID6, but 
after my recent
experience with it I'm considering writing my take on "RAID6 considered 
dangerous" :)

Quick summary:

* Slackware64 13.37, Linux 2.6.37.6, on a Shuttle XPC with 4 GB RAM

* RAID6 array, 8 active drives in a chassis connected via 2 eSATA cables
    to the Shuttle PC, 12 TB total; worked fine for several months. eSATA
    controller with port multiplier support.

* Couple of days back, noticed the array was down, with the second half of
    the drives shown as down. Assumption - one cable or eSATA controller
    port hiccuped, taking 4 drives out of the array, or something happened
    due to the hot temps that day

* /proc/mdstat showed (S) next to some (or all, can't remember) of the 
drives in the array -
    I think that means spare, but I had no spares defined for the array 
so it seemed weird

* Rebooted the machine and checked smartctl status; all 8 drives in the
    chassis showed OK status, they were all accessible using gdisk, and the
    fd00 (Linux RAID) partitions appeared fine.

* Tried to reassemble normally, then with force; nothing happened - no
    errors, the array just didn't come up (rough commands are sketched
    after this list). Did not try --assume-clean (to my regret). Maybe it
    would have worked, will never know.

* Took some internet advice and tried --create to recreate the array;
    however I forgot which chunk size I had used, so I tried several times
    with different chunk sizes (some resync took place each time). Could
    not find any info on the Internet about whether the resyncs blew away
    my data.

* After each mdadm array recreate, tried to mount the array but failed 
with missing superblock

* dd'd a few GB from the array and tried to grep for text in a failed
    attempt to determine the chunk size

* Tried the testdisk utility to locate file system structures after
    recreating the array with various chunk sizes; didn't let the utility
    finish, but it didn't seem to be doing anything useful

* R-Studio - tried using it, but it didn't seem like it would do anything
    useful for me
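
For what it's worth, the reassembly attempts were roughly along these lines
(reconstructed from memory - the device names and /dev/md0 below are
placeholders, not necessarily what I actually typed):

    # stop any half-assembled remnant first
    mdadm --stop /dev/md0

    # plain assemble, then again with --force
    mdadm --assemble /dev/md0 /dev/sd[b-i]1
    mdadm --assemble --force /dev/md0 /dev/sd[b-i]1

    # per-member superblock details (level, chunk size, metadata version)
    mdadm --examine /dev/sdb1

In hindsight, saving the --examine output from every member before touching
anything would have pinned down the chunk size and metadata version.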

At this point the key questions I'm aware of:

* Did recreating the array with various chunk sizes blow away my data /
    file system structures? (I did not use --assume-clean when recreating
    the array.)

* If the data is still OK, is there a way to determine the chunk size that
    was used? I'm hoping the metadata version and bitmap options used would
    not affect being able to recover the array, because I don't remember
    which metadata and bitmap options I used, if any.

* Given the correct chunk size, if I recreate the array, is there some way
    to convince mount to mount the array, or some way of fixing the ext4
    structure, or any other way to get the data off the array other than
    the file carving utilities that dump everything in a bunch of random
    directories? (A sketch of the sort of thing I mean follows this list.)
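
To make the last question concrete, this is the sort of read-only checking
I was hoping is possible, assuming the array ends up at /dev/md0 (I don't
know whether any of this is actually sensible here):

    # ext4 keeps backup copies of its superblock; dumpe2fs lists them
    dumpe2fs /dev/md0 | grep -i superblock

    # read-only check against the primary, then against a backup superblock
    # (32768 is the usual first backup location for a 4 KB block size)
    fsck.ext4 -n /dev/md0
    fsck.ext4 -n -b 32768 /dev/md0

    # if a check ever looks sane, a read-only mount should be harmless
    mount -o ro /dev/md0 /mnt/recovery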

I'm well aware RAID != backups, and I had a backup, but it was a few months
old unfortunately. I didn't at all expect this failure mode of half the
disks disappearing and the array being so hard to recover. Most of the
information on the Internet is focused on array creation and management,
and I found precious little on recovery, some of which was wrong and
dangerous. At this point I do consider RAID6 to be dangerous and will avoid
it where possible. It just makes recovery so much harder when the file
system and data are broken up into little pieces.

Thanks in advance for any tips.

Regards,
--Ed


* Re: Lost mdadm RAID6 array and tried everything, hoping for any other suggestions you may have
  2011-09-22 22:07 Lost mdadm RAID6 array and tried everything, hoping for any other suggestions you may have Eduard Rozenberg
@ 2011-09-23  4:30 ` NeilBrown
  2011-10-28  0:08   ` Ryan Castellucci
  0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2011-09-23  4:30 UTC (permalink / raw)
  To: Eduard Rozenberg; +Cc: linux-raid



Did you keep the "mdadm -E" output from before you started recreating the
array?
If not, you are probably out of luck.

When you create a RAID5 array it will write to all the superblocks, and will
write over everything on the last device listed - basically treating it as a
spare while the rest is a degraded array, and rebuilding the spare.

So the first n-1 drives are probably largely untouched ... unless you changed
the order of devices when you created.

The superblock will be somewhere on the first drive.  Maybe you can find it.
That will at least tell you the data offset...
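
Assuming "the superblock" here means the filesystem superblock, something
along these lines could be used to hunt for it - a sketch only, with
/dev/sdb1 standing in for the first member and the candidate offsets being
nothing more than guesses:

    # the ext4 superblock magic 0xEF53 sits 1080 bytes (1024 + 56) past the
    # start of the filesystem; probe a few candidate data offsets (in
    # 512-byte sectors) and see where the magic turns up
    for off in 0 8 16 272 2048 262144; do
        printf "offset %7s sectors: " "$off"
        dd if=/dev/sdb1 bs=1 skip=$(( off * 512 + 1080 )) count=2 \
           2>/dev/null | xxd -p        # "53ef" here is a strong hint
    done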

Best advice:  As soon as something happens that you don't understand - ask
for help.  Once you don't understand what you are doing, anything you do will
probably make it worse.
Also : keep records of everything.

NeilBrown




* Re: Lost mdadm RAID6 array and tried everything, hoping for any other suggestions you may have
  2011-09-23  4:30 ` NeilBrown
@ 2011-10-28  0:08   ` Ryan Castellucci
  0 siblings, 0 replies; 3+ messages in thread
From: Ryan Castellucci @ 2011-10-28  0:08 UTC (permalink / raw)
  Cc: Eduard Rozenberg, linux-raid

FWIW, I've had the exact same issue - four of eight drives going offline
due to a loose cable.  If you force an assemble using --force
--assume-clean, then force an array re-sync, you should be able to mount
the filesystem.

Re-creating the array has probably done horrible things to your data
due to the partial resync, but I would expect partial recovery (of the
data not overwritten) to still be possible by brute forcing the disk
layout with repeated creates with the --assume-clean flag.  If
possible, buy another eight hard drives and clone the originals to
them, then try recovering from the clones.
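
A sketch of the sort of loop this suggests - the device names, device
order, metadata version and chunk size below are all guesses that would
need to be adapted:

    # clone each original onto a spare drive first and work only on the
    # clones (repeat per member; the map file lets ddrescue resume)
    ddrescue /dev/sdb /dev/sdj /root/sdb.map

    # try one candidate geometry; --assume-clean skips the initial resync
    mdadm --create /dev/md0 --assume-clean --level=6 --raid-devices=8 \
          --chunk=64 --metadata=1.2 /dev/sd[j-q]1

    # check read-only; if it looks wrong, stop and try the next chunk size
    # or device order
    fsck.ext4 -n /dev/md0
    mdadm --stop /dev/md0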

I'm not sure if you'll be able to do anything other than use a carving
tool to get your files back.

-Ryan

-- 
Ryan Castellucci http://ryanc.org/