public inbox for linux-kernel@vger.kernel.org
* Problem recovering a failed RAID5 array with 4 drives.
@ 2007-07-12 13:49 James
  2007-07-12 16:44 ` Lennart Sorensen
  2007-07-12 22:48 ` Neil Brown
  0 siblings, 2 replies; 9+ messages in thread
From: James @ 2007-07-12 13:49 UTC (permalink / raw)
  To: linux-kernel

My apologies if this is not the correct forum. If there is a better place to 
post this please advise.


Linux localhost.localdomain 2.6.17-1.2187_FC5 #1 Mon Sep 11 01:17:06 EDT 2006 
i686 i686 i386 GNU/Linux

(I was planning to upgrade to FC7 this weekend, but that is currently on hold 
because-)

I've got a problem with a software RAID5 array using mdadm.
Drive sdc failed, causing sda to appear failed as well. Both drives were marked 
as 'spare'.

What follows is a record of the steps I've taken and the results. I'm looking 
for some direction/advice to get the data back. 


I've tried a few cautious things to bring the array back up with the three 
good drives, with no luck. 

The last thing attempted had some limited success. I was able to get all 
drives powered up. I checked the Event count on the three good drives and 
they were all equal. So I assumed it would be safe to do the following. I 
hope I was not wrong. I issued the following commands to try to bring the 
array into a usable state.
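(As an aside, the event-count comparison just described can be scripted. This is a read-only sketch only: mdadm --examine does not modify the superblock, and the device names are assumed from this report.)

```shell
# Pull the "Events" line out of mdadm --examine output so the counts on
# the surviving members can be compared side by side (read-only check).
events_of() { grep -i 'Events'; }

# Guarded so the loop is skipped on machines without mdadm installed.
if command -v mdadm >/dev/null 2>&1; then
    for d in /dev/sda1 /dev/sdb1 /dev/sdd1; do
        printf '%s: %s\n' "$d" "$(mdadm --examine "$d" | events_of)"
    done
fi
```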




[]# mdadm --create --verbose /dev/md0 --assume-clean --level=raid5 --raid-devices=4 --spare-devices=0  /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

[]# /sbin/mdadm --misc --test --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Jul 11 08:03:20 2007
     Raid Level : raid5
     Array Size : 1465175808 (1397.30 GiB 1500.34 GB)
    Device Size : 488391936 (465.77 GiB 500.11 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jul 11 08:03:47 2007
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : e46beb22:37d329db:dd16ea76:29c07a23
         Events : 0.2

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8        1        2      active sync   /dev/sda1
       3       8       49        3      active sync   /dev/sdd1
[]# mdadm --fail /dev/md0 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0

[]# /sbin/mdadm --misc --test --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Jul 11 08:03:20 2007
     Raid Level : raid5
     Array Size : 1465175808 (1397.30 GiB 1500.34 GB)
    Device Size : 488391936 (465.77 GiB 500.11 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jul 11 14:37:56 2007
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : e46beb22:37d329db:dd16ea76:29c07a23
         Events : 0.3

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
      10       0        0        0      removed
       2       8        1        2      active sync   /dev/sda1
       3       8       49        3      active sync   /dev/sdd1

       4       8       33        -      faulty spare   /dev/sdc1



[]# mount /dev/md0 /opt
mount: wrong fs type, bad option, bad superblock on /dev/md0,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

In /var/log/messages
Jul 11 14:32:44 localhost kernel: EXT3-fs: md0: couldn't mount because of 
unsupported optional features (4000000).

[]# /sbin/fsck /dev/md0
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
fsck.ext3: Filesystem revision too high while trying to open /dev/md0
The filesystem revision is apparently too high for this version of e2fsck.
(Or the filesystem superblock is corrupt)


The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

[]# mke2fs -n /dev/md0
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
183156736 inodes, 366293952 blocks
18314697 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=369098752
11179 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968, 102400000, 214990848


I tried the following for all Superblock backups with the same result.

[]# e2fsck -b 214990848 /dev/md0
e2fsck 1.38 (30-Jun-2005)
/sbin/e2fsck: Invalid argument while trying to open /dev/md0

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>


Any advice/direction would be appreciated. 
Thanks much.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problem recovering a failed RAID5 array with 4 drives.
  2007-07-12 13:49 Problem recovering a failed RAID5 array with 4 drives James
@ 2007-07-12 16:44 ` Lennart Sorensen
  2007-07-12 20:21   ` James
  2007-07-12 22:48 ` Neil Brown
  1 sibling, 1 reply; 9+ messages in thread
From: Lennart Sorensen @ 2007-07-12 16:44 UTC (permalink / raw)
  To: James; +Cc: linux-kernel

On Thu, Jul 12, 2007 at 08:49:15AM -0500, James wrote:
> My apologies if this is not the correct forum. If there is a better place to 
> post this please advise.
> 
> 
> Linux localhost.localdomain 2.6.17-1.2187_FC5 #1 Mon Sep 11 01:17:06 EDT 2006 
> i686 i686 i386 GNU/Linux
> 
> (I was planning to upgrade to FC7 this weekend, but that is currently on hold 
> because-)
> 
> I've got a problem with a software RAID5 array using mdadm.
> Drive sdc failed, causing sda to appear failed as well. Both drives were marked 
> as 'spare'.
> 
> What follows is a record of the steps I've taken and the results. I'm looking 
> for some direction/advice to get the data back. 
> 
> 
> I've tried a few cautious things to bring the array back up with the three 
> good drives, with no luck. 
> 
> The last thing attempted had some limited success. I was able to get all 
> drives powered up. I checked the Event count on the three good drives and 
> they were all equal. So I assumed it would be safe to do the following. I 
> hope I was not wrong. I issued the following commands to try to bring the 
> array into a usable state.
> 
> 
> 
> 
> []# mdadm --create --verbose /dev/md0 --assume-clean --level=raid5 --raid-devices=4 --spare-devices=0  /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

Don't you want assemble rather than create if it already exists?

How did two drives fail at the same time?  Are you running PATA drives
with two drives on a single cable?  That is a no-no for RAID.  PATA
drive failures often take out the bus, and you never want two drives in a
single RAID array to share an IDE bus.

You probably want to try to assemble the non-failed drives first, and then
add in the new replacement drive afterwards, since after all it is NOT
clean.  Hopefully the RAID will accept sda back even though it appeared
failed.  Then you can add the new sdc to resync the array.
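The suggested order of operations could look roughly like this. It is a dry-run sketch only: each command is echoed rather than executed, and the device names are taken from the report above, so adjust them to your layout.

```shell
# Dry run: print each recovery step instead of executing it.
run() { echo "would run: $*"; }

# 1. Assemble a degraded array from the three members that still look
#    good, forcing sda1 back in if the event counts are close enough.
run mdadm --assemble --force --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1

# 2. Only after checking the filesystem read-only, hot-add a replacement
#    for the failed sdc to trigger the resync.
run mdadm --add /dev/md0 /dev/sdc1
```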

--
Len Sorensen


* Re: Problem recovering a failed RAID5 array with 4 drives.
  2007-07-12 16:44 ` Lennart Sorensen
@ 2007-07-12 20:21   ` James
  2007-07-12 21:41     ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: James @ 2007-07-12 20:21 UTC (permalink / raw)
  To: linux-kernel

> On Thu, Jul 12, 2007 at 08:49:15AM -0500, James wrote:
> > My apologies if this is not the correct forum. If there is a better place to 
> > post this please advise.
> > 
> > 
> > Linux localhost.localdomain 2.6.17-1.2187_FC5 #1 Mon Sep 11 01:17:06 EDT 2006 
> > i686 i686 i386 GNU/Linux
> > 
> > (I was planning to upgrade to FC7 this weekend, but that is currently on hold 
> > because-)
> > 
> > I've got a problem with a software RAID5 array using mdadm.
> > Drive sdc failed, causing sda to appear failed as well. Both drives were marked 
> > as 'spare'.
> > 
> > What follows is a record of the steps I've taken and the results. I'm looking 
> > for some direction/advice to get the data back. 
> > 
> > 
> > I've tried a few cautious things to bring the array back up with the three 
> > good drives, with no luck. 
> > 
> > The last thing attempted had some limited success. I was able to get all 
> > drives powered up. I checked the Event count on the three good drives and 
> > they were all equal. So I assumed it would be safe to do the following. I 
> > hope I was not wrong. I issued the following commands to try to bring the 
> > array into a usable state.
> > 
> > 
> > 
> > 
> > []# mdadm --create --verbose /dev/md0 --assume-clean --level=raid5 --raid-devices=4 --spare-devices=0  /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> 
> Don't you want assemble rather than create if it already exists?
> 
> How did two drives fail at the same time?  Are you running PATA drives
> with two drives on a single cable?  That is a no-no for RAID.  PATA
> drive failures often take out the bus, and you never want two drives in a
> single RAID array to share an IDE bus.
> 
> You probably want to try to assemble the non-failed drives first, and then
> add in the new replacement drive afterwards, since after all it is NOT
> clean.  Hopefully the RAID will accept sda back even though it appeared
> failed.  Then you can add the new sdc to resync the array.
> 
> --
> Len Sorensen
> 

I should have included more information. When I attempted to --assemble the 
array I received the following:

[]# mdadm --assemble [--force --run] /dev/md0 /dev/sda1 /dev/sdb1 [/dev/sdc1] /dev/sdd1
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error


From what I read I assumed I could use the --assume-clean option with --create 
to bring the array back, at least in some semblance of working order. 

I'd like to recover as much as possible from the RAID array. I actually have a 
nice new SATA configuration sitting here waiting to receive the data. This 
thing failed a day too early. I'm gnashing my teeth over this one. 

I'd truly appreciate any help/advice.



* Re: Problem recovering a failed RAID5 array with 4 drives.
  2007-07-12 20:21   ` James
@ 2007-07-12 21:41     ` Phil Turmel
  0 siblings, 0 replies; 9+ messages in thread
From: Phil Turmel @ 2007-07-12 21:41 UTC (permalink / raw)
  To: LinuxKernel; +Cc: linux-kernel

James wrote:
[snip /]
>>On Thu, Jul 12, 2007 at 08:49:15AM -0500, James wrote:
>>>I've tried a few cautious things to bring the array back up with the three 
>>>good drives, with no luck. 
>>>
[snip /]
> 
> mdadm --create --verbose /dev/md0 --assume-clean --level=raid5 --raid-devices=4 --spare-devices=0  /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> 
[snip /]
> 
> I should have included more information. When I attempted to --assemble the 
> array I received the following:
> 
> []# mdadm --assemble [--force --run] /dev/md0 /dev/sda1 /dev/sdb1 
> [/dev/sdc1]  /dev/sdd1
> mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
> 
> 
> From what I read I assumed I could use the --assume-clean option with --create 
> to bring the array back at least in some semblance of working order. 
> 
> I'd like to recover as much as possible from the RAID array. I actually have a 
> nice new SATA configuration sitting here waiting to receive the data. This 
> thing failed a day too early. I'm gnashing my teeth over this one. 
> 
> I'd truly appreciate any help/advice.
> 
Hi James,

mdadm allows you to specify "missing" in place of a failed device 
when assembling or creating arrays, like so:

mdadm --assemble /dev/md0 --run \
	/dev/sda1 /dev/sdb1 missing /dev/sdd1

I don't know if using --create has already trashed your array, 
but this is worth a try.  You may also want to try --force with 
the above.

HTH,

Phil



* Re: Problem recovering a failed RAID5 array with 4 drives.
  2007-07-12 13:49 Problem recovering a failed RAID5 array with 4 drives James
  2007-07-12 16:44 ` Lennart Sorensen
@ 2007-07-12 22:48 ` Neil Brown
  2007-07-12 23:10   ` James
  1 sibling, 1 reply; 9+ messages in thread
From: Neil Brown @ 2007-07-12 22:48 UTC (permalink / raw)
  To: LinuxKernel; +Cc: linux-kernel

On Thursday July 12, LinuxKernel@jamesplace.net wrote:
> 
> []# mdadm --create --verbose /dev/md0 --assume-clean --level=raid5 --raid-devices=4 --spare-devices=0  /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> 
snip
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       17        0      active sync   /dev/sdb1
>        1       8       33        1      active sync   /dev/sdc1
>        2       8        1        2      active sync   /dev/sda1
>        3       8       49        3      active sync   /dev/sdd1

Something looks very wrong here.  You listed the devices to --create
in one order:
   a b c d
but they appear in the array in a different order
   b c a d

Did you cut/paste the command line into the mail, or did you retype
it?  If you retyped it, could you have got it wrong?

You need the order that --detail shows to match the order of the
original array....

NeilBrown


* Re: Problem recovering a failed RAID5 array with 4 drives.
  2007-07-12 22:48 ` Neil Brown
@ 2007-07-12 23:10   ` James
  2007-07-12 23:21     ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: James @ 2007-07-12 23:10 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-kernel

On Thu July 12 2007 5:48 pm, you wrote:
> On Thursday July 12, LinuxKernel@jamesplace.net wrote:
> > 
> > []# mdadm --create --verbose /dev/md0 --assume-clean --level=raid5 --raid-devices=4 --spare-devices=0  /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> > 
> snip
> > 
> >     Number   Major   Minor   RaidDevice State
> >        0       8       17        0      active sync   /dev/sdb1
> >        1       8       33        1      active sync   /dev/sdc1
> >        2       8        1        2      active sync   /dev/sda1
> >        3       8       49        3      active sync   /dev/sdd1
> 
> Something looks very wrong here.  You listed the devices to --create
> in one order:
>    a b c d
> but that appear in the array in a different order
>    b c a d
> 
> Did you cut/paste the command line into the mail, or did you retype
> it?  If you retyped it, could you have got it wrong?
> 
> You need the order that --detail shows to match the order of the
> original array....
> 
> NeilBrown
> 
> 

I don't know the original order of the array before all the problems started. 

Is there a way to determine the original order? 

The order that --detail is showing now is the order that appeared after 
issuing the command as it is in the email (i.e., a b c d).

Thanks again.



* Re: Problem recovering a failed RAID5 array with 4 drives.
  2007-07-12 23:10   ` James
@ 2007-07-12 23:21     ` Neil Brown
  2007-07-13  0:49       ` Problem recovering a failed RAID5 array with 4 drives. --RESOLVED James
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2007-07-12 23:21 UTC (permalink / raw)
  To: LinuxKernel; +Cc: linux-kernel

On Thursday July 12, LinuxKernel@jamesplace.net wrote:
> 
> I don't know the original order of the array before all the problems started. 
> 
> Is there a way to determine the original order? 

No, unless you have some old kernel logs of the last time it assembled
the array properly.
The one thing that "--create" does destroy is the information about
any previous array that the drives were a part of.

> 
> The order that --detail is showing now is the order that appeared after 
> issuing the command as it is in the email (i.e., a b c d).

Odd.  I cannot reproduce it.
I suggest you try different arrangements (of the 3 good drives and the
word 'missing') until you find one that 'fsck -n' likes.
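The brute-force search described here can be scripted. This sketch only prints the candidate --create command for every ordering of the three good members plus 'missing' (24 permutations in all); the flags are the ones from the original report, and each printed command would still need to be run by hand and checked with 'fsck -n'.

```shell
# Recursively emit every permutation of the given arguments, one per line.
perms() {
    if [ $# -eq 1 ]; then
        echo "$1"
        return
    fi
    for x in "$@"; do
        rest=""
        for y in "$@"; do
            [ "$y" = "$x" ] || rest="$rest $y"
        done
        # Both sides of the pipe run in subshells, so $x and $rest in this
        # invocation are not clobbered by the recursive call.
        perms $rest | while read -r tail; do echo "$x $tail"; done
    done
}

# Print (do not run) one candidate --create line per device ordering.
perms /dev/sda1 /dev/sdb1 /dev/sdd1 missing | while read -r order; do
    echo "mdadm --create /dev/md0 --assume-clean --level=raid5 --raid-devices=4 $order"
done
```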

NeilBrown


* Re: Problem recovering a failed RAID5 array with 4 drives. --RESOLVED
  2007-07-12 23:21     ` Neil Brown
@ 2007-07-13  0:49       ` James
  2007-07-16 15:04         ` David Greaves
  0 siblings, 1 reply; 9+ messages in thread
From: James @ 2007-07-13  0:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Neil Brown

> > 
> > I don't know the original order of the array before all the problems started. 
> > 
> > Is there a way to determine the original order? 
> 
> No, unless you have some old kernel logs of the last time it assembled
> the array properly.
> The one thing that "--create" does destroy is the information about
> any previous array that the drives were a part of.
> 
> > 
> > The order that --detail is showing now is the order that appeared after 
> > issuing the command as it is in the email (i.e., a b c d).
> 
> Odd.  I cannot reproduce it.
> I suggest you try different arrangements (of the 3 good drives and the
> word 'missing') until you find one that 'fsck -n' likes.
> 
> NeilBrown
> 
> 

I don't understand how the order of --detail was different from the command 
line on my system, however....

YOU ARE A LIFE SAVER!!!

After going through 21 combinations, beginning to lose all hope and plummeting 
into eternal despair, combo 22 worked. The array is up and working. All the 
data (1.3 TB) is there, and I'm probably the happiest character on the mailing 
list today. 

Thanks a bunch for your help.




* Re: Problem recovering a failed RAID5 array with 4 drives. --RESOLVED
  2007-07-13  0:49       ` Problem recovering a failed RAID5 array with 4 drives. --RESOLVED James
@ 2007-07-16 15:04         ` David Greaves
  0 siblings, 0 replies; 9+ messages in thread
From: David Greaves @ 2007-07-16 15:04 UTC (permalink / raw)
  To: LinuxKernel; +Cc: linux-kernel, Neil Brown

James wrote:
>>> I don't know the original order of the array before all the problems started. 
>>> Is there a way to determine the original order? 
>> No, unless you have some old kernel logs of the last time it assembled
>> the array properly.
>> The one thing that "--create" does destroy is the information about
>> any previous array that the drives were a part of.
>>
>>> The order that --detail is showing now is the order that appeared after 
>>> issuing the command as it is in the email (i.e., a b c d).
>> Odd.  I cannot reproduce it.
>> I suggest you try different arrangements (of the 3 good drives and the
>> word 'missing') until you find one that 'fsck -n' likes.
>>
>> NeilBrown
>>
>>
> 
> I don't understand how the order of --detail was different than the command 
> line on my system, however....
> 
> YOU ARE A LIFE SAVER!!!
> 
> After going through 21 combinations, beginning to lose all hope and plummeting 
> into eternal despair, combo 22 worked. The array is up and working. All the 
> data (1.3 TB) is there, and I'm probably the happiest character on the mailing 
> list today. 
> 
> Thanks a bunch for your help.

Funnily enough, someone else was having a similar problem on the linux-raid list 
at the same time.

Here's a script that may be useful to others in this predicament - a hell of a 
lot quicker than doing it by hand...

The 'is the filesystem safe' test probably wants improving beyond a read-only mount...

http://linux-raid.osdl.org/index.php/RAID_Recovery
http://linux-raid.osdl.org/index.php/Permute_array.pl

David


end of thread, other threads:[~2007-07-16 15:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-12 13:49 Problem recovering a failed RAID5 array with 4 drives James
2007-07-12 16:44 ` Lennart Sorensen
2007-07-12 20:21   ` James
2007-07-12 21:41     ` Phil Turmel
2007-07-12 22:48 ` Neil Brown
2007-07-12 23:10   ` James
2007-07-12 23:21     ` Neil Brown
2007-07-13  0:49       ` Problem recovering a failed RAID5 array with 4 drives. --RESOLVED James
2007-07-16 15:04         ` David Greaves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox