* Recovering failed array
From: Alex @ 2011-09-22 18:07 UTC (permalink / raw)
To: linux-raid
Hi,
I have a RAID5 array that has died and I need help recovering it.
Somehow two of the four partitions in the array have failed. The
server was completely dead, and had very little recognizable
information on the console before it was rebooted. I believe they were
kernel messages, but it wasn't a panic.
I'm able to read data from all four disks (using dd) but can't figure
out how to reassemble the array. Here is some information I've obtained
by booting from a rescue CD-ROM, along with the mdadm.conf from backup.
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
205820928 blocks super 1.1
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
255988 blocks super 1.0 [4/4] [UUUU]
# mdadm --add /dev/md1 /dev/sdd2
mdadm: Cannot open /dev/sdd2: Device or resource busy
# mdadm --run /dev/md1
mdadm: failed to run array /dev/md1: Input/output error
I've tried "--assemble --scan" and it also returns an I/O error.
mdadm.conf:
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=4
UUID=9406b71d:8024a882:f17932f6:98d4df18
ARRAY /dev/md1 level=raid5 num-devices=4
UUID=f5bb8db9:85f66b43:32a8282a:fb664152
Any ideas greatly appreciated.
Thanks,
Alex
* Re: Recovering failed array
From: Phil Turmel @ 2011-09-22 21:52 UTC (permalink / raw)
To: Alex; +Cc: linux-raid
Hi Alex,
More information please....
On 09/22/2011 02:07 PM, Alex wrote:
> Hi,
>
> I have a RAID5 array that has died and I need help recovering it.
> Somehow two of the four partitions in the array have failed. The
> server was completely dead, and had very little recognizable
> information on the console before it was rebooted. I believe they were
> kernel messages, but it wasn't a panic.
>
> I'm able to read data from all four disks (using dd) but can't figure
> out how to reassemble the array. Here is some information I've obtained
> by booting from a rescue CD-ROM, along with the mdadm.conf from backup.
>
> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
> md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
> 205820928 blocks super 1.1
>
> md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
> 255988 blocks super 1.0 [4/4] [UUUU]
>
>
> # mdadm --add /dev/md1 /dev/sdd2
> mdadm: Cannot open /dev/sdd2: Device or resource busy
>
> # mdadm --run /dev/md1
> mdadm: failed to run array /dev/md1: Input/output error
>
> I've tried "--assemble --scan" and it also returns an I/O error.
>
> mdadm.conf:
> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md0 level=raid1 num-devices=4
> UUID=9406b71d:8024a882:f17932f6:98d4df18
> ARRAY /dev/md1 level=raid5 num-devices=4
> UUID=f5bb8db9:85f66b43:32a8282a:fb664152
Please show the output of "lsdrv" [1] and then "mdadm -D /dev/md[01]", and also "mdadm -E /dev/sd[abcd][12]"
(From within your rescue environment.) Some errors are likely, but get what you can.
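Something like this from the rescue shell should capture everything in one go
(a rough sketch; the output file names are just examples):

# lsdrv > lsdrv.txt 2>&1
# mdadm -D /dev/md0 /dev/md1 > md-detail.txt 2>&1
# mdadm -E /dev/sd[abcd][12] > md-examine.txt 2>&1
# cat /proc/mdstat >> md-detail.txt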
Phil
[1] http://github.com/pturmel/lsdrv
* Re: Recovering failed array
From: Alex @ 2011-09-22 22:39 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid
Hi,
>> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
>> md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
>> 205820928 blocks super 1.1
>>
>> md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
>> 255988 blocks super 1.0 [4/4] [UUUU]
>>
>>
>> # mdadm --add /dev/md1 /dev/sdd2
>> mdadm: Cannot open /dev/sdd2: Device or resource busy
>>
>> # mdadm --run /dev/md1
>> mdadm: failed to run array /dev/md1: Input/output error
>>
>> I've tried "--assemble --scan" and it also returns an I/O error.
>>
>> mdadm.conf:
>> # mdadm.conf written out by anaconda
>> MAILADDR root
>> AUTO +imsm +1.x -all
>> ARRAY /dev/md0 level=raid1 num-devices=4
>> UUID=9406b71d:8024a882:f17932f6:98d4df18
>> ARRAY /dev/md1 level=raid5 num-devices=4
>> UUID=f5bb8db9:85f66b43:32a8282a:fb664152
>
> Please show the output of "lsdrv" [1] and then "mdadm -D /dev/md[01]", and also "mdadm -E /dev/sd[abcd][12]"
>
> (From within your rescue environment.) Some errors are likely, but get what you can.
Great, thanks for your offer to help. Great program you've written.
I've included the output here:
# mdadm -E /dev/sd[abcd][12]
http://pastebin.com/3JcBjiV6
# When I booted into the rescue CD again, it mounted md0 as md127
http://pastebin.com/yXnzzL6K
# lsdrv output (also included below)
http://pastebin.com/JkpgVNL4
The md1 array appeared as md125 and was inactive, so "mdadm -D" didn't
work on it. Here is how the arrays now appear. Obviously, sdb2 should be
part of md125 rather than sitting in its own array as it does below.
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md125 : inactive sda2[0](S) sdc2[4](S)
137213952 blocks super 1.1
md126 : inactive sdb2[1](S)
68606976 blocks super 1.1
md127 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
255988 blocks super 1.0 [4/4] [UUUU]
unused devices: <none>
lsdrv output:
PCI [pata_amd] 00:07.1 IDE interface: Advanced Micro Devices [AMD]
AMD-8111 IDE (rev 03)
├─scsi 0:0:0:0 MATSHITA DVD-ROM SR-8178 {MATSHITADVD-ROM_SR-8178_}
│ └─sr0: [11:0] (iso9660) 352.23m 'sysrcd-2.3.1'
│ └─Mounted as /dev/sr0 @ /livemnt/boot
└─scsi 1:x:x:x [Empty]
PCI [aic79xx] 02:0a.0 SCSI storage controller: Adaptec AIC-7902 U320 (rev 10)
└─scsi 2:x:x:x [Empty]
PCI [aic79xx] 02:0a.1 SCSI storage controller: Adaptec AIC-7902 U320 (rev 10)
├─scsi 3:0:0:0 FUJITSU MAW3073NC {DAM9PA1001LL}
│ └─sda: [8:0] Partitioned (dos) 68.49g
│ ├─sda1: [8:1] MD raid1 (0/4) 250.00m md127 clean in_sync
'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
│ │ └─md127: [9:127] (ext4) 249.99m {a99a461a-ff72-4bc0-9ccc-095b8a26f5e2}
│ ├─sda2: [8:2] MD raid5 (none/4) 65.43g md125 inactive spare
'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
│ │ └─md125: [9:125] Empty/Unknown 0.00k
│ └─sda3: [8:3] (swap) 2.82g {0c2eeeb1-fc35-4e43-9432-21cb005f8e05}
├─scsi 3:0:1:0 FUJITSU MAW3073NC {DAM9PA1001L5}
│ └─sdb: [8:16] Partitioned (dos) 68.49g
│ ├─sdb1: [8:17] MD raid1 (1/4) 250.00m md127 clean in_sync
'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
│ ├─sdb2: [8:18] MD raid5 (none/4) 65.43g md126 inactive spare
'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
│ │ └─md126: [9:126] Empty/Unknown 0.00k
│ └─sdb3: [8:19] (swap) 2.82g {e36083c8-59a1-437b-8f93-6c624a8b0b90}
├─scsi 3:0:2:0 FUJITSU MAW3073NC {DAM9PA1001LD}
│ └─sdc: [8:32] Partitioned (dos) 68.49g
│ ├─sdc1: [8:33] MD raid1 (2/4) 250.00m md127 clean in_sync
'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
│ ├─sdc2: [8:34] MD raid5 (none/4) 65.43g md125 inactive spare
'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
│ └─sdc3: [8:35] (swap) 2.82g {fe4c0314-7e61-475e-9034-1b90b23e817a}
└─scsi 3:0:3:0 FUJITSU MAW3073NC {DAM9PA1001LA}
└─sdd: [8:48] Partitioned (dos) 68.49g
├─sdd1: [8:49] MD raid1 (3/4) 250.00m md127 clean in_sync
'dbserv.guardiandigital.com:0' {9406b71d-8024-a882-f179-32f698d4df18}
├─sdd2: [8:50] MD raid5 (4) 65.43g inactive
'dbserv.guardiandigital.com:1' {f5bb8db9-85f6-6b43-32a8-282afb664152}
└─sdd3: [8:51] (swap) 2.82g {deaf299e-e988-4581-aba0-d061aa36914a}
Other Block Devices
├─loop0: [7:0] (squashfs) 288.07m
│ └─Mounted as /dev/loop0 @ /livemnt/squashfs
├─ram0: [1:0] Empty/Unknown 16.00m
├─ram1: [1:1] Empty/Unknown 16.00m
├─ram2: [1:2] Empty/Unknown 16.00m
├─ram3: [1:3] Empty/Unknown 16.00m
├─ram4: [1:4] Empty/Unknown 16.00m
├─ram5: [1:5] Empty/Unknown 16.00m
├─ram6: [1:6] Empty/Unknown 16.00m
├─ram7: [1:7] Empty/Unknown 16.00m
├─ram8: [1:8] Empty/Unknown 16.00m
├─ram9: [1:9] Empty/Unknown 16.00m
├─ram10: [1:10] Empty/Unknown 16.00m
├─ram11: [1:11] Empty/Unknown 16.00m
├─ram12: [1:12] Empty/Unknown 16.00m
├─ram13: [1:13] Empty/Unknown 16.00m
├─ram14: [1:14] Empty/Unknown 16.00m
└─ram15: [1:15] Empty/Unknown 16.00m
* Re: Recovering failed array
From: NeilBrown @ 2011-09-23 4:15 UTC (permalink / raw)
To: Alex; +Cc: Phil Turmel, linux-raid
On Thu, 22 Sep 2011 18:39:10 -0400 Alex <mysqlstudent@gmail.com> wrote:
> Hi,
>
> >> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
> >> md1 : inactive sda2[0] sdd2[4](S) sdb2[1]
> >> 205820928 blocks super 1.1
> >>
> >> md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
> >> 255988 blocks super 1.0 [4/4] [UUUU]
> >>
> >>
> >> # mdadm --add /dev/md1 /dev/sdd2
> >> mdadm: Cannot open /dev/sdd2: Device or resource busy
> >>
> >> # mdadm --run /dev/md1
> >> mdadm: failed to run array /dev/md1: Input/output error
> >>
> >> I've tried "--assemble --scan" and it also returns an I/O error.
> >>
> >> mdadm.conf:
> >> # mdadm.conf written out by anaconda
> >> MAILADDR root
> >> AUTO +imsm +1.x -all
> >> ARRAY /dev/md0 level=raid1 num-devices=4
> >> UUID=9406b71d:8024a882:f17932f6:98d4df18
> >> ARRAY /dev/md1 level=raid5 num-devices=4
> >> UUID=f5bb8db9:85f66b43:32a8282a:fb664152
> >
> > Please show the output of "lsdrv" [1] and then "mdadm -D /dev/md[01]", and also "mdadm -E /dev/sd[abcd][12]"
> >
> > (From within your rescue environment.) Some errors are likely, but get what you can.
>
> Great, thanks for your offer to help. Great program you've written.
> I've included the output here:
>
> # mdadm -E /dev/sd[abcd][12]
> http://pastebin.com/3JcBjiV6
>
> # When I booted into the rescue CD again, it mounted md0 as md127
> http://pastebin.com/yXnzzL6K
>
Hmmm ... looks like a bit of a mess. Two devices that should be active
members of the array appear to be spares. I suspect you tried to --add them
when you shouldn't have. Newer versions of mdadm stop you from doing that,
but older versions don't. You only --add a device that you want to be a
spare, not a device that you think is already part of the array.
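If you're ever unsure what a device's superblock still records, check it
before touching anything, e.g. (just an illustration, using sdd2):

# mdadm -E /dev/sdd2 | egrep 'Device Role|Array State'

A device that is still a real member reports a role like "Active device 3";
once it reports "spare", the old slot information has been overwritten, which
is why --assemble can no longer put it back and a re-create is needed.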
All of the devices think that device 2 (the third in the array) should exist
and be working, but no device claims to be it. Presumably it is /dev/sdc2.
You will need to recreate the array.
i.e.
mdadm -S /dev/md1
or
mdadm -S /dev/md125 /dev/md126
or whatever md arrays claim to be holding any of the 4 devices according
to /proc/mdstat.
Then
mdadm -C /dev/md1 -e 1.1 --level 5 -n 4 --chunk 512 --assume-clean \
/dev/sda2 /dev/sdb2 /dev/sdc2 missing
This will just re-write the metadata and assemble the array. It won't change
the data.
Then "fsck -n /dev/md1" and make sure it looks good.
If it does: good.
If not, try again with sdd2 in place of sdc2.
Once you are happy that you can see your data, you can add the other device
as a spare and it will rebuild.
You don't really need the --assume-clean above because a degraded RAID5 is
always assumed to be clean, but it is good practice to use --assume-clean
whenever re-creating an array which has real data on it.
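To put the whole sequence in one place (a sketch only; substitute whatever
array names /proc/mdstat actually shows before running anything):

# mdadm -S /dev/md125 /dev/md126
# mdadm -C /dev/md1 -e 1.1 --level 5 -n 4 --chunk 512 --assume-clean \
      /dev/sda2 /dev/sdb2 /dev/sdc2 missing
# fsck -n /dev/md1
# mdadm /dev/md1 --add /dev/sdd2

The last step is only for after the fsck output convinces you the data is
intact; it kicks off the rebuild onto the remaining device.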
Good luck,
NeilBrown
* Re: Recovering failed array
From: Alex @ 2011-09-24 22:39 UTC (permalink / raw)
To: NeilBrown; +Cc: Phil Turmel, linux-raid
Hi guys,
>> Great, thanks for your offer to help. Great program you've written.
>> I've included the output here:
>>
>> # mdadm -E /dev/sd[abcd][12]
>> http://pastebin.com/3JcBjiV6
>>
>> # When I booted into the rescue CD again, it mounted md0 as md127
>> http://pastebin.com/yXnzzL6K
>>
>
> Hmmm ... looks like a bit of a mess. Two devices that should be active
> members of the array appear to be spares. I suspect you tried to --add them
> when you shouldn't have. Newer versions of mdadm stop you from doing that,
> but older versions don't. You only --add a device that you want to be a
> spare, not a device that you think is already part of the array.
Yes, that is what I did.
> All of the devices think that device 2 (the third in the array) should exist
> and be working, but no device claims to be it. Presumably it is /dev/sdc2.
>
>
> You will need to recreate the array.
> i.e.
>
> mdadm -S /dev/md1
> or
> mdadm -S /dev/md125 /dev/md126
>
> or whatever md arrays claim to be holding any of the 4 devices according
> to /proc/mdstat.
>
> Then
>
> mdadm -C /dev/md1 -e 1.1 --level 5 -n 4 --chunk 512 --assume-clean \
> /dev/sda2 /dev/sdb2 /dev/sdc2 missing
>
> This will just re-write the metadata and assemble the array. It won't change
> the data.
> Then "fsck -n /dev/md1" and make sure it looks good.
> If it does: good.
> If not, try again with sdd2 in place of sdc2.
>
> Once you are happy that you can see your data, you can add the other device
> as a spare and it will rebuild.
Just wanted to let you know that these instructions worked perfectly,
and thanks for all your help.
Any idea how this would have happened in the first place? Somehow two of
the four RAID5 devices failed at once. I've checked all the disks with the
Fujitsu disk scanner, and it found no physical errors. I also don't think
the disks were disturbed in any way while they were operating.
Thanks again,
Alex