* help please - recovering from failed RAID5 rebuild
@ 2011-04-07 20:00 sanktnelson 1
  2011-04-08 12:01 ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: sanktnelson 1 @ 2011-04-07 20:00 UTC (permalink / raw)
  To: linux-raid

Hi all,
I need some advice on recovering from a catastrophic RAID5 failure
(and a possible f***-up on my part in trying to recover). I didn't
find this covered in any of the howtos out there, so here is my
problem:
I was trying to rebuild my RAID5 (5 drives) after I had replaced one
failed drive (sdc) with a fresh one. During the rebuild another drive
(sdd) failed, so I was left with sdc as a spare (S) and sdd marked
failed (F).
First question: what should I have done at this point? I'm fairly
certain that the error that failed sdd was a fluke, a loose cable or
something, so I wanted mdadm to just assume it was clean and retry the
rebuild.

What I actually did was reboot the machine in the hope that it would just
restart the array in its previous degraded state, which of course it
did not. Instead, all drives except the failed sdd were reported as
spare (S) in /proc/mdstat.

So I tried:

root@enterprise:~# mdadm --run /dev/md0
mdadm: failed to run array /dev/md0: Input/output error

syslog showed at this point:

Apr  7 20:37:49 localhost kernel: [  893.981851] md: kicking non-fresh
sdd1 from array!
Apr  7 20:37:49 localhost kernel: [  893.981864] md: unbind<sdd1>
Apr  7 20:37:49 localhost kernel: [  893.992526] md: export_rdev(sdd1)
Apr  7 20:37:49 localhost kernel: [  893.995844] raid5: device sdb1
operational as raid disk 3
Apr  7 20:37:49 localhost kernel: [  893.995848] raid5: device sdf1
operational as raid disk 4
Apr  7 20:37:49 localhost kernel: [  893.995852] raid5: device sde1
operational as raid disk 2
Apr  7 20:37:49 localhost kernel: [  893.996353] raid5: allocated 5265kB for md0
Apr  7 20:37:49 localhost kernel: [  893.996478] 3: w=1 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:37:49 localhost kernel: [  893.996482] 4: w=2 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:37:49 localhost kernel: [  893.996485] 2: w=3 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:37:49 localhost kernel: [  893.996488] raid5: not enough
operational devices for md0 (2/5 failed)
Apr  7 20:37:49 localhost kernel: [  893.996514] RAID5 conf printout:
Apr  7 20:37:49 localhost kernel: [  893.996517]  --- rd:5 wd:3
Apr  7 20:37:49 localhost kernel: [  893.996520]  disk 2, o:1, dev:sde1
Apr  7 20:37:49 localhost kernel: [  893.996522]  disk 3, o:1, dev:sdb1
Apr  7 20:37:49 localhost kernel: [  893.996525]  disk 4, o:1, dev:sdf1
Apr  7 20:37:49 localhost kernel: [  893.996898] raid5: failed to run
raid set md0
Apr  7 20:37:49 localhost kernel: [  893.996901] md: pers->run() failed ...

So I figured I'd re-add sdd:

root@enterprise:~# mdadm --re-add /dev/md0 /dev/sdd1
mdadm: re-added /dev/sdd1
root@enterprise:~# mdadm --run /dev/md0
mdadm: started /dev/md0

Apr  7 20:44:16 localhost kernel: [ 1281.139654] md: bind<sdd1>
Apr  7 20:44:16 localhost mdadm[1523]: SpareActive event detected on
md device /dev/md0, component device /dev/sdd1
Apr  7 20:44:32 localhost kernel: [ 1297.147581] raid5: device sdd1
operational as raid disk 0
Apr  7 20:44:32 localhost kernel: [ 1297.147585] raid5: device sdb1
operational as raid disk 3
Apr  7 20:44:32 localhost kernel: [ 1297.147588] raid5: device sdf1
operational as raid disk 4
Apr  7 20:44:32 localhost kernel: [ 1297.147591] raid5: device sde1
operational as raid disk 2
Apr  7 20:44:32 localhost kernel: [ 1297.148102] raid5: allocated 5265kB for md0
Apr  7 20:44:32 localhost kernel: [ 1297.148704] 0: w=1 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148708] 3: w=2 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148712] 4: w=3 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148715] 2: w=4 pa=0 pr=5 m=1
a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148718] raid5: raid level 5
set md0 active with 4 out of 5 devices, algorithm 2
Apr  7 20:44:32 localhost kernel: [ 1297.148722] RAID5 conf printout:
Apr  7 20:44:32 localhost kernel: [ 1297.148725]  --- rd:5 wd:4
Apr  7 20:44:32 localhost kernel: [ 1297.148728]  disk 0, o:1, dev:sdd1
Apr  7 20:44:32 localhost kernel: [ 1297.148731]  disk 2, o:1, dev:sde1
Apr  7 20:44:32 localhost kernel: [ 1297.148734]  disk 3, o:1, dev:sdb1
Apr  7 20:44:32 localhost kernel: [ 1297.148737]  disk 4, o:1, dev:sdf1
Apr  7 20:44:32 localhost kernel: [ 1297.148779] md0: detected
capacity change from 0 to 6001196531712
Apr  7 20:44:32 localhost kernel: [ 1297.149047]  md0:RAID5 conf printout:
Apr  7 20:44:32 localhost kernel: [ 1297.149559]  --- rd:5 wd:4
Apr  7 20:44:32 localhost kernel: [ 1297.149562]  disk 0, o:1, dev:sdd1
Apr  7 20:44:32 localhost kernel: [ 1297.149565]  disk 1, o:1, dev:sdc1
Apr  7 20:44:32 localhost kernel: [ 1297.149568]  disk 2, o:1, dev:sde1
Apr  7 20:44:32 localhost kernel: [ 1297.149570]  disk 3, o:1, dev:sdb1
Apr  7 20:44:32 localhost kernel: [ 1297.149573]  disk 4, o:1, dev:sdf1
Apr  7 20:44:32 localhost kernel: [ 1297.149846] md: recovery of RAID array md0
Apr  7 20:44:32 localhost kernel: [ 1297.149849] md: minimum
_guaranteed_  speed: 1000 KB/sec/disk.
Apr  7 20:44:32 localhost kernel: [ 1297.149852] md: using maximum
available idle IO bandwidth (but not more than 200000 KB/sec) for
recovery.
Apr  7 20:44:32 localhost kernel: [ 1297.149858] md: using 128k
window, over a total of 1465135872 blocks.
Apr  7 20:44:32 localhost kernel: [ 1297.188272]  unknown partition table
Apr  7 20:44:32 localhost mdadm[1523]: RebuildStarted event detected
on md device /dev/md0

I figured this was definitely wrong, since I still couldn't mount
/dev/md0, so I manually failed sdc and sdd to stop any further
destruction on my part and to seek expert help, so here I am. Is my
data still there, or have the first few hundred MB been zeroed to
initialize a fresh array? How do I force mdadm to assume sdd is fresh
and give me access to the array without any writes happening to it?
Sorry to come running to the highest authority on Linux RAID here, but
all the howtos out there are pretty thin when it comes to anything
more complicated than creating an array and recovering from one
failed drive.

Any advice (even if it is just about doing the right thing next
time) is greatly appreciated.

-Felix

* Re: help please - recovering from failed RAID5 rebuild
  2011-04-07 20:00 help please - recovering from failed RAID5 rebuild sanktnelson 1
@ 2011-04-08 12:01 ` NeilBrown
  2011-04-09  9:27   ` sanktnelson 1
  0 siblings, 1 reply; 5+ messages in thread
From: NeilBrown @ 2011-04-08 12:01 UTC (permalink / raw)
  To: sanktnelson 1; +Cc: linux-raid

On Thu, 7 Apr 2011 22:00:58 +0200 sanktnelson 1 <sanktnelson@googlemail.com>
wrote:

> Hi all,
> I need some advice on recovering from a catastrophic RAID5 failure
> (and a possible f***-up on my part in trying to recover). I didn't
> find this covered in any of the howtos out there, so here is my
> problem:
> I was trying to rebuild my RAID5 (5 drives) after I had replaced one
> failed drive (sdc) with a fresh one. During the rebuild another drive
> (sdd) failed, so I was left with sdc as a spare (S) and sdd marked
> failed (F).
> First question: what should I have done at this point? I'm fairly
> certain that the error that failed sdd was a fluke, a loose cable or
> something, so I wanted mdadm to just assume it was clean and retry the
> rebuild.


What I would have done is stop the array (mdadm -S /dev/md0) and re-assemble
it with --force.  This would get you the degraded array back.
Then I would back up any data that I really couldn't live without - just in
case.

Then I would stop the array, dd-rescue sdd to some other device, possibly
sdc, and assemble the array with the known-good devices and the new device
(which might have been sdc) and NOT sdd.
This would give me a degraded array of good devices.
Then I would add another good device - maybe sdc, if I thought that it was
just a read error and writes would work.
Then wait and hope.
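
In command form, that first step would look something like this (exact
device list depending on which drives you trust; the half-rebuilt spare
sdc is left out):

  mdadm -S /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sd[bdef]1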

> 
> What I actually did was reboot the machine in the hope that it would just
> restart the array in its previous degraded state, which of course it
> did not. Instead, all drives except the failed sdd were reported as
> spare (S) in /proc/mdstat.
> 
> So I tried:
> 
> root@enterprise:~# mdadm --run /dev/md0
> mdadm: failed to run array /dev/md0: Input/output error
> 
> syslog showed at this point:
> 
> Apr  7 20:37:49 localhost kernel: [  893.981851] md: kicking non-fresh
> sdd1 from array!
> Apr  7 20:37:49 localhost kernel: [  893.981864] md: unbind<sdd1>
> Apr  7 20:37:49 localhost kernel: [  893.992526] md: export_rdev(sdd1)
> Apr  7 20:37:49 localhost kernel: [  893.995844] raid5: device sdb1
> operational as raid disk 3
> Apr  7 20:37:49 localhost kernel: [  893.995848] raid5: device sdf1
> operational as raid disk 4
> Apr  7 20:37:49 localhost kernel: [  893.995852] raid5: device sde1
> operational as raid disk 2
> Apr  7 20:37:49 localhost kernel: [  893.996353] raid5: allocated 5265kB for md0
> Apr  7 20:37:49 localhost kernel: [  893.996478] 3: w=1 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:37:49 localhost kernel: [  893.996482] 4: w=2 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:37:49 localhost kernel: [  893.996485] 2: w=3 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:37:49 localhost kernel: [  893.996488] raid5: not enough
> operational devices for md0 (2/5 failed)
> Apr  7 20:37:49 localhost kernel: [  893.996514] RAID5 conf printout:
> Apr  7 20:37:49 localhost kernel: [  893.996517]  --- rd:5 wd:3
> Apr  7 20:37:49 localhost kernel: [  893.996520]  disk 2, o:1, dev:sde1
> Apr  7 20:37:49 localhost kernel: [  893.996522]  disk 3, o:1, dev:sdb1
> Apr  7 20:37:49 localhost kernel: [  893.996525]  disk 4, o:1, dev:sdf1
> Apr  7 20:37:49 localhost kernel: [  893.996898] raid5: failed to run
> raid set md0
> Apr  7 20:37:49 localhost kernel: [  893.996901] md: pers->run() failed ...
> 
> So I figured I'd re-add sdd:
> 
> root@enterprise:~# mdadm --re-add /dev/md0 /dev/sdd1
> mdadm: re-added /dev/sdd1
> root@enterprise:~# mdadm --run /dev/md0
> mdadm: started /dev/md0

This should be effectively equivalent to --assemble --force (I think).
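
As a general sanity check before forcing an assembly, the event counts
and update times recorded on each member can be compared first, for
example:

  mdadm -E /dev/sd[bdef]1 | grep -E 'Update Time|Events'

Devices whose event counts are close together are normally safe to
force back in; one that is far behind has genuinely stale data.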


> 
> Apr  7 20:44:16 localhost kernel: [ 1281.139654] md: bind<sdd1>
> Apr  7 20:44:16 localhost mdadm[1523]: SpareActive event detected on
> md device /dev/md0, component device /dev/sdd1
> Apr  7 20:44:32 localhost kernel: [ 1297.147581] raid5: device sdd1
> operational as raid disk 0
> Apr  7 20:44:32 localhost kernel: [ 1297.147585] raid5: device sdb1
> operational as raid disk 3
> Apr  7 20:44:32 localhost kernel: [ 1297.147588] raid5: device sdf1
> operational as raid disk 4
> Apr  7 20:44:32 localhost kernel: [ 1297.147591] raid5: device sde1
> operational as raid disk 2
> Apr  7 20:44:32 localhost kernel: [ 1297.148102] raid5: allocated 5265kB for md0
> Apr  7 20:44:32 localhost kernel: [ 1297.148704] 0: w=1 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148708] 3: w=2 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148712] 4: w=3 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148715] 2: w=4 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148718] raid5: raid level 5
> set md0 active with 4 out of 5 devices, algorithm 2
> Apr  7 20:44:32 localhost kernel: [ 1297.148722] RAID5 conf printout:
> Apr  7 20:44:32 localhost kernel: [ 1297.148725]  --- rd:5 wd:4
> Apr  7 20:44:32 localhost kernel: [ 1297.148728]  disk 0, o:1, dev:sdd1
> Apr  7 20:44:32 localhost kernel: [ 1297.148731]  disk 2, o:1, dev:sde1
> Apr  7 20:44:32 localhost kernel: [ 1297.148734]  disk 3, o:1, dev:sdb1
> Apr  7 20:44:32 localhost kernel: [ 1297.148737]  disk 4, o:1, dev:sdf1
> Apr  7 20:44:32 localhost kernel: [ 1297.148779] md0: detected
> capacity change from 0 to 6001196531712
> Apr  7 20:44:32 localhost kernel: [ 1297.149047]  md0:RAID5 conf printout:
> Apr  7 20:44:32 localhost kernel: [ 1297.149559]  --- rd:5 wd:4
> Apr  7 20:44:32 localhost kernel: [ 1297.149562]  disk 0, o:1, dev:sdd1
> Apr  7 20:44:32 localhost kernel: [ 1297.149565]  disk 1, o:1, dev:sdc1
> Apr  7 20:44:32 localhost kernel: [ 1297.149568]  disk 2, o:1, dev:sde1
> Apr  7 20:44:32 localhost kernel: [ 1297.149570]  disk 3, o:1, dev:sdb1
> Apr  7 20:44:32 localhost kernel: [ 1297.149573]  disk 4, o:1, dev:sdf1
> Apr  7 20:44:32 localhost kernel: [ 1297.149846] md: recovery of RAID array md0
> Apr  7 20:44:32 localhost kernel: [ 1297.149849] md: minimum
> _guaranteed_  speed: 1000 KB/sec/disk.
> Apr  7 20:44:32 localhost kernel: [ 1297.149852] md: using maximum
> available idle IO bandwidth (but not more than 200000 KB/sec) for
> recovery.
> Apr  7 20:44:32 localhost kernel: [ 1297.149858] md: using 128k
> window, over a total of 1465135872 blocks.
> Apr  7 20:44:32 localhost kernel: [ 1297.188272]  unknown partition table
> Apr  7 20:44:32 localhost mdadm[1523]: RebuildStarted event detected
> on md device /dev/md0
> 
> I figured this was definitely wrong, since I still couldn't mount
> /dev/md0, so I manually failed sdc and sdd to stop any further
> destruction on my part and to seek expert help, so here I am. Is my
> data still there, or have the first few hundred MB been zeroed to
> initialize a fresh array? How do I force mdadm to assume sdd is fresh
> and give me access to the array without any writes happening to it?
> Sorry to come running to the highest authority on Linux RAID here, but
> all the howtos out there are pretty thin when it comes to anything
> more complicated than creating an array and recovering from one
> failed drive.

Well, I wouldn't have failed the devices.  I would simply have stopped the
array:
  mdadm -S /dev/md0

But I'm very surprised that this didn't work.
md and mdadm never write zeros to initialise anything (except a bitmap).

Maybe the best thing to do at this point is post the output of
  mdadm -E /dev/sd[bcdef]1
and I'll see if I can make sense of it.

NeilBrown


> 
> Any advice (even if it is just about doing the right thing next
> time) is greatly appreciated.
> 
> -Felix


* Re: help please - recovering from failed RAID5 rebuild
  2011-04-08 12:01 ` NeilBrown
@ 2011-04-09  9:27   ` sanktnelson 1
  2011-04-09 11:29     ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: sanktnelson 1 @ 2011-04-09  9:27 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

2011/4/8 NeilBrown <neilb@suse.de>:

>
> Maybe the best thing to do at this point is post the output of
>  mdadm -E /dev/sd[bcdef]1
> and I'll see if I can make sense of it.

Thanks for your advice! Here is the output:


/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 07ca9dc0:8d91f663:fe51d0ea:fe5e38c2
  Creation Time : Mon Jun  8 20:38:29 2009
     Raid Level : raid5
  Used Dev Size : 1465135872 (1397.26 GiB 1500.30 GB)
     Array Size : 5860543488 (5589.05 GiB 6001.20 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Apr  7 20:49:01 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 2a930ff9 - correct
         Events : 1885746

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     3       8       17        3      active sync   /dev/sdb1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       65        2      active sync   /dev/sde1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       81        4      active sync   /dev/sdf1
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 07ca9dc0:8d91f663:fe51d0ea:fe5e38c2
  Creation Time : Mon Jun  8 20:38:29 2009
     Raid Level : raid5
  Used Dev Size : 1465135872 (1397.26 GiB 1500.30 GB)
     Array Size : 5860543488 (5589.05 GiB 6001.20 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Apr  7 20:48:31 2011
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 1
       Checksum : 2a930fe8 - correct
         Events : 1885744

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     6       8       33        6      spare   /dev/sdc1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       65        2      active sync   /dev/sde1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       49        5      faulty   /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 07ca9dc0:8d91f663:fe51d0ea:fe5e38c2
  Creation Time : Mon Jun  8 20:38:29 2009
     Raid Level : raid5
  Used Dev Size : 1465135872 (1397.26 GiB 1500.30 GB)
     Array Size : 5860543488 (5589.05 GiB 6001.20 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Apr  7 20:45:10 2011
          State : clean
 Active Devices : 4
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 1
       Checksum : 2a930f0c - correct
         Events : 1885736

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     0       8       49        0      active sync   /dev/sdd1

   0     0       8       49        0      active sync   /dev/sdd1
   1     1       0        0        1      faulty removed
   2     2       8       65        2      active sync   /dev/sde1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       33        5      spare   /dev/sdc1
/dev/sde1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 07ca9dc0:8d91f663:fe51d0ea:fe5e38c2
  Creation Time : Mon Jun  8 20:38:29 2009
     Raid Level : raid5
  Used Dev Size : 1465135872 (1397.26 GiB 1500.30 GB)
     Array Size : 5860543488 (5589.05 GiB 6001.20 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Apr  7 20:49:01 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 2a931027 - correct
         Events : 1885746

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     2       8       65        2      active sync   /dev/sde1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       65        2      active sync   /dev/sde1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       81        4      active sync   /dev/sdf1
/dev/sdf1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 07ca9dc0:8d91f663:fe51d0ea:fe5e38c2
  Creation Time : Mon Jun  8 20:38:29 2009
     Raid Level : raid5
  Used Dev Size : 1465135872 (1397.26 GiB 1500.30 GB)
     Array Size : 5860543488 (5589.05 GiB 6001.20 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Apr  7 20:49:01 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 2a93103b - correct
         Events : 1885746

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     4       8       81        4      active sync   /dev/sdf1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       65        2      active sync   /dev/sde1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       81        4      active sync   /dev/sdf1


~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : inactive sdf1[4](S) sdd1[0](S) sde1[2](S) sdb1[3](S) sdc1[6](S)
      7325679680 blocks

unused devices: <none>

* Re: help please - recovering from failed RAID5 rebuild
  2011-04-09  9:27   ` sanktnelson 1
@ 2011-04-09 11:29     ` NeilBrown
  2011-04-13 16:45       ` sanktnelson 1
  0 siblings, 1 reply; 5+ messages in thread
From: NeilBrown @ 2011-04-09 11:29 UTC (permalink / raw)
  To: sanktnelson 1; +Cc: linux-raid

On Sat, 9 Apr 2011 11:27:20 +0200 sanktnelson 1 <sanktnelson@googlemail.com>
wrote:

> 2011/4/8 NeilBrown <neilb@suse.de>:
>  
> >
> > Maybe the best thing to do at this point is post the output of
> >  mdadm -E /dev/sd[bcdef]1
> > and I'll see if I can make sense of it.
> 
> Thanks for your advice! here is the output:

It looks like it is in reasonably good shape (relatively speaking...).

I would
  mdadm -S /dev/md0
  mdadm -Af /dev/md0 /dev/sd[bdef]1
  fsck -nf /dev/md0

and if it looks good
  mount /dev/md0
  copy off any data that you cannot live without.

Then stop the array
  mdadm -S /dev/md0
and copy sdd to sdc with dd-rescue
When that completes, assemble the array with
  mdadm -A /dev/md0 /dev/sd[bcef]1
  fsck -nf /dev/md0

and then if it looks good
  mdadm --zero-superblock /dev/sdd1
  mdadm /dev/md0 --add /dev/sdd1
and with luck it will successfully rebuild sdd1 and you will have redundancy
again.
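
While that last rebuild runs, its progress can be followed with, for
example:

  watch cat /proc/mdstat
  mdadm --detail /dev/md0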

NeilBrown


* Re: help please - recovering from failed RAID5 rebuild
  2011-04-09 11:29     ` NeilBrown
@ 2011-04-13 16:45       ` sanktnelson 1
  0 siblings, 0 replies; 5+ messages in thread
From: sanktnelson 1 @ 2011-04-13 16:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

Thanks a lot: the array is rebuilt and redundant again, apparently
with no loss of data (at least all the fscks ran without complaints).

It took me a while to get to this point, since dd_rescue randomly
failed after an hour or two, so I didn't get to try more than once per
night. That seems to be fixed now with a new SATA cable.

I just wanted to quickly write down here what the original cause of
the problem was, in case someone ever googles this conversation:

after

mdadm -Af /dev/md0 /dev/sd[bdef]1
fsck -nf /dev/md0

I would get the message that /dev/md0 was busy, even though according
to mount it was not mounted anywhere. Trying to mount it would report
it as already mounted, but the data was not visible at its usual
mountpoint. This was indeed the same symptom I had had in the first
place after trying with --re-add and --run.

It turns out Ubuntu does something odd with mounts that are listed in
/etc/fstab but are not available at boot time. I did not work out
exactly what it does, but as a result the device is "in use" even
though the data is not visible at the mount point. After I temporarily
removed /dev/md0 from /etc/fstab and rebooted, I was able to follow
your advice to the letter without any more problems, and I strongly
suspect my original approach would have worked as well.
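
For anyone else who hits the same "device is busy" symptom: some generic
ways to see what is actually holding a block device (nothing specific to
this Ubuntu quirk) are, for example:

  grep md0 /proc/mounts
  lsof /dev/md0
  fuser -vm /dev/md0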

Again, thank you so much for your help!

Cheers,
Felix
