From: NeilBrown
Subject: Re: help please - recovering from failed RAID5 rebuild
Date: Fri, 8 Apr 2011 22:01:30 +1000
Message-ID: <20110408220130.4830852d@notabene.brown>
To: sanktnelson 1
Cc: linux-raid@vger.kernel.org

On Thu, 7 Apr 2011 22:00:58 +0200 sanktnelson 1 wrote:

> Hi all,
> I need some advice on recovering from a catastrophic RAID5 failure
> (and a possible f***-up on my part in trying to recover). I didn't
> find this covered in any of the howtos out there, so here is my
> problem:
> I was trying to rebuild my RAID5 (5 drives) after I had replaced one
> failed drive (sdc) with a fresh one. During the rebuild another drive
> (sdd) failed, so I was left with sdc as a spare (S) and sdd failed
> (F).
> First question: what should I have done at this point? I'm fairly
> certain that the error that failed sdd was a fluke, a loose cable or
> something, so I wanted mdadm to just assume it was clean and retry the
> rebuild.

What I would have done is stop the array (mdadm -S /dev/md0) and re-assemble
it with --force. This would get you the degraded array back.

Then I would back up any data that I really couldn't live without - just in
case.

Then I would stop the array, dd-rescue sdd to some other device, possibly
sdc, and assemble the array with the known-good devices and the new device
(which might have been sdc) and NOT sdd.
This would give me a degraded array of good devices.

Then I would add another good device - maybe sdc if I thought that it was
just a read error and writes would work. Then wait and hope. (The whole
sequence is sketched further down.)

>
> what I actually did was reboot the machine in the hope it would just
> restart the array in the previous degraded state, which of course it
> did not. Instead all drives except the failed sdd were reported as
> spare (S) in /proc/mdstat.
>
> so I tried
>
> root@enterprise:~# mdadm --run /dev/md0
> mdadm: failed to run array /dev/md0: Input/output error
>
> syslog showed at this point:
>
> Apr  7 20:37:49 localhost kernel: [  893.981851] md: kicking non-fresh
> sdd1 from array!
> Apr  7 20:37:49 localhost kernel: [  893.981864] md: unbind<sdd1>
> Apr  7 20:37:49 localhost kernel: [  893.992526] md: export_rdev(sdd1)
> Apr  7 20:37:49 localhost kernel: [  893.995844] raid5: device sdb1
> operational as raid disk 3
> Apr  7 20:37:49 localhost kernel: [  893.995848] raid5: device sdf1
> operational as raid disk 4
> Apr  7 20:37:49 localhost kernel: [  893.995852] raid5: device sde1
> operational as raid disk 2
> Apr  7 20:37:49 localhost kernel: [  893.996353] raid5: allocated 5265kB for md0
> Apr  7 20:37:49 localhost kernel: [  893.996478] 3: w=1 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:37:49 localhost kernel: [  893.996482] 4: w=2 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:37:49 localhost kernel: [  893.996485] 2: w=3 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:37:49 localhost kernel: [  893.996488] raid5: not enough
> operational devices for md0 (2/5 failed)
> Apr  7 20:37:49 localhost kernel: [  893.996514] RAID5 conf printout:
> Apr  7 20:37:49 localhost kernel: [  893.996517]  --- rd:5 wd:3
> Apr  7 20:37:49 localhost kernel: [  893.996520]  disk 2, o:1, dev:sde1
> Apr  7 20:37:49 localhost kernel: [  893.996522]  disk 3, o:1, dev:sdb1
> Apr  7 20:37:49 localhost kernel: [  893.996525]  disk 4, o:1, dev:sdf1
> Apr  7 20:37:49 localhost kernel: [  893.996898] raid5: failed to run
> raid set md0
> Apr  7 20:37:49 localhost kernel: [  893.996901] md: pers->run() failed ...
>
> so I figured I'd re-add sdd:
>
> root@enterprise:~# mdadm --re-add /dev/md0 /dev/sdd1
> mdadm: re-added /dev/sdd1
> root@enterprise:~# mdadm --run /dev/md0
> mdadm: started /dev/md0

This should be effectively equivalent to --assemble --force (I think).
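
To make that concrete, the recovery sequence I described above would look
something like the sketch below. This is only a sketch: the member list
/dev/sd[bdef]1 is inferred from your logs, and the ddrescue map file path
is made up, so check both against your own setup before running anything.

  # stop whatever half-assembled state is left over
  mdadm -S /dev/md0

  # force the degraded array back together from the four members that
  # were in sync (i.e. everything except the partially-rebuilt spare sdc1)
  mdadm --assemble --force /dev/md0 /dev/sd[bdef]1

  # ... back up anything irreplaceable from /dev/md0 here ...
  mdadm -S /dev/md0

  # copy the suspect disk to a known-good one (GNU ddrescue syntax;
  # /root/sdd1.map is a hypothetical map/log file)
  ddrescue /dev/sdd1 /dev/sdc1 /root/sdd1.map

  # assemble from the copy instead of sdd, then add a fresh fifth disk
  # (sdX1 standing in for whatever device you use) and let it rebuild
  mdadm --assemble --force /dev/md0 /dev/sd[bcef]1
  mdadm /dev/md0 --add /dev/sdX1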
>
> Apr  7 20:44:16 localhost kernel: [ 1281.139654] md: bind<sdd1>
> Apr  7 20:44:16 localhost mdadm[1523]: SpareActive event detected on
> md device /dev/md0, component device /dev/sdd1
> Apr  7 20:44:32 localhost kernel: [ 1297.147581] raid5: device sdd1
> operational as raid disk 0
> Apr  7 20:44:32 localhost kernel: [ 1297.147585] raid5: device sdb1
> operational as raid disk 3
> Apr  7 20:44:32 localhost kernel: [ 1297.147588] raid5: device sdf1
> operational as raid disk 4
> Apr  7 20:44:32 localhost kernel: [ 1297.147591] raid5: device sde1
> operational as raid disk 2
> Apr  7 20:44:32 localhost kernel: [ 1297.148102] raid5: allocated 5265kB for md0
> Apr  7 20:44:32 localhost kernel: [ 1297.148704] 0: w=1 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148708] 3: w=2 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148712] 4: w=3 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148715] 2: w=4 pa=0 pr=5 m=1
> a=2 r=5 op1=0 op2=0
> Apr  7 20:44:32 localhost kernel: [ 1297.148718] raid5: raid level 5
> set md0 active with 4 out of 5 devices, algorithm 2
> Apr  7 20:44:32 localhost kernel: [ 1297.148722] RAID5 conf printout:
> Apr  7 20:44:32 localhost kernel: [ 1297.148725]  --- rd:5 wd:4
> Apr  7 20:44:32 localhost kernel: [ 1297.148728]  disk 0, o:1, dev:sdd1
> Apr  7 20:44:32 localhost kernel: [ 1297.148731]  disk 2, o:1, dev:sde1
> Apr  7 20:44:32 localhost kernel: [ 1297.148734]  disk 3, o:1, dev:sdb1
> Apr  7 20:44:32 localhost kernel: [ 1297.148737]  disk 4, o:1, dev:sdf1
> Apr  7 20:44:32 localhost kernel: [ 1297.148779] md0: detected
> capacity change from 0 to 6001196531712
> Apr  7 20:44:32 localhost kernel: [ 1297.149047]  md0:RAID5 conf printout:
> Apr  7 20:44:32 localhost kernel: [ 1297.149559]  --- rd:5 wd:4
> Apr  7 20:44:32 localhost kernel: [ 1297.149562]  disk 0, o:1, dev:sdd1
> Apr  7 20:44:32 localhost kernel: [ 1297.149565]  disk 1, o:1, dev:sdc1
> Apr  7 20:44:32 localhost kernel: [ 1297.149568]  disk 2, o:1, dev:sde1
> Apr  7 20:44:32 localhost kernel: [ 1297.149570]  disk 3, o:1, dev:sdb1
> Apr  7 20:44:32 localhost kernel: [ 1297.149573]  disk 4, o:1, dev:sdf1
> Apr  7 20:44:32 localhost kernel: [ 1297.149846] md: recovery of RAID array md0
> Apr  7 20:44:32 localhost kernel: [ 1297.149849] md: minimum
> _guaranteed_ speed: 1000 KB/sec/disk.
> Apr  7 20:44:32 localhost kernel: [ 1297.149852] md: using maximum
> available idle IO bandwidth (but not more than 200000 KB/sec) for
> recovery.
> Apr  7 20:44:32 localhost kernel: [ 1297.149858] md: using 128k
> window, over a total of 1465135872 blocks.
> Apr  7 20:44:32 localhost kernel: [ 1297.188272]  unknown partition table
> Apr  7 20:44:32 localhost mdadm[1523]: RebuildStarted event detected
> on md device /dev/md0
>
> I figured this was definitely wrong, since I still couldn't mount
> /dev/md0, so I manually failed sdc and sdd to stop any further
> destruction on my part and to go seek expert help, so here I am. Is my
> data still there or have the first few hundred MB been zeroed to
> initialize a fresh array? How do I force mdadm to assume sdd is fresh
> and give me access to the array without any writes happening to it?
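
On the "without any writes" part: if you just want to look at the array
without letting md write to the members, one approach (assuming a kernel
with the md start_ro module parameter, which any recent kernel should
have) is to make arrays come up "auto-read-only" and mount read-only:

  echo 1 > /sys/module/md_mod/parameters/start_ro
  mdadm -S /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sd[bdef]1
  mount -o ro /dev/md0 /mnt

While the array is auto-read-only no resync or recovery will start, though
note that the --force assembly itself still updates the superblocks.
Again, just a sketch - verify the device list against your own system
first.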
> Sorry to come running to the highest authority on linux-raid here, but
> all the howtos out there are pretty thin when it comes to anything
> more complicated than creating an array and recovering from one
> failed drive.

Well, I wouldn't have failed the devices. I would simply have stopped the
array:

  mdadm -S /dev/md0

But I'm very surprised that this didn't work. md and mdadm never write
zeros to initialise anything (except a bitmap).

Maybe the best thing to do at this point is post the output of

  mdadm -E /dev/sd[bcdef]1

and I'll see if I can make sense of it.

NeilBrown

>
> Any advice (even if it is just for doing the right thing the next
> time) is greatly appreciated.
>
> -Felix