From: NeilBrown
Subject: Re: Rebuilding a RAID5 array after drive (hardware) failure
Date: Mon, 26 May 2014 10:46:40 +1000
Message-ID: <20140526104640.0895483c@notabene.brown>
References: <20140522144905.50511c1a@notabene.brown>
To: George Duffield
Cc: linux-raid@vger.kernel.org

On Fri, 23 May 2014 20:38:22 +0200 George Duffield wrote:

> If it's of use in diagnosing:
>
> cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10]
> md127 : active (auto-read-only) raid5 sdc1[2] sdb1[1] sda1[0] sdd1[4]
>       8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

That is why they were "busy" :-)

> So it looks to me I have an array /dev/md127 and it is healthy:
>
> $ sudo mdadm --detail /dev/md127
>
> /dev/md127:
>         Version : 1.2
>   Creation Time : Sun Feb  2 21:40:15 2014
>      Raid Level : raid5
>      Array Size : 8790400512 (8383.18 GiB 9001.37 GB)
>   Used Dev Size : 2930133504 (2794.39 GiB 3000.46 GB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
>
>     Update Time : Fri May 23 00:06:34 2014
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>            Name : fileserver:0
>            UUID : 8389cd99:a86f705a:15c33960:9f1d7cbe
>          Events : 210
>
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        1       8       17        1      active sync   /dev/sdb1
>        2       8       33        2      active sync   /dev/sdc1
>        4       8       49        3      active sync   /dev/sdd1
>
> Some questions:
> - How did md127 come into existence?

When mdadm tried to assemble the array it found no hard evidence to
suggest the use of "/dev/md0" or similar, so it chose a high unused
number: 127.
If the hostname had been "fileserver", mdadm would have realised that
this array was array "0" on this machine, and would have used
"/dev/md0". (See the "Name" field.)

> - How do I get it out of active (auto-read-only) state so I can use it?

Just use it. On the first write the read-only status will
automatically disappear.

> - Can it be renamed to md0?

Sure.
If this is being assembled by the initrd, then either the initrd must
set the hostname to "fileserver" before mdadm gets to assemble the
array, or the initrd must contain an /etc/mdadm.conf which lists
"/dev/md0" as having the UUID of this array.
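For example (a sketch only, built from the UUID and Name shown in your
--detail output above), a line like:

   ARRAY /dev/md0 metadata=1.2 name=fileserver:0 UUID=8389cd99:a86f705a:15c33960:9f1d7cbe

On Ubuntu that file normally lives at /etc/mdadm/mdadm.conf, and you
would regenerate the initramfs afterwards (e.g. "sudo update-initramfs
-u") so the initrd actually picks it up.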
NeilBrown

> On Fri, May 23, 2014 at 8:29 PM, George Duffield wrote:
> > Thanks for clarifying my questions. Seeing as the flash drive has
> > indeed failed (Murphy at his proverbial best), I have to change my
> > approach by creating a fresh install of Ubuntu Server, then integrating
> > the array into the new install. On top of that, the drive that was
> > marked faulty is actually up and running again (in the new machine -
> > I've no idea why/how), but all drives passed the POST sequence in the
> > Microserver and have since been successfully moved to the new machine.
> > I ran a fresh install of Ubuntu Server last night and installed
> > mdadm. On rebooting, the array was automatically seen and reported by
> > mdadm as clean. I did not attempt to mount the array. Somehow the
> > flash disk with the new OS was corrupted on a reboot (/ could not be
> > mounted), so I shut down the box using shutdown -h now.
> >
> > Tonight I've reinstalled Ubuntu Server on the flash drive, added mdadm
> > and rebooted without the RAID drives powered up. After completing the
> > config of the server OS (nfs, samba etc.) I shut down again, added the
> > drives and rebooted.
> >
> > Running lsblk returns the following, showing all of the drives from
> > the array accounted for:
> >
> > $ lsblk
> > NAME        MAJ:MIN RM  SIZE  RO TYPE  MOUNTPOINT
> > sda           8:0    0  2.7T   0 disk
> > └─sda1        8:1    0  2.7T   0 part
> >   └─md127     9:127  0  8.2T   0 raid5
> > sdb           8:16   0  2.7T   0 disk
> > └─sdb1        8:17   0  2.7T   0 part
> >   └─md127     9:127  0  8.2T   0 raid5
> > sdc           8:32   0  2.7T   0 disk
> > └─sdc1        8:33   0  2.7T   0 part
> >   └─md127     9:127  0  8.2T   0 raid5
> > sdd           8:48   0  2.7T   0 disk
> > └─sdd1        8:49   0  2.7T   0 part
> >   └─md127     9:127  0  8.2T   0 raid5
> > sde           8:64   1 14.5G   0 disk
> > └─sde1        8:65   1 14.4G   0 part  /
> >
> > I then tried to assemble the array as follows:
> >
> > $ sudo mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> > mdadm: /dev/sda1 is busy - skipping
> > mdadm: /dev/sdb1 is busy - skipping
> > mdadm: /dev/sdc1 is busy - skipping
> > mdadm: /dev/sdd1 is busy - skipping
> >
> > No idea why the drives are reported as being busy - they're not
> > mounted nor referenced in /etc/fstab.
> >
> > What is required in order to reassemble the array?
> >
> > Thanks again.
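(To make that explicit: the partitions were "busy" because the kernel
had already auto-assembled them as /dev/md127, as your mdstat output
above shows. If you do want the md0 name immediately, a sketch,
assuming nothing on the array is mounted:

   sudo mdadm --stop /dev/md127
   sudo mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

though simply using /dev/md127 works just as well.)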
> >
> > On Thu, May 22, 2014 at 6:49 AM, NeilBrown wrote:
> >> On Thu, 22 May 2014 06:31:58 +0200 George Duffield
> >> wrote:
> >>
> >>> I have a RAID5 array comprised of 4 x 3TB Seagate 7200 RPM SATA II
> >>> drives. The array was created on Ubuntu Server running on an HP
> >>> Microserver N54L using the following command:
> >>>
> >>> sudo mdadm --create --verbose /dev/md0 --raid-devices=4 --level=5
> >>> /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> >>>
> >>> Formatted using:
> >>> mkfs.ext4 -b 4096 -E stride=128,stripe-width=384 /dev/md0
> >>>
> >>> The array is mounted in /etc/fstab by reference to its UUID and is
> >>> now nearly full.
> >>>
> >>> A few days back I turned on the server to access some of the files
> >>> stored on it and found the server was not present on the network.
> >>> Inspecting the actual server (connected kb & monitor) I noticed that
> >>> the machine had not progressed beyond the BIOS POST screen – one of
> >>> the drives had become damaged (the 2nd drive in the same slot of the
> >>> same Microserver to be damaged the same way – the drive spins up
> >>> fine, the machine knows it's there, but can't communicate
> >>> successfully with the drive). In any event, suffice it to say the
> >>> drive is history – it and the Microserver will be RMA'd when this is
> >>> over.
> >>>
> >>> So, I'm now left with a degraded array comprising 3 x 3TB drives.
> >>> I've purchased a replacement drive (same make and model) in the
> >>> interim (and I've yet to boot this machine with the old drive removed
> >>> or the new one inserted, i.e. from an OS standpoint Ubuntu/mdadm does
> >>> not yet know the array is degraded).
> >>>
> >>> As I've lost complete faith in the Microserver (and it may very well
> >>> damage the new drive during recovery of the array), I've also
> >>> purchased and assembled a 2nd machine with 6 on-board SATA ports
> >>> rather than rely on another Microserver. My intention is to remove
> >>> the drives from the Microserver and install them in the new machine
> >>> (which I'll boot off the same USB flash drive I used to boot the
> >>> Microserver from [to further complicate things, it seems my flash
> >>> drive may also be corrupted, so I may have to recover from a fresh
> >>> Ubuntu install and reassemble the array]).
> >>>
> >>> A few questions if I may:
> >>> - Is moving the array to another computer and recovering it on the
> >>> new computer running Ubuntu Server likely to present any particular
> >>> challenges?
> >>
> >> No. If you were trying to boot off the array that you moved it might
> >> be interesting. But as you aren't, I cannot see any possible issue
> >> (assuming the hardware functions correctly).
> >>
> >>> - Does the order/sequence of connection of the drives to the
> >>> motherboard matter?
> >>
> >> No.
> >>
> >>> Another way of asking the aforementioned question is whether mdadm
> >>> would care if one swapped drives in the Microserver backplane/PC SATA
> >>> ports such that the physical backplane slot/SATA port that one or
> >>> more of the drives occupies differs from the one it occupied when the
> >>> array was created.
> >>
> >> No. mdadm looks at the content of the devices, not their location.
> >>
> >>> - How would I best approach rebuilding the array? My current thinking
> >>> is as follows:
> >>> = Identify with certainty which drive has failed. This will be done
> >>> by removing the OS flash drive from the Microserver, disconnecting
> >>> all drives from the backplane other than the one I believe is faulty
> >>> (first slot on the backplane) and booting the machine. The failed
> >>> drive causes a POST failure and is thus easily identified.
> >>> = Remove all drives from the Microserver and install them into the
> >>> new PC referenced above, at the same time replacing the failed drive
> >>> with the replacement I purchased.
> >>> = Power the new PC via a UPS.
> >>> = Boot the PC from the flash drive.
> >>> = Allow the degraded array to be assembled by mdadm when prompted at
> >>> boot.
> >>> = Add the replacement drive to the array and allow the array to be
> >>> re-synchronized.
> >>> = If I'm not able to access the flash drive, I will create a fresh
> >>> install of Ubuntu Server and attempt to recreate the array in the
> >>> fresh install.
> >>>
> >>> All thoughts/comments/guidance much appreciated.
> >>
> >> Sounds good.
> >> Though I would discourage the boot sequence from assembling the
> >> degraded array if possible.
> >> Just get the machine up with the drives untouched. Then use "mdadm -E"
> >> to look at each device and make sure they are what you think they are
> >> (e.g. consistent Event numbers etc).
> >> Then
> >>    mdadm --assemble /dev/mdWHATEVER ..list-of-devices...
> >>
> >> Then make sure that looks good.
> >> Then
> >>    mdadm /dev/mdWHATEVER --add new-device
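(In concrete terms, a sketch only, with placeholder device names; use
whatever names the three surviving members and the new drive actually
get on the new machine:

   mdadm -E /dev/sdX1 /dev/sdY1 /dev/sdZ1    # Event counts should agree
   mdadm --assemble /dev/md0 /dev/sdX1 /dev/sdY1 /dev/sdZ1
   cat /proc/mdstat                          # expect something like [4/3] [UUU_]
   mdadm /dev/md0 --add /dev/sdW1            # recovery progress then shows in mdstat
)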
> >>
> >> NeilBrown