From mboxrd@z Thu Jan  1 00:00:00 1970
From: Bill Davidsen
Subject: Re: some ?? re failed disk and resyncing of array
Date: Sun, 01 Feb 2009 14:41:37 -0500
Message-ID: <4985FAF1.2090208@tmr.com>
References: <1233389816.28363.1297740563@webmail.messagingengine.com>
 <49842A1E.1090105@dgreaves.com>
 <1233403388.29916.1297756217@webmail.messagingengine.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: 
In-Reply-To: <1233403388.29916.1297756217@webmail.messagingengine.com>
Sender: linux-raid-owner@vger.kernel.org
To: whollygoat@letterboxes.org
Cc: linux-raid@vger.kernel.org, David Greaves
List-Id: linux-raid.ids

whollygoat@letterboxes.org wrote:
> On Sat, 31 Jan 2009 10:38:22 +0000, "David Greaves"
> said:
>
>> whollygoat@letterboxes.org wrote:
>>
>>> On a boot a couple of days ago, mdadm failed a disk and
>>> started resyncing to spare (raid5, 6 drives, 5 active, 1
>>> spare). smartctl -H returned info (can't remember
>>> the exact text) that made me suspect the drive was
>>> fine, but the data connection was bad. Sure enough the
>>> data cable was damaged. Replaced the cable and smartctl
>>> sees the disk just fine and reports no errors.
>>>
>>> - I'd like to re-add the drive as a spare. Is it enough
>>> to "mdadm --add /dev/hdk" or do I need to prep the drive to
>>> remove any data that said where it previously belonged
>>> in the array?
>>>
>> That should work.
>> Any issues and you can zero the superblock (man mdadm)
>> No need to zero the disk.
>>
>
> Would --re-add be better?
>

I don't think so. And I would zero the superblock. The more detail you
put into preventing unwanted autodetection, the fewer learning
experiences you will have.
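
For what it's worth, here is roughly what I would run - treat it as a
sketch, not gospel. I'm assuming the array is /dev/md0 (as in your -D
output below) and guessing at the device name; if the array was built
on partitions it will be /dev/hdk1 rather than /dev/hdk, so check
before you type anything:

   mdadm --zero-superblock /dev/hdk    # wipe the stale md metadata
   mdadm /dev/md0 --add /dev/hdk       # add it back; it comes in as a spare
   cat /proc/mdstat                    # confirm it shows up as a spare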

> I've noticed something else since I made the initial post
>
> --------- begin output -------------
> fly:~# mdadm -D /dev/md0
> /dev/md0:
>         Version : 01.00.03
>   Creation Time : Sun Jan 11 21:49:36 2009
>      Raid Level : raid5
>      Array Size : 312602368 (298.12 GiB 320.10 GB)
>     Device Size : 156301184 (74.53 GiB 80.03 GB)
>    Raid Devices : 5
>   Total Devices : 5
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Fri Jan 30 15:52:01 2009
>           State : active
>  Active Devices : 5
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            Name : fly:FlyFileServ_md (local to host fly)
>            UUID : 0e2b9157:a58edc1d:213a220f:68a555c9
>          Events : 16
>
>     Number   Major   Minor   RaidDevice State
>        0      33        1        0      active sync   /dev/hde1
>        1      34        1        1      active sync   /dev/hdg1
>        2      56        1        2      active sync   /dev/hdi1
>        5      89        1        3      active sync   /dev/hdo1
>        6      88        1        4      active sync   /dev/hdm1
>
>
> fly:~# mdadm -E /dev/hdo1
> /dev/hdo1:
>           Magic : a92b4efc
>         Version : 01
>     Feature Map : 0x1
>      Array UUID : 0e2b9157:a58edc1d:213a220f:68a555c9
>            Name : fly:FlyFileServ_md (local to host fly)
>   Creation Time : Sun Jan 11 21:49:36 2009
>      Raid Level : raid5
>    Raid Devices : 5
>
>     Device Size : 234436336 (111.79 GiB 120.03 GB)
>      Array Size : 625204736 (298.12 GiB 320.10 GB)
>       Used Size : 156301184 (74.53 GiB 80.03 GB)
>    Super Offset : 234436464 sectors
>           State : clean
>     Device UUID : e072bd09:2df53d6d:d23321cc:cf2c37de
>
> Internal Bitmap : 2 sectors from superblock
>     Update Time : Fri Jan 30 15:52:01 2009
>        Checksum : 4689ff5 - correct
>          Events : 16
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>      Array Slot : 5 (0, 1, 2, failed, failed, 3, 4)
>     Array State : uuuUu 2 failed
> --------- end output -------------
>
> Why does the "Array Slot" field show 7 slots? And why
> does the "Array State" field show 2 failed? There
> were only ever 6 disks in the array. Only one of those
> is currently missing. mdadm -D above doesn't list any
> failed devices in the "Failed Devices" field.
>

No idea, but did you explicitly remove the failed drive? Was there a
failed drive at some time in the past? I've never seen this, but I
always remove drives (see the P.S. below for what I mean), which may
or may not be related.

> Thanks for your answers below as well. It's kind of
> what I was expecting. There was a h/w problem that
> took ages to track down and I think it was responsible
> for all the e2fs errors.
>
> WG
>
>
>>> - When I tried to list some files on one of the filesystems
>>> on the array (the fact that it took so long to react to
>>> the ls is how I discovered the box was in the middle of
>>> rebuilding to spare)
>>>
>> This is OK - resync involves a lot of IO and can slow things down. This
>> is tuneable.
>>
>>
>>> it couldn't find the file (or many
>>> others). I thought that resyncing was supposed to be
>>> transparent, yet parts of the fs seemed to be missing.
>>> Everything was there afterwards. Is that normal?
>>>
>> No. This is nothing to do with normal md resyncing and certainly not
>> expected.
>>
>>
>>> - On a subsequent boot I had to run e2fsck on the three
>>> filesystems housed on the array. Many stray blocks,
>>> illegal inodes, etc were found. An artifact of the rebuild
>>> or unrelated?
>>>
>> Well, you had a fault in your IO system; there's a good chance your IO
>> broke.
>>
>> Verify against a backup.
>>
>> David
>>
>>
>> --
>> "Don't worry, you'll be fine; I saw it work in a cartoon once..."
>>

-- 
Bill Davidsen
  "Woe unto the statesman who makes war without a reason that will
  still be valid when the war is over..."  Otto von Bismarck
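
P.S. Since I talked about removing drives above, and David mentioned
that the resync slowdown is tuneable, here is the sort of thing I mean.
The device names are examples only - substitute your own, and read
man mdadm before running any of it:

   # explicitly fail and remove a dead member so the metadata stays tidy
   mdadm /dev/md0 --fail /dev/hdk1 --remove /dev/hdk1

   # throttle resync if it is starving interactive IO
   # (values are in KB/sec per device)
   echo 1000  > /proc/sys/dev/raid/speed_limit_min
   echo 10000 > /proc/sys/dev/raid/speed_limit_max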