From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anshuman Aggarwal
Subject: Re: Growing raid 5: Failed to reshape
Date: Sat, 22 Aug 2009 10:05:28 +0530
Message-ID: <3FA7DE88-932D-4194-9195-E7CBA93D5432@gmail.com>
References: <5c45fce80908211231v9238a12i3829ad5d1b107df5@mail.gmail.com>
 <2735df411d9ed83a9d11664f595d6dfc.squirrel@neil.brown.name>
 <121580D1-2950-43FB-AD1F-B235D1160932@gmail.com>
 <112F4A08-5E5C-4132-A233-6898D24B1D74@gmail.com>
Mime-Version: 1.0 (Apple Message framework v936)
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: NeilBrown
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

Well, I was so relieved on seeing what looked like my data that I
didn't wait for this last mail... and already started the grow
operation again!

What is the best way to have the array check itself out now, to make
sure there are no data inconsistencies? I guess I should wait for the
grow operation to complete first?

Will a controlled system shutdown hurt the grow operation (I have an
APC UPS which shuts down my machine well in time when there is an
outage)? I am hoping it will resume from where it left off, since the
critical section has passed?

Also, one observation about mdadm that may especially interest you: I
have tried both 2.6.7 and 3.0 (the final June version) of mdadm with
kernel 2.6.30.4...

* mdadm 3.0 wouldn't grow the array:

/Src/mdadm-3.0# ./mdadm --grow /dev/md127 -n 4
mdadm: Need to backup 384K of critical section..
mdadm: /dev/md127: failed to save critical region

I resorted to using the mdadm 2.6.7 that came with Ubuntu...

Thanks,
Anshuman

On 22-Aug-09, at 9:44 AM, NeilBrown wrote:

> On Sat, August 22, 2009 1:55 pm, Anshuman Aggarwal wrote:
>> I have just sent in another mail with the mdadm examine details from
>> the 3 + 1 (grown) partitions. I am sure of the device names, but not
>> sure of the order (which examine does tell me).
>> Here are the devices, in order (I think): /dev/sdb, /dev/sdd5,
>> /dev/sdc5 + /dev/sda2, with the dd output you requested:
>
> Thanks.
> /dev/sdb and /dev/sdd5 definitely look correct.
> I am very suspicious of the others though. If the metadata has been
> destroyed, it is entirely possible that some of the data has been
> corrupted as well.
>
> As you only need two drives to recover your data, and you have two
> drives that look good, I suggest that you just use those.
> So:
>
>    mdadm --create /dev/md0 -l5 -n3 -e1.2 --name raid5_280G \
>        /dev/sdb /dev/sdd5 missing
>
> The first thing to do is --examine sdb and sdd5 and make sure that
> "Data Offset" is 272. It probably will be, but different versions of
> mdadm used different offsets, and you need to be sure.
> Assuming it is 272, your data should be safe and you can "fsck" and
> "mount" just to confirm that.
>
> Then add sdc5 and sda2 and let the array recover the missing device.
> Once that is done you can try the --grow again.
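>
> Roughly, the whole sequence would look something like this (just a
> sketch: double-check the device names against --examine first, and
> the fsck line assumes an ext filesystem directly on the array):
>
>    # 1. confirm the data offset on the two good devices
>    mdadm --examine /dev/sdb /dev/sdd5 | grep 'Data Offset'
>
>    # 2. recreate the array degraded, then check it read-only
>    mdadm --create /dev/md0 -l5 -n3 -e1.2 --name raid5_280G \
>        /dev/sdb /dev/sdd5 missing
>    fsck -n /dev/md0
>    mount -o ro /dev/md0 /mnt
>
>    # 3. if the data looks good, re-add the other devices and let
>    #    the recovery finish before growing
>    mdadm /dev/md0 --add /dev/sdc5
>    mdadm /dev/md0 --add /dev/sda2
>    mdadm --grow /dev/md0 --raid-devices=4
>
>    # 4. afterwards, you can ask md to verify parity consistency
>    echo check > /sys/block/md0/md/sync_action
>    cat /sys/block/md0/md/mismatch_cnt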
>
> NeilBrown
>
>
>>
>> ----------------------------------
>> dd if=/dev/sdb skip=8 count=2 | od -x
>>
>> 2+0 records in
>> 2+0 records out
>> 1024 bytes (1.0 kB) copied, 5.6394e-05 s, 18.2 MB/s
>> 0000000 4efc a92b 0001 0000 0000 0000 0000 0000
>> 0000020 5f49 6866 e1f1 102d 5299 920f 1976 87b4
>> 0000040 4147 4554 4157 3a59 6172 6469 5f35 3832
>> 0000060 4730 0000 0000 0000 0000 0000 0000 0000
>> 0000100 2b74 4a73 0000 0000 0005 0000 0002 0000
>> 0000120 2900 22ef 0000 0000 0080 0000 0003 0000
>> 0000140 0002 0000 0000 0000 0300 0000 0000 0000
>> 0000160 0000 0000 0000 0000 0000 0000 0000 0000
>> 0000200 0110 0000 0000 0000 6580 22ef 0000 0000
>> 0000220 0008 0000 0000 0000 0000 0000 0000 0000
>> 0000240 0000 0000 0000 0000 a272 abb3 8be3 62a6
>> 0000260 c0bd c0a0 990e 583b 0000 0000 0000 0000
>> 0000300 209f 4a8e 0000 0000 3508 0000 0000 0000
>> 0000320 ffff ffff ffff ffff 24a1 59e3 0180 0000
>> 0000340 0000 0000 0000 0000 0000 0000 0000 0000
>> *
>> 0000400 0000 fffe fffe 0002 0001 ffff ffff ffff
>> 0000420 ffff ffff ffff ffff ffff ffff ffff ffff
>> *
>> 0002000
>> ------------------------------------
>> dd if=/dev/sdd5 skip=8 count=2 | od -x
>> 2+0 records in
>> 2+0 records out
>> 1024 bytes (1.0 kB) copied, 0.0104253 s, 98.2 kB/s
>> 0000000 4efc a92b 0001 0000 0004 0000 0000 0000
>> 0000020 5f49 6866 e1f1 102d 5299 920f 1976 87b4
>> 0000040 4147 4554 4157 3a59 6172 6469 5f35 3832
>> 0000060 4730 0000 0000 0000 0000 0000 0000 0000
>> 0000100 2b74 4a73 0000 0000 0005 0000 0002 0000
>> 0000120 2900 22ef 0000 0000 0080 0000 0004 0000
>> 0000140 0002 0000 0005 0000 0000 0000 0000 0000
>> 0000160 0001 0000 0002 0000 0080 0000 0000 0000
>> 0000200 0110 0000 0000 0000 2974 22ef 0000 0000
>> 0000220 0008 0000 0000 0000 0000 0000 0000 0000
>> 0000240 0004 0000 0000 0000 4a75 cfe1 eebb 8205
>> 0000260 60f6 89ec 88a8 d300 0000 0000 0000 0000
>> 0000300 21c2 4a8e 0000 0000 350d 0000 0000 0000
>> 0000320 0000 0000 0000 0000 81fb e184 0180 0000
>> 0000340 0000 0000 0000 0000 0000 0000 0000 0000
>> *
>> 0000400 0000 fffe fffe 0002 0001 0003 ffff ffff
>> 0000420 ffff ffff ffff ffff ffff ffff ffff ffff
>> *
>> 0002000
>> ------------------------------------
>> dd if=/dev/sdc5 skip=8 count=2 | od -x
>> 2+0 records in
>> 2+0 records out
>> 1024 bytes (1.0 kB) copied, 0.0102071 s, 100 kB/s
>> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
>> *
>> 0002000
>> --------------------------------------
>> The following is probably just junk, since it is not even
>> initialized:
>>
>> dd if=/dev/sda1 skip=8 count=2 | od -x
>> 2+0 records in
>> 2+0 records out
>> 1024 bytes (1.0 kB) copied, 0.0127419 s, 80.4 kB/s
>> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
>> *
>> 0000200 0000 0000 0000 0000 4cf4 0000 0000 0000
>> 0000220 0000 0000 0000 0000 0000 0000 0000 0000
>> 0000240 0004 0000 0000 0000 e807 6452 6558 e0a3
>> 0000260 a04b 494c 11a6 8b3b 0000 0000 0000 0000
>> 0000300 0000 0000 0000 0000 0002 0000 0000 0000
>> 0000320 0000 0000 0000 0000 a1e8 b863 0000 0000
>> 0000340 0000 0000 0000 0000 0000 0000 0000 0000
>> *
>> 0002000
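>>
>> A note on reading these dumps: od -x prints little-endian 16-bit
>> words, so the md superblock magic a92b4efc shows up at offset 0 as
>> "4efc a92b". It is there on sdb and sdd5, and sdc5 is all zeroes.
>> To see the bytes in on-disk order instead, something like this
>> works:
>>
>>    dd if=/dev/sdb skip=8 count=2 | od -A x -t x1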
>>
>> Thanks,
>> Anshuman
>>
>>
>> On 22-Aug-09, at 8:58 AM, NeilBrown wrote:
>>
>>> On Sat, August 22, 2009 12:41 pm, Anshuman Aggarwal wrote:
>>>> Neil,
>>>> Thanks for your input. It's great to have some hand holding when
>>>> your heart is in your mouth.
>>>>
>>>> Here is some more explanation:
>>>>
>>>> I have another raid array on the same disks in different
>>>> partitions, and there was a grow operation happening on those as
>>>> well at the time (which completed splendidly after the power
>>>> outage). From what I have observed so far, when there is heavy
>>>> activity on the disk due to one array, the kernel puts the other
>>>> tasks in a DELAYED status. (I have done it this way because I have
>>>> 4 different-sized disks purchased over time.)
>>>>
>>>> I had given the grow command before I realized that the other grow
>>>> operation had not completed on the other partitions.
>>>>
>>>> * The critical section status from mdadm was stuck (apparently
>>>> waiting for the grow on the other partitions to complete). Hence
>>>> it did not complete as quickly as it should have.
>>>> * Because it kept waiting for the other md operations on the disk
>>>> to complete, the critical section didn't get written (my guess; it
>>>> is also possible that the disk was so busy that it took more than
>>>> an hour, but that is unlikely).
>>>>
>>>> Please tell me if this additional info changes our approach to
>>>> fixing this?
>>>
>>> I understand now (and on reflection, your original email had enough
>>> information that I should have picked up on it). When there is a
>>> resync happening on one partition of a drive, md will not start a
>>> resync on any other partition of that drive, because running both
>>> at once would significantly reduce performance and increase the
>>> total time to completion.
>>> This applies equally to recovery and reshape.
>>>
>>> So while the first reshape was happening, the second would not have
>>> started at all. This confirms that no data will have been
>>> relocated, so a correct '--create' will get your data back
>>> correctly.
>>>
>>> I should change mdadm to not try starting a reshape if it won't
>>> proceed, as it could cause real problems if the start of the
>>> reshape blocks for too long.
>>>
>>> This still doesn't explain why you lost some metadata though.
>>> If it updated one of the devices, it should have updated all of
>>> them, as it does the update in parallel.
>>>
>>> Would you be able to run:
>>>
>>>    dd if=/dev/WHATEVER skip=8 count=2 | od -x
>>>
>>> where 'WHATEVER' is each of the different devices that you think is
>>> in the array? That might give me some clue.
>>>
>>> My recommendation for how to fix it remains the same. I now have
>>> more confidence that it will work. You need to be sure which device
>>> is which though.
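>>>
>>> A quick way to double-check is something like this (a sketch; it
>>> will simply report nothing useful for devices with no superblock):
>>>
>>>    for d in /dev/WHATEVER1 /dev/WHATEVER2 /dev/WHATEVER3; do
>>>        echo "== $d"
>>>        mdadm --examine $d | grep -E 'Device UUID|Array Slot'
>>>    done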
>>>
>>> NeilBrown
>>>
>>>
>>>>
>>>> I do have a UPS with an hour of backup, but I recently moved back
>>>> to my home country, India, where the power supply will probably
>>>> *NEVER* be continuous enough for a long md operation :). Hence,
>>>> I'm definitely one to vote for recoverable moves (which mdadm and
>>>> the kernel have been pretty good at so far).
>>>>
>>>> Thanks,
>>>> Anshuman
>>>>
>>>> On 22-Aug-09, at 3:00 AM, NeilBrown wrote:
>>>>
>>>>> On Sat, August 22, 2009 5:31 am, Anshuman Aggarwal wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Here is my problem and configuration:
>>>>>>
>>>>>> I had a 3-partition raid5 array to which I added a 4th disk and
>>>>>> tried to grow the raid5 by adding the partition on the 4th disk
>>>>>> and then growing it. Unfortunately, since another sync task was
>>>>>> happening on the same disks, the operation to move the critical
>>>>>> section did not complete before the machine was shut down by the
>>>>>> UPS (in control, not a crash) due to low battery.
>>>>>>
>>>>>> Kernel: 2.6.30.4; mdadm (tried 2.6.7 and 3.0)
>>>>>>
>>>>>> Now, only 1 of my 3 partitions has the superblock; the other 2
>>>>>> and the new 4th one do not have anything.
>>>>>
>>>>> It is very strange that only one partition has a superblock.
>>>>> I cannot imagine any way that could have happened short of
>>>>> changing the partition tables or deliberately destroying them.
>>>>> I feel the need to ask "are you sure?", though presumably you are
>>>>> or you wouldn't have said so...
>>>>
>>>> I am positive (at least from the output of mdadm) that no
>>>> superblock exists on the other partitions. I am also sure that I
>>>> am not fumbling the partition device names.
>>>>
>>>>>
>>>>>>
>>>>>> Here is the output of a few mdadm commands.
>>>>>>
>>>>>> $ mdadm --misc --examine /dev/sdd5
>>>>>> /dev/sdd5:
>>>>>> Magic : a92b4efc
>>>>>> Version : 1.2
>>>>>> Feature Map : 0x4
>>>>>> Array UUID : 495f6668:f1e12d10:99520f92:7619b487
>>>>>> Name : GATEWAY:raid5_280G (local to host GATEWAY)
>>>>>> Creation Time : Fri Jul 31 23:05:48 2009
>>>>>> Raid Level : raid5
>>>>>> Raid Devices : 4
>>>>>>
>>>>>> Avail Dev Size : 586099060 (279.47 GiB 300.08 GB)
>>>>>> Array Size : 1758296832 (838.42 GiB 900.25 GB)
>>>>>> Used Dev Size : 586098944 (279.47 GiB 300.08 GB)
>>>>>> Data Offset : 272 sectors
>>>>>> Super Offset : 8 sectors
>>>>>> State : active
>>>>>> Device UUID : 754ae1cf:bbee0582:f660ec89:a88800d3
>>>>>>
>>>>>> Reshape pos'n : 0
>>>>>> Delta Devices : 1 (3->4)
>>>>>
>>>>> It certainly looks like it didn't get very far, though we cannot
>>>>> know that for certain from this.
>>>>> mdadm should have copied the first 4 chunks (256K) to somewhere
>>>>> near the end of the new device, then allowed the reshape to
>>>>> continue.
>>>>> It is possible that the reshape had written to some of these
>>>>> early blocks. If it did, we need to recover that backed-up data.
>>>>> I should probably add functionality to mdadm to find and recover
>>>>> such a backup....
>>>>>
>>>>> For now your best bet is to simply try to recreate the array,
>>>>> i.e. something like
>>>>>
>>>>>    mdadm -C /dev/md0 -l5 -n3 -e 1.2 --name "raid5_280G" \
>>>>>        --assume-clean /dev/sdc5 /dev/sdd5 /dev/sde5
>>>>>
>>>>> You need to make sure that you get the right devices in the right
>>>>> order. From the information you gave I only know for certain that
>>>>> /dev/sdd5 is the middle of the three.
>>>>>
>>>>> This will write new superblocks and assemble the array, but will
>>>>> not change any of the data. You can then access the array
>>>>> read-only and see if the data looks like it is all there. If it
>>>>> isn't, stop the array and try to work out why.
>>>>> If it is, you can try to grow the array again, this time with a
>>>>> more reliable power supply ;-)
>>>>>
>>>>> Speaking of which... just how long was it between when you
>>>>> started the grow and when the power shut off? It really shouldn't
>>>>> be more than a few seconds, even if other things are happening on
>>>>> the system (normally it would be a few hundred milliseconds at
>>>>> most).
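>>>>>
>>>>> When you retry the grow, it is also worth pointing mdadm at an
>>>>> explicit backup file on a device outside the array, something
>>>>> like this (a sketch; the path is only an example):
>>>>>
>>>>>    # keep the backup file off the array being reshaped
>>>>>    mdadm --grow /dev/md0 --raid-devices=4 \
>>>>>        --backup-file=/root/md0-grow.backup
>>>>>
>>>>> Then if the machine does go down during the critical section, the
>>>>> same file can be handed back at assembly time with
>>>>> "mdadm --assemble ... --backup-file=/root/md0-grow.backup".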
>>>>>
>>>>> Good luck,
>>>>> NeilBrown
>>>>>
>>>>>
>>>>>>
>>>>>> Update Time : Fri Aug 21 09:55:38 2009
>>>>>> Checksum : e18481fb - correct
>>>>>> Events : 13581
>>>>>>
>>>>>> Layout : left-symmetric
>>>>>> Chunk Size : 64K
>>>>>>
>>>>>> Array Slot : 4 (0, failed, failed, 2, 1, 3)
>>>>>> Array State : uUuu 2 failed
>>>>>>
>>>>>> $ mdadm --assemble --scan
>>>>>> mdadm: Failed to restore critical section for reshape, sorry.
>>>>>>
>>>>>> I am positive that none of the actual growing steps even
>>>>>> started, so my data 'should' be safe as long as I can recreate
>>>>>> the superblocks, right?
>>>>>>
>>>>>> As always, appreciate the help of the open source community.
>>>>>> Thanks!!
>>>>>>
>>>>>> Thanks,
>>>>>> Anshuman
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> linux-raid" in the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html