Re: Likely forced assemby with wrong disk during raid5 grow. Recoverable?


From: "Mathias Burén" <mathias.buren@gmail.com>
To: Claude Nobs <claudenobs@blunet.cc>
Cc: NeilBrown <neilb@suse.de>, linux-raid@vger.kernel.org
Subject: Re: Likely forced assemby with wrong disk during raid5 grow. Recoverable?
Date: Sun, 20 Feb 2011 14:47:54 +0000
Message-ID: <AANLkTink0VKa_yQpgxKqXsbxU67Kq++4RJEMW3p6UoBe@mail.gmail.com>
In-Reply-To: <AANLkTi=-guMf-8YJDMvq9ybyY9Fppi+W0pqhH2Of=mKd@mail.gmail.com>

On 20 February 2011 14:44, Claude Nobs <claudenobs@blunet.cc> wrote:
> On Sun, Feb 20, 2011 at 06:25, NeilBrown <neilb@suse.de> wrote:
>> On Sun, 20 Feb 2011 04:23:09 +0100 Claude Nobs <claudenobs@blunet.cc> wrote:
>>
>>> Hi All,
>>>
>>> I was wondering if someone might be willing to share if this array is
>>> recoverable.
>>>
>>
>> Probably is.  But don't do anything yet - any further action taken before you
>> have read all of the following email will probably cause more harm than good.
>>
>>> I had a clean, running raid5 using 4 block devices (two of those were
>>> 2 disk raid0 md devices) in RAID 5. Last night I decided it was safe
>>> to grow the array by one disk. But then a) a disk failed, b) a power
>>> loss occurred, c) i probably switched the wrong disk and forced
>>> assembly, resulting in an inconsistent state. Here is a complete set
>>> of actions taken :
>>
>> Providing this level of information is excellent!
>>
>>
>>>
>>> > bernstein@server:~$ sudo mdadm --grow --raid-devices=5 --backup-file=/raid.grow.backupfile /dev/md2
>>> > mdadm: Need to backup 768K of critical section..
>>> > mdadm: ... critical section passed.
>>> > bernstein@server:~$ cat /proc/mdstat
>>> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>> > md1 : active raid0 sdg1[1] sdf1[0]
>>> >       976770944 blocks super 1.2 64k chunks
>>> >
>>> > md2 : active raid5 sda1[5] md0[4] md1[3] sdd1[1] sdc1[0]
>>> >       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
>>> >       [>....................]  reshape =  1.6% (16423164/976760640) finish=902.2min speed=17739K/sec
>>> >
>>> > md0 : active raid0 sdh1[0] sdb1[1]
>>> >       976770944 blocks super 1.2 64k chunks
>>> >
>>> > unused devices: <none>
>>
>> All looks good so far.
>>
>>>
>>>
>>> now i thought /dev/sdg1 failed. unfortunately i have no log for this
>>> one, just my memory of seeing this changed to the one above :
>>>
>>> >       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/5] [UU_UU]
>>>
>>
>> Unfortunately it is not possible to know which drive is missing from the
>> above info.  The [numbers] in brackets don't exactly correspond to the
>> positions in the array that you might think they do.  The mdstat listing above
>> has numbers 0,1,3,4,5.
>>
>> They are the 'Number' column in the --detail output below.  This is /dev/md1
>> - I can tell from the --examine outputs, but it is a bit confusing.  Newer
>> versions of mdadm make this a little less confusing.  If you look for
>> patterns of U and u in the 'Array State' line, the U is 'this device' and a
>> 'u' is some other device.
>
> Actually this is running a stock Ubuntu 10.10 server kernel. But as
> it is from my memory it could very well have been :
>
>       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/5] [U_UUU]
>
>>
>> So /dev/md1 had a failure, so it could well have been sdg1.
>>
>>
>>> some 10 minutes later a power loss occurred, thanks to an ups the
>>> server shut down as with 'shutdown -h now'. now i exchanged /dev/sdg1,
>>> rebooted and in a lapse of judgement forced assembly:
>>
>> Perfect timing :-)
>>
>>>
>>> > bernstein@server:~$ sudo mdadm --assemble --run /dev/md2 /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1
>>> > mdadm: Could not open /dev/sda1 for write - cannot Assemble array.
>>> > mdadm: Failed to restore critical section for reshape, sorry.
>>
>> This isn't actually a 'forced assembly' as you seem to think.  There is no
>> '-f' or '--force'.  It didn't cause any harm.
>
> phew... at last some luck! that "Failed to restore critical section
> for reshape, sorry" really scared the hell out of me.
> But then again it got me paying attention and stop making things worse... :-)
>
>>
>>> >
>>> > bernstein@server:~$ sudo mdadm --detail /dev/md2
>>> > /dev/md2:
>>> >         Version : 01.02
>>> >   Creation Time : Sat Jan 22 00:15:43 2011
>>> >      Raid Level : raid5
>>> >   Used Dev Size : 976760640 (931.51 GiB 1000.20 GB)
>>> >    Raid Devices : 5
>>> >   Total Devices : 3
>>> > Preferred Minor : 3
>>> >     Persistence : Superblock is persistent
>>> >
>>> >     Update Time : Sat Feb 19 22:32:04 2011
>>> >           State : active, degraded, Not Started
>>                                        ^^^^^^^^^^^^
>>
>> mdadm has put the devices together as best it can, but has not started the
>> array because it didn't have enough devices.  This is good.
>>
>>
>>> >  Active Devices : 3
>>> > Working Devices : 3
>>> >  Failed Devices : 0
>>> >   Spare Devices : 0
>>> >
>>> >          Layout : left-symmetric
>>> >      Chunk Size : 64K
>>> >
>>> >   Delta Devices : 1, (4->5)
>>> >
>>> >            Name : master:public
>>> >            UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>>> >          Events : 133609
>>> >
>>> >     Number   Major   Minor   RaidDevice State
>>> >        0       8       33        0      active sync   /dev/sdc1
>>> >        1       0        0        1      removed
>>> >        2       0        0        2      removed
>>> >        4       9        0        3      active sync   /dev/block/9:0
>>> >        5       8        1        4      active sync   /dev/sda1
>>
>> So you now have 2 devices missing.  As long as we can find the devices,
>>  mdadm --assemble --force
>> should be able to put them together for you.  But let's see what we have...
>>
>>>
>>> so i reattached the old disk, got /dev/md1 back and did the
>>> investigation i should have done before :
>>>
>>> > bernstein@server:~$ sudo mdadm --examine /dev/sdd1
>>> > /dev/sdd1:
>>> >           Magic : a92b4efc
>>> >         Version : 1.2
>>> >     Feature Map : 0x4
>>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>>> >            Name : master:public
>>> >   Creation Time : Sat Jan 22 00:15:43 2011
>>> >      Raid Level : raid5
>>> >    Raid Devices : 5
>>> >
>>> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
>>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>>> >     Data Offset : 272 sectors
>>> >    Super Offset : 8 sectors
>>> >           State : clean
>>> >     Device UUID : 5e37fc7c:50ff3b50:de3755a1:6bdbebc6
>>> >
>>> >   Reshape pos'n : 489510400 (466.83 GiB 501.26 GB)
>>> >   Delta Devices : 1 (4->5)
>>> >
>>> >     Update Time : Sat Feb 19 22:23:09 2011
>>> >        Checksum : fd0c1794 - correct
>>> >          Events : 133567
>>> >
>>> >          Layout : left-symmetric
>>> >      Chunk Size : 64K
>>> >
>>> >     Array Slot : 1 (0, 1, failed, 2, 3, 4)
>>> >    Array State : uUuuu 1 failed
>>
>> This device thinks all is well.  The "1 failed" is misleading.  The
>>   uUuuu
>> pattern says that all the devices are thought to be working.
>> Note for later reference:
>>         Events: 133567
>>  Reshape pos'n : 489510400
>>
>>
>>> > bernstein@server:~$ sudo mdadm --examine /dev/sda1
>>> > /dev/sda1:
>>> >           Magic : a92b4efc
>>> >         Version : 1.2
>>> >     Feature Map : 0x4
>>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>>> >            Name : master:public
>>> >   Creation Time : Sat Jan 22 00:15:43 2011
>>> >      Raid Level : raid5
>>> >    Raid Devices : 5
>>> >
>>> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
>>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>>> >     Data Offset : 272 sectors
>>> >    Super Offset : 8 sectors
>>> >           State : clean
>>> >     Device UUID : baebd175:e4128e4c:f768b60f:4df18f77
>>> >
>>> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
>>> >   Delta Devices : 1 (4->5)
>>> >
>>> >     Update Time : Sat Feb 19 22:32:04 2011
>>> >        Checksum : 12c832c6 - correct
>>> >          Events : 133609
>>> >
>>> >          Layout : left-symmetric
>>> >      Chunk Size : 64K
>>> >
>>> >     Array Slot : 5 (0, failed, failed, failed, 3, 4)
>>> >    Array State : u__uU 3 failed
>>
>> This device thinks devices 1 and 2 have failed (the '_'s).
>> So 'sdd1' above, and md1.
>>        Events : 133609 - this has advanced a bit from sdd1
>>  Reshape Pos'n : 502815488 - this has advanced quite a lot.
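As a rough illustration of how the 'Array State' patterns are being read in this thread, the Python sketch below splits a pattern into 'this device', other working devices and missing slots. The function name decode_array_state is made up for this example and is not part of mdadm; the three patterns are copied from the --examine outputs posted in this thread, and slots are counted from 0.

```python
def decode_array_state(pattern):
    """Split an 'Array State' pattern into slot numbers (counted from 0)."""
    roles = {"this": [], "others": [], "missing": []}
    for slot, mark in enumerate(pattern):
        if mark == "U":
            roles["this"].append(slot)      # the device being examined
        elif mark == "u":
            roles["others"].append(slot)    # some other working device
        elif mark == "_":
            roles["missing"].append(slot)   # slot this device believes is gone
    return roles


# Patterns copied from the --examine outputs in this thread.
for name, pattern in [("sdd1", "uUuuu"), ("sda1", "u__uU"), ("md1", "u_Uuu")]:
    print(name, decode_array_state(pattern))
```

Read this way, sdd1 sits in slot 1 and md1 in slot 2, which matches the two '_' marks seen by sda1.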
>>
>>
>>> > bernstein@server:~$ sudo mdadm --examine /dev/sdc1
>>> > /dev/sdc1:
>>> >           Magic : a92b4efc
>>> >         Version : 1.2
>>> >     Feature Map : 0x4
>>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>>> >            Name : master:public
>>> >   Creation Time : Sat Jan 22 00:15:43 2011
>>> >      Raid Level : raid5
>>> >    Raid Devices : 5
>>> >
>>> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
>>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>>> >     Data Offset : 272 sectors
>>> >    Super Offset : 8 sectors
>>> >           State : clean
>>> >     Device UUID : 82f5284a:2bffb837:19d366ab:ef2e3d94
>>> >
>>> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
>>> >   Delta Devices : 1 (4->5)
>>> >
>>> >     Update Time : Sat Feb 19 22:32:04 2011
>>> >        Checksum : 8aa7d094 - correct
>>> >          Events : 133609
>>> >
>>> >          Layout : left-symmetric
>>> >      Chunk Size : 64K
>>> >
>>> >     Array Slot : 0 (0, failed, failed, failed, 3, 4)
>>> >    Array State : U__uu 3 failed
>>
>>  Reshape pos'n, Events, and Array State are identical to sda1.
>> So these two are in agreement.
>>
>>
>>> > bernstein@server:~$ sudo mdadm --examine /dev/md0
>>> > /dev/md0:
>>> >           Magic : a92b4efc
>>> >         Version : 1.2
>>> >     Feature Map : 0x4
>>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>>> >            Name : master:public
>>> >   Creation Time : Sat Jan 22 00:15:43 2011
>>> >      Raid Level : raid5
>>> >    Raid Devices : 5
>>> >
>>> >  Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
>>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>>> >     Data Offset : 272 sectors
>>> >    Super Offset : 8 sectors
>>> >           State : clean
>>> >     Device UUID : 83ecd60d:f3947a5e:a69c4353:3c4a0893
>>> >
>>> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
>>> >   Delta Devices : 1 (4->5)
>>> >
>>> >     Update Time : Sat Feb 19 22:32:04 2011
>>> >        Checksum : 1bbf913b - correct
>>> >          Events : 133609
>>> >
>>> >          Layout : left-symmetric
>>> >      Chunk Size : 64K
>>> >
>>> >     Array Slot : 4 (0, failed, failed, failed, 3, 4)
>>> >    Array State : u__Uu 3 failed
>>
>> again, exactly the same as sda1 and sdc1.
>>
>>> > bernstein@server:~$ sudo mdadm --examine /dev/md1
>>> > /dev/md1:
>>> >           Magic : a92b4efc
>>> >         Version : 1.2
>>> >     Feature Map : 0x4
>>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>>> >            Name : master:public
>>> >   Creation Time : Sat Jan 22 00:15:43 2011
>>> >      Raid Level : raid5
>>> >    Raid Devices : 5
>>> >
>>> >  Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
>>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>>> >     Data Offset : 272 sectors
>>> >    Super Offset : 8 sectors
>>> >           State : clean
>>> >     Device UUID : 3c7e2c3f:8b6c7c43:a0ce7e33:ad680bed
>>> >
>>> >   Reshape pos'n : 502809856 (479.52 GiB 514.88 GB)
>>> >   Delta Devices : 1 (4->5)
>>> >
>>> >     Update Time : Sat Feb 19 22:30:29 2011
>>> >        Checksum : 6c591e90 - correct
>>> >          Events : 133603
>>> >
>>> >          Layout : left-symmetric
>>> >      Chunk Size : 64K
>>> >
>>> >     Array Slot : 3 (0, failed, failed, 2, 3, 4)
>>> >    Array State : u_Uuu 2 failed
>>
>> And here is md1.  It thinks device 1 - sdd1 - has failed.
>>        Events : 133603 - slightly behind the 3 good devices, but well after
>>                                                  sdd1
>>  Reshape Pos'n : 502809856 - just a little before the 3 good devices too.
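The comparison of Events and Reshape pos'n across the five members can be repeated mechanically. The following is only a sketch: parse_examine is a hypothetical helper, the regular expressions assume the field labels look exactly as in the outputs quoted above, and the excerpts are cut down to the two fields of interest.

```python
import re


def parse_examine(text):
    """Extract the Events counter and Reshape pos'n from --examine text."""
    events = int(re.search(r"Events\s*:\s*(\d+)", text).group(1))
    reshape = int(re.search(r"Reshape pos'n\s*:\s*(\d+)", text).group(1))
    return events, reshape


# Two-line excerpts of the five --examine outputs quoted above.
EXCERPTS = {
    "sdd1": "Reshape pos'n : 489510400\nEvents : 133567",
    "sda1": "Reshape pos'n : 502815488\nEvents : 133609",
    "sdc1": "Reshape pos'n : 502815488\nEvents : 133609",
    "md0": "Reshape pos'n : 502815488\nEvents : 133609",
    "md1": "Reshape pos'n : 502809856\nEvents : 133603",
}

parsed = {dev: parse_examine(text) for dev, text in EXCERPTS.items()}
newest_events = max(events for events, _ in parsed.values())
furthest = max(pos for _, pos in parsed.values())

# How far each member lags behind the freshest superblock.
lag = {dev: (newest_events - ev, furthest - pos) for dev, (ev, pos) in parsed.items()}
for dev, (ev_lag, pos_lag) in sorted(lag.items(), key=lambda item: item[1]):
    print(f"{dev}: {ev_lag} events behind, reshape {pos_lag}K behind")
```

With these values md1 is 6 events and 5632K behind the three good devices, while sdd1 is 42 events and 13305088K behind.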
>>
>>>
>>> so obviously not /dev/sdd1 failed. however (due to that silly forced
>>> assembly?!) the reshape pos'n field of md0, sd[ac]1 differs from md1 a
>>> few bytes, resulting in an inconsistent state...
>>
>> The way I read it is:
>>
>>  sdd1 failed first - shortly after Sat Feb 19 22:23:09 2011 - the update time on sdd1
>> reshape continued until some time between Sat Feb 19 22:30:29 2011
>> and Sat Feb 19 22:32:04 2011 when md1 had a failure.
>> The reshape couldn't continue now, so it stopped.
>>
>> So the data on sdd1 is out of date (there has been about 8 minutes of reshape
>> since then) and cannot be used.
>> The data on md1 is very close to the rest.  The data that was in the process
>> of being relocated lives in two locations on the 'good' drives, both the new
>> and the old.  It only lives in the 'old' location on md1.
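The timeline above comes straight from the 'Update Time' fields, and the gaps can be checked with a few lines of Python. This is just a worked example: parse_update_time is an invented helper name, and the format string assumes English day and month names as printed in the outputs in this thread.

```python
from datetime import datetime


def parse_update_time(text):
    """Parse an 'Update Time' value as printed by mdadm --examine."""
    return datetime.strptime(text, "%a %b %d %H:%M:%S %Y")


sdd1 = parse_update_time("Sat Feb 19 22:23:09 2011")
md1 = parse_update_time("Sat Feb 19 22:30:29 2011")
good = parse_update_time("Sat Feb 19 22:32:04 2011")  # sda1, sdc1 and md0

gap_to_md1 = (md1 - sdd1).total_seconds()
gap_to_good = (good - sdd1).total_seconds()
print(f"sdd1 -> md1:  {gap_to_md1 / 60:.1f} min")
print(f"sdd1 -> good: {gap_to_good / 60:.1f} min")
```

So the reshape ran for somewhere between roughly 7 and 9 minutes after the last metadata update on sdd1.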
>>
>> So what we need to do is re-assemble the array, but telling it that the
>> reshape has only gone as far as md1 thinks it has.  This will make sure it
>> repeats that last part of the reshape.
>>
>> mdadm -Af should do that BUT IT DOESN'T.  Assuming I have thought through
>> this properly (and I should go through it again with more care), mdadm won't
>> do the right thing for you.  I need to get it to handle 'reshape' specially
>> when doing a --force assemble.
>
> exactly what i was thinking of doing, glad i waited and asked.
>
>>
>>>
>>> > bernstein@server:~$ sudo mdadm --assemble /dev/md2 /dev/sda1 /dev/md0 /dev/md1 /dev/sdd1 /dev/sdc1
>>> >
>>> > mdadm: /dev/md2 assembled from 3 drives - not enough to start the array.
>>> > bernstein@server:~$ cat /proc/mdstat
>>> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>> > md2 : inactive sdc1[0](S) sda1[5](S) md0[4](S) md1[3](S) sdd1[1](S)
>>> >       4883823704 blocks super 1.2
>>> >
>>> > md1 : active raid0 sdf1[0] sdg1[1]
>>> >       976770944 blocks super 1.2 64k chunks
>>> >
>>> > md0 : active raid0 sdb1[1] sdh1[0]
>>> >       976770944 blocks super 1.2 64k chunks
>>> >
>>> > unused devices: <none>
>>>
>>> i do have a backup but since recovery from it takes a few days, i'd
>>> like to know if there is a way to recover the array or if it's
>>> completely lost.
>>>
>>> Any suggestions gratefully received,
>>
>> The fact that you have a backup is excellent.  You might need it, but I hope
>> not.
>>
>> I would like to provide you with a modified version of mdadm which you can
>> then use to --force assemble the array.  It should be able to get you access
>> to all your data.
>> The array will be degraded and will finish reshape in that state.  Then you
>> will need to add sdd1 back in (Assuming you are confident that it works) and
>> it will be rebuilt.
>>
>> Just to go through some of the numbers...
>>
>> Chunk size is 64K.  Reshape was 4->5, so 3 -> 4 data disks.
>> So old stripes have 192K, new stripes have 256K.
>>
>> The 'good' disks think reshape has reached 502815488K which is
>> 1964123 new stripes. (2618830.66 old stripes)
>> md1 thinks reshape has only reached 489510400K which is 1912150
>> new stripes (2549533.33 old stripes).
>
> i think you mixed up sdd1 with md1 here? (the numbers above for md1
> are for sdd1. md1 would be :  reshape has reached 502809856K which
> would be 1964101 new stripes. so the difference between the good disks
> and md1 would be 22 stripes.)
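The stripe counts in this exchange can be reproduced with a short calculation. The sketch below takes the Reshape pos'n values in K as quoted in the thread, with 64K chunks, 3 data disks before the grow and 4 after; the variable names are only for illustration.

```python
CHUNK_K = 64
OLD_STRIPE_K = CHUNK_K * 3   # 4 devices -> 3 data chunks per stripe = 192K
NEW_STRIPE_K = CHUNK_K * 4   # 5 devices -> 4 data chunks per stripe = 256K

# Reshape pos'n in K; "good" stands for sda1, sdc1 and md0.
reshape_pos_k = {"good": 502815488, "md1": 502809856, "sdd1": 489510400}

new_stripes = {dev: pos // NEW_STRIPE_K for dev, pos in reshape_pos_k.items()}
old_stripes = {dev: pos / OLD_STRIPE_K for dev, pos in reshape_pos_k.items()}
behind = {dev: new_stripes["good"] - n for dev, n in new_stripes.items()}

for dev in reshape_pos_k:
    print(f"{dev}: {new_stripes[dev]} new stripes, "
          f"{old_stripes[dev]:.2f} old stripes, {behind[dev]} behind")

md1_gap_mib = behind["md1"] * NEW_STRIPE_K / 1024
print(f"md1 gap: {md1_gap_mib} MiB")
```

This gives 1964123 new stripes for the good devices, 1964101 for md1 (22 behind, i.e. 5.5M) and 1912150 for sdd1 (51973 behind).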
>
>>
>> So of the 51973 stripes that have been reshaped since the last metadata
>> update on sdd1, some will have been done on sdd1, but some not, and we don't
>> really know how many.  But it is perfectly safe to repeat those stripes
>> as all writes to that region will have been suspended (and you probably
>> weren't writing anyway).
>
> jep there was nothing writing to the array. so now i am a little
> confused, if you meant sdd1 (which failed first is 51973 stripes
> behind) this would imply that at least so many stripes of data are
> kept of the old (3 data disks) configuration as well as the new one?
> if continuing from there is possible then the array would no longer be
> degraded right? so i think you meant md1 (22 stripes behind), as
> keeping 5.5M of data from the old and new config seems more
> reasonable. however this is just a guess :-)
>
>>
>> So I need to change the loop in Assemble.c which calls ->update_super
>> with "force-one" to also make sure the reshape_position in the 'chosen'
>> superblock matches the oldest 'forced' superblock.
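To make the idea concrete, here is a small Python model of it. This is not the code in Assemble.c and does not use any mdadm interface: Superblock and plan_forced_assembly are invented for this sketch, and it assumes that a 'forced' member is one whose event count is behind the freshest superblock but which has to be accepted to reach the number of devices needed.

```python
from collections import namedtuple

# Hypothetical stand-in for a member's metadata; reshape_pos is in K.
Superblock = namedtuple("Superblock", "name events reshape_pos")


def plan_forced_assembly(blocks, needed):
    """Pick the `needed` freshest members and the position to resume from."""
    ranked = sorted(blocks, key=lambda b: b.events, reverse=True)
    members = ranked[:needed]
    chosen = members[0]                                   # freshest superblock
    forced = [b for b in members if b.events < chosen.events]
    # Rewind to the oldest forced member so its stripes get reshaped again.
    resume = min(b.reshape_pos for b in [chosen] + forced)
    return [b.name for b in members], resume


blocks = [
    Superblock("sdc1", 133609, 502815488),
    Superblock("sdd1", 133567, 489510400),
    Superblock("md1", 133603, 502809856),
    Superblock("md0", 133609, 502815488),
    Superblock("sda1", 133609, 502815488),
]
members, resume = plan_forced_assembly(blocks, needed=4)
print(members, resume)
```

With the values from this thread the model leaves sdd1 out, forces md1 in, and resumes the reshape from md1's position of 502809856K.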
>
> uh... ah... probably, i have zero knowledge of kernel code :-)
> i guess it should take into account that the oldest superblock (sdd1
> in this case) may already be out of the section where the data (in the
> old config) still exists? but i guess you already thought of that...
>
>>
>> So if you are able to wait a day, I'll try to write a patch first thing
>> tomorrow and send it to you.
>
> sure, that would be awesome! that boils down to compiling the patched
> kernel doesn't it? this will probably take a few days as the system is
> quite slow and i'd have to get up to speed with kernel compiling. but
> shouldn't be a problem. would i have to patch the ubuntu kernel (based
> on 2.6.35.4) or the latest 2.6.38-rc from kernel.org?
>
>>
>> Thanks for the excellent problem report.
>>
>> NeilBrown
>
> Well i thank you for providing such an elaborate and friendly answer!
> this is actually my first mailing list post and considering how many
> questions get ignored (don't know about this list though) i just hoped
> someone would at least answer with a one liner... i never expected
> this. so thanks again.
>
> Claude
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Just a quick FYI, you can find (new, and unreleased) Ubuntu kernels
here: http://kernel.ubuntu.com/~kernel-ppa/mainline/

// Mathias
