Fail to assemble raid4 with replaced disk

* Fail to assemble raid4 with replaced disk
@ 2016-10-25 17:08 Santiago DIEZ
  2016-10-25 17:35 ` Mikael Abrahamsson
  2016-10-25 17:50 ` Wols Lists
  0 siblings, 2 replies; 6+ messages in thread
From: Santiago DIEZ @ 2016-10-25 17:08 UTC (permalink / raw)
  To: Linux Raid LIST

Hi Raiders,

I had a raid5 array md10 with sd[abcd]10.
Eventually, sdd10 failed.

I did NOT do any mdadm --fail NOR mdadm --remove command.
What I did is comment out the line "ARRAY /dev/md10 ..." in
/etc/mdadm/mdadm.conf.

Then I powered off the server, replaced the disk sdd with a new one
and booted the system.

I examined the status with:
# cat /proc/mdstat
md10 : inactive sdb10[1]
      1926247296 blocks

I stopped the array with:
# mdadm --stop /dev/md10

I tried to assemble the array with the 3 original disks like this:
# mdadm --assemble /dev/md10 --verbose /dev/sda10 /dev/sdb10 /dev/sdc10
mdadm: looking for devices for /dev/md10
mdadm: /dev/sda10 is identified as a member of /dev/md10, slot 0.
mdadm: /dev/sdb10 is identified as a member of /dev/md10, slot 1.
mdadm: /dev/sdc10 is identified as a member of /dev/md10, slot 2.
mdadm: added /dev/sda10 to /dev/md10 as 0 (possibly out of date)
mdadm: added /dev/sdc10 to /dev/md10 as 2 (possibly out of date)
mdadm: no uptodate device for slot 3 of /dev/md10
mdadm: added /dev/sdb10 to /dev/md10 as 1
mdadm: /dev/md10 assembled from 1 drive - not enough to start the array.

I examined the status again with:
# cat /proc/mdstat
md10 : inactive sdb10[1](S) sdc10[2](S) sda10[0](S)
      5778741888 blocks

Now I'm SCARED!
What does the (S) mean?
How do I reassemble my array and add the new sdd10 partition?

Thanks for your help

Regards
-------------------------
Santiago DIEZ
Quark Systems & CAOBA
23 rue du Buisson Saint-Louis, 75010 Paris
-------------------------

* Re: Fail to assemble raid4 with replaced disk
  2016-10-25 17:08 Fail to assemble raid4 with replaced disk Santiago DIEZ
@ 2016-10-25 17:35 ` Mikael Abrahamsson
  2016-10-25 17:50 ` Wols Lists
  1 sibling, 0 replies; 6+ messages in thread
From: Mikael Abrahamsson @ 2016-10-25 17:35 UTC (permalink / raw)
  To: Santiago DIEZ; +Cc: Linux Raid LIST

On Tue, 25 Oct 2016, Santiago DIEZ wrote:

> # mdadm --assemble /dev/md10 --verbose /dev/sda10 /dev/sdb10 /dev/sdc10
> mdadm: looking for devices for /dev/md10
> mdadm: /dev/sda10 is identified as a member of /dev/md10, slot 0.
> mdadm: /dev/sdb10 is identified as a member of /dev/md10, slot 1.
> mdadm: /dev/sdc10 is identified as a member of /dev/md10, slot 2.
> mdadm: added /dev/sda10 to /dev/md10 as 0 (possibly out of date)
> mdadm: added /dev/sdc10 to /dev/md10 as 2 (possibly out of date)
> mdadm: no uptodate device for slot 3 of /dev/md10
> mdadm: added /dev/sdb10 to /dev/md10 as 1
> mdadm: /dev/md10 assembled from 1 drive - not enough to start the array.

This means sda10 and sdc10 most likely have a lower event count than 
sdb10.

> I examined the status again with:
> # cat /proc/mdstat
> md10 : inactive sdb10[1](S) sdc10[2](S) sda10[0](S)
>      5778741888 blocks
>
> Now I'm SCARED!
> What does the (S) mean?
> How do I reassemble my array and add the new sdd10 partition?

Check with mdadm -E /dev/sd[abc]10 and compare the event counts. If they
differ just a little (5-10 perhaps), you can use --assemble --force to start
the array even though the event count is not exactly the same on each drive.

The event count is increased every time a drive is written to. When there
is an unclean shutdown, mdadm won't auto-assemble the drives; it needs
operator intervention to understand the situation and act accordingly.
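
As a rough illustration of the comparison described above, the sketch
below computes the gap between the highest and lowest event count. The
helper name event_spread is hypothetical, and the usage comment assumes
that "mdadm -E" prints a line of the form "Events : <integer>"; some
metadata versions may format that line differently.

```shell
#!/bin/sh
# Sketch only: event_spread is a hypothetical helper, not part of mdadm.
# It reads "device count" pairs on stdin and prints the gap between the
# highest and the lowest event count.
event_spread() {
    awk '
        { n = $2 + 0 }                      # force a numeric comparison
        NR == 1 { min = n; max = n }
        n < min { min = n }
        n > max { max = n }
        END { if (NR > 0) print max - min }
    '
}

# Possible usage as root (assumes an "Events : <integer>" line in the output):
#   for d in /dev/sda10 /dev/sdb10 /dev/sdc10; do
#       printf '%s %s\n' "$d" "$(mdadm -E "$d" | awk '/Events/ { print $NF }')"
#   done | event_spread
```

A small gap suggests the members were only briefly out of step; whether a
given gap is small enough for --force remains a judgment call.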


-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* Re: Fail to assemble raid4 with replaced disk
  2016-10-25 17:08 Fail to assemble raid4 with replaced disk Santiago DIEZ
  2016-10-25 17:35 ` Mikael Abrahamsson
@ 2016-10-25 17:50 ` Wols Lists
  2016-10-27 14:11   ` Santiago DIEZ
  1 sibling, 1 reply; 6+ messages in thread
From: Wols Lists @ 2016-10-25 17:50 UTC (permalink / raw)
  To: Santiago DIEZ, Linux Raid LIST

On 25/10/16 18:08, Santiago DIEZ wrote:
> Hi Raiders,

This looks like a fairly simple recovery job - but you will probably
lose a little data - fsck will moan about a few new files being corrupted.

Firstly, DON'T DO ANYTHING WITH THE RAID.

Secondly, go to the linux raid wiki
https://raid.wiki.kernel.org/index.php/Linux_Raid and read section 4
"When things go wrogn". You've messed up replacing the failed drive, and
are now at "My raid won't assemble/run". But as I say, it doesn't look
particularly serious.
> 
> I had a raid5 array md10 with sd[abcd]10.
> Eventually, sdd10 failed.
> 
> I did NOT do any mdadm --fail NOR mdadm --remove command.
> What I did is comment out the line "ARRAY /dev/md10 ..." in
> /etc/mdadm/mdadm.conf.

mdadm.conf is somewhat of a relic from a bygone age, I believe. It used
to be necessary; in the new world of raid superblocks it is mostly
ignored and redundant.
> 
> Then I powered off the server, replaced the disk sdd with a new one
> and booted the system.
> 
> I examined the status with:
> # cat /proc/mdstat
> md10 : inactive sdb10[1]
>       1926247296 blocks
> 
> I stopped the array with:
> # mdadm --stop /dev/md10
> 
> I tried to assemble the array with the 3 original disks like this
> # mdadm --assemble /dev/md10 --verbose /dev/sda10 /dev/sdb10 /dev/sdc10
> mdadm: looking for devices for /dev/md10
> mdadm: /dev/sda10 is identified as a member of /dev/md10, slot 0.
> mdadm: /dev/sdb10 is identified as a member of /dev/md10, slot 1.
> mdadm: /dev/sdc10 is identified as a member of /dev/md10, slot 2.
> mdadm: added /dev/sda10 to /dev/md10 as 0 (possibly out of date)
> mdadm: added /dev/sdc10 to /dev/md10 as 2 (possibly out of date)
> mdadm: no uptodate device for slot 3 of /dev/md10
> mdadm: added /dev/sdb10 to /dev/md10 as 1
> mdadm: /dev/md10 assembled from 1 drive - not enough to start the array.

Okay. It's got three drives. When you've done what "Asking for help"
says, you should have event counts for all those three drives -
sd[abc]10. Hopefully they're all pretty much the same. If they are, a
simple "--assemble --force" should get your array up and running again.

The complaint about slot 3 is because you haven't removed the old sdd10,
and the new sdd10 isn't part of the array, it has no superblock.
> 
> I examined the status again with:
> # cat /proc/mdstat
> md10 : inactive sdb10[1](S) sdc10[2](S) sda10[0](S)
>       5778741888 blocks
> 
> Now I'm SCARED!
> What does the (S) mean?
> How do I reassemble my array and add the new sdd10 partition?
> 
> Thanks for your help
> 
Okay. That leaves your recovery path neatly mapped out. Get the event
counts of the three remaining drives and post them here. Wait for an
expert to muck in and say it all looks good. Then:

Assemble the array with --force
Remove the old sdd10
Add the new sdd10
Run a fsck.
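
The steps above can be sketched as a small shell helper that only prints
the commands for review rather than running them. The function name
recovery_plan is hypothetical. The removal step is left out on the
assumption that the failed disk is already physically gone and holds no
slot in the assembled array, and e2fsck assumes an ext2/3/4 filesystem
directly on the array.

```shell
#!/bin/sh
# Sketch only: recovery_plan is a hypothetical helper. It prints the
# commands for review instead of executing them.
# Usage: recovery_plan ARRAY NEW_MEMBER SURVIVING_MEMBER...
recovery_plan() {
    array=$1
    new=$2
    shift 2
    # 1. Force assembly from the surviving members despite differing event counts.
    echo "mdadm --assemble --force --verbose $array $*"
    # 2. Add the replacement partition; md then rebuilds onto it.
    echo "mdadm --manage $array --add $new"
    # 3. Read-only filesystem check (-n answers "no" to every repair prompt).
    echo "e2fsck -f -n $array"
}

recovery_plan /dev/md10 /dev/sdd10 /dev/sda10 /dev/sdb10 /dev/sdc10
```

Printing rather than executing keeps the operator in the loop, which
matters because --force overrides a safety check.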

And your array should all be back fine. One thing - the wiki bangs on
about the timeout problem. Is that your problem? Because if it is, you
will have grief trying to get the array back unless you fix that as your
very first step.
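
If the timeout problem is the mismatch usually meant by that name (a
drive's internal error recovery outlasting the kernel's command timeout),
the two commonly cited remedies can be sketched as below. The helper
timeout_cmds is hypothetical and only prints commands; the 7-second and
180-second values are the figures usually quoted, not something stated in
this thread. It expects whole-disk devices such as /dev/sda, not
partitions.

```shell
#!/bin/sh
# Sketch only: timeout_cmds is a hypothetical helper that prints, for each
# whole-disk device given, two candidate commands for review.
timeout_cmds() {
    for dev in "$@"; do
        name=${dev##*/}    # /dev/sda -> sda
        # Where the drive supports SCT ERC: cap its error recovery at
        # 7.0 seconds (the values are in tenths of a second).
        echo "smartctl -l scterc,70,70 $dev"
        # Otherwise: raise the kernel's command timeout to 180 seconds.
        echo "echo 180 > /sys/block/$name/device/timeout"
    done
}

timeout_cmds /dev/sda
```

Only one of the two commands is normally needed per drive, depending on
whether the drive accepts the SCT ERC setting.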

Cheers,
Wol


* Re: Fail to assemble raid4 with replaced disk
  2016-10-25 17:50 ` Wols Lists
@ 2016-10-27 14:11   ` Santiago DIEZ
  2016-10-31 15:57     ` Wols Lists
  0 siblings, 1 reply; 6+ messages in thread
From: Santiago DIEZ @ 2016-10-27 14:11 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux Raid LIST

Hi,

Indeed, here is what I had in terms of event count:
/dev/sda10: 81589
/dev/sdb10: 81626
/dev/sdc10: 81589

Then the following procedure worked quite straightforwardly:
--------------------------------------------------------------------------------
# mdadm --assemble /dev/md10 --verbose --force /dev/sda10 /dev/sdb10 /dev/sdc10
# mdadm --manage /dev/md10 --add /dev/sdd10
--------------------------------------------------------------------------------

And 6h+ later:
--------------------------------------------------------------------------------
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md10 : active raid5 sdd10[3] sda10[0] sdc10[2] sdb10[1]
      5778741888 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
--------------------------------------------------------------------------------
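
The [4/4] [UUUU] field above is what shows that all four members are up.
A minimal sketch of checking it mechanically follows; degraded_arrays is
a hypothetical helper, and it assumes the status field is the last one on
the "blocks" line, as in the output above.

```shell
#!/bin/sh
# Sketch only: degraded_arrays is a hypothetical helper. It reads
# /proc/mdstat-style text on stdin and prints the name of every array whose
# status field (for example [UU_U]) contains an underscore, which marks a
# missing member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]+\]/ {
            status = $NF
            if (index(status, "_") > 0) print name
        }
    '
}

# Possible usage:
#   degraded_arrays < /proc/mdstat
```

No output means every listed array reports all of its members as up.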

Then I ran:
--------------------------------------------------------------------------------
# e2fsck -f -n -t -v /dev/md10
e2fsck 1.42.5 (29-Jul-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

    15675837 inodes used (4.34%, out of 361177088)
      188798 non-contiguous files (1.2%)
       14751 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 15626455/47037/15
  1281308341 blocks used (88.69%, out of 1444685472)
           0 bad blocks
         101 large files

    15311457 regular files
      361754 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
        2607 symbolic links (2310 fast symbolic links)
          10 sockets
------------
    15675828 files
Memory used: 50976k/1912k (20541k/30436k), time: 1304.00/334.06/ 8.00
I/O read: 4891MB, write: 0MB, rate: 3.75MB/s
--------------------------------------------------------------------------------

Does it look OK enough to launch the mount?

Regards and thanks for your help
-------------------------
Santiago DIEZ
Quark Systems & CAOBA
23 rue du Buisson Saint-Louis, 75010 Paris
-------------------------

* Re: Fail to assemble raid4 with replaced disk
  2016-10-27 14:11   ` Santiago DIEZ
@ 2016-10-31 15:57     ` Wols Lists
  2017-02-02 13:33       ` Santiago DIEZ
  0 siblings, 1 reply; 6+ messages in thread
From: Wols Lists @ 2016-10-31 15:57 UTC (permalink / raw)
  To: Santiago DIEZ; +Cc: Linux Raid LIST

On 27/10/16 15:11, Santiago DIEZ wrote:
> Hi,
> 
> Indeed, here is what I had in terms of event count:
> /dev/sda10: 81589
> /dev/sdb10: 81626
> /dev/sdc10: 81589
> 
> Then the following procedure worked quite straightforwardly:
> --------------------------------------------------------------------------------
> # mdadm --assemble /dev/md10 --verbose --force /dev/sda10 /dev/sdb10 /dev/sdc10
> # mdadm --manage /dev/md10 --add /dev/sdd10
> --------------------------------------------------------------------------------
> 
> And 6h+ later:
> --------------------------------------------------------------------------------
> # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md10 : active raid5 sdd10[3] sda10[0] sdc10[2] sdb10[1]
>       5778741888 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> --------------------------------------------------------------------------------
> 
> Then I ran:
> --------------------------------------------------------------------------------
> # e2fsck -f -n -t -v /dev/md10
> e2fsck 1.42.5 (29-Jul-2012)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> 
>     15675837 inodes used (4.34%, out of 361177088)
>       188798 non-contiguous files (1.2%)
>        14751 non-contiguous directories (0.1%)
>              # of inodes with ind/dind/tind blocks: 0/0/0
>              Extent depth histogram: 15626455/47037/15
>   1281308341 blocks used (88.69%, out of 1444685472)
>            0 bad blocks
>          101 large files
> 
>     15311457 regular files
>       361754 directories
>            0 character device files
>            0 block device files
>            0 fifos
>            0 links
>         2607 symbolic links (2310 fast symbolic links)
>           10 sockets
> ------------
>     15675828 files
> Memory used: 50976k/1912k (20541k/30436k), time: 1304.00/334.06/ 8.00
> I/O read: 4891MB, write: 0MB, rate: 3.75MB/s
> --------------------------------------------------------------------------------
> 
> Does it look OK enough to launch the mount?
> 
Sorry - I've been away for the weekend - daughter's wedding :-)

But yes, that looks great. No errors on fsck either, I think :-)
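
One way to take the guesswork out of reading the fsck result is to look
at the exit status, which e2fsck(8) documents as a bitmask. The sketch
below decodes the commonly relevant bits; describe_fsck_status is a
hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: describe_fsck_status is a hypothetical helper that decodes
# an e2fsck exit status (a bitmask, per the e2fsck(8) manual page).
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ] && echo "errors corrected"
    [ $((status & 2)) -ne 0 ] && echo "errors corrected, reboot recommended"
    [ $((status & 4)) -ne 0 ] && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ] && echo "operational error"
    return 0
}

# Possible usage as root:
#   e2fsck -f -n /dev/md10; describe_fsck_status $?
```

With -n nothing is repaired, so a damaged filesystem would be expected to
report "errors left uncorrected" rather than "errors corrected".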

I think your array looks fine. Just look at the output from smartctl for
your old drives and make sure that it doesn't look like another drive is
going to fail soon. I'm not quite sure what to look for, mostly bad
blocks and relocates, I think, but if you compare it with your new drive
and stuff looks dodgy, you can always ask for help.
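
As a starting point for that check, the sketch below filters the attribute
table printed by "smartctl -A" for the counters usually associated with bad
and relocated sectors. smart_warnings is a hypothetical helper; the
attribute names and the position of the raw value (tenth column) are
assumptions that hold for many ATA drives but vary by vendor.

```shell
#!/bin/sh
# Sketch only: smart_warnings is a hypothetical helper. It reads the
# attribute table from "smartctl -A" on stdin and prints the sector-related
# attributes whose raw value (assumed to be column 10) is above zero.
smart_warnings() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) print $2, $10
        }
    '
}

# Possible usage as root:
#   smartctl -A /dev/sda | smart_warnings
```

No output means none of the three counters is above zero, which is not by
itself proof that a drive is healthy.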

Cheers,
Wol


* Re: Fail to assemble raid4 with replaced disk
  2016-10-31 15:57     ` Wols Lists
@ 2017-02-02 13:33       ` Santiago DIEZ
  0 siblings, 0 replies; 6+ messages in thread
From: Santiago DIEZ @ 2017-02-02 13:33 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux Raid LIST

Hi,

I never said THANKS.

Never too late ;o)

-------------------------
Santiago DIEZ
-------------------------
Quark Systems & CAOBA
23 rue du Buisson Saint-Louis, 75010 Paris
-------------------------

