* recovering failed and unrecognizable RAID5 during mdadm --grow without backup
From: Claudiu Rad @ 2016-05-12 6:22 UTC (permalink / raw)
To: linux-raid
hello all,
i am a desperate guy who 'successfully' made a chain of mistakes
leading to a real personal disaster. i need to try to recover as much
as i can, as total data loss is really not acceptable.
the short story: i have a poorly performing 4x4TB RAID5 (full
drives allocated to RAID5, apart from the small RAID1 partitions for boot) +
LVM. after reading a few articles on the internet, i decided i
should try some chunk size 'optimizations', having read that this can be
done with my version of mdadm and my kernel (the machine runs debian 7.9).
the mistakes:
1. no backup of 10TB of data. this is a remote rented
server, and i didn't have any easy way to do backups
2. i ran mdadm --grow -c 128 /dev/md2 and it complained about
--backup-file. i ran the command again with the file placed in
/root/...txt, which is on a partition inside the vg0 that fills
/dev/md2, thus defeating the purpose. the chunk size had been
automatically set to 512K before; i was trying to reduce it
3. the command returned almost immediately; i had no idea that
this would trigger a background process, although it is now obvious.
i then tried to see what it had done, but after one ls, a second ls in
the root partition hung. my web server panel (webmin) hung in
'waiting for...'; i tried connecting to a new shell and, after providing
credentials, it hung with no cursor. i assumed that my always-running
monitoring system and some other constant I/O processes running with
higher priority were clogging the system, which now had lower
throughput due to the parameter change, and that all I/O was saturated
because of this and maybe my experiments with the scheduler. actually the
nginx webserver seemed to be working properly and it had nice -10
attached, which led me to this conclusion. another mistake
4. after a few minutes of an unresponsive machine, i decided to send a soft
CTRL+ALT+DELETE restart signal from the datacenter control panel, but it
apparently didn't work, so i decided there was no way out of this
situation other than a hard restart (system reset). this was
my final and biggest mistake, not knowing that the array was reshaping.
the system won't boot and the datacenter's rescue (network boot) system
can't see/assemble the /dev/md2 array
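[A minimal sketch of how the state of the array could be checked before resetting a machine in this situation. The md driver reports the current background operation in the sysfs file sync_action, which reads "idle" when nothing is running and "reshape" while a --grow is in progress. The function names md_sync_state and md_safe_to_reboot are hypothetical, and the md2 default is simply the array name from this report.]

```shell
# Hypothetical helper: print the md sync state from a sysfs-style file.
# Defaults to the md2 array discussed here; pass another path to override.
md_sync_state() {
    local f="${1:-/sys/block/md2/md/sync_action}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# Succeeds (exit 0) only when the array reports no background operation.
md_safe_to_reboot() {
    [ "$(md_sync_state "$1")" = "idle" ]
}
```

For example, `md_safe_to_reboot || cat /proc/mdstat` would print what is still running instead of going ahead with a reset.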
i assume i really did my best to destroy a working array (well, apart from
not being satisfied with its performance and its apparent degradation over
time). in the rescue system, this is what i see so far:
root@rescue ~ # mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=rescue:0
UUID=63b58acc:19623c52:c1134929:5d592d29
ARRAY /dev/md/1 metadata=1.2 name=rescue:1
UUID=94713b26:3eca44bc:dee330c8:23443240
root@rescue ~ # mdadm --examine --scan
ARRAY /dev/md/0 metadata=1.2 UUID=63b58acc:19623c52:c1134929:5d592d29
name=rescue:0
ARRAY /dev/md/1 metadata=1.2 UUID=94713b26:3eca44bc:dee330c8:23443240
name=rescue:1
ARRAY /dev/md/2 metadata=1.2 UUID=a935894f:be435fc0:589c1c7f:d5454b43
name=rescue:2
(so here the array appears)
root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
523968 blocks super 1.2 [4/4] [UUUU]
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
16768896 blocks super 1.2 [4/4] [UUUU]
root@rescue ~ # mdadm --assemble --scan
mdadm: /dev/md/0 has been started with 4 drives.
mdadm: /dev/md/1 has been started with 4 drives.
mdadm: Failed to restore critical section for reshape, sorry.
Possibly you needed to specify the --backup-file
Segmentation fault
(this segmentation fault is weird)
root@rescue ~ # mdadm --assemble --scan --invalid-backup
mdadm: /dev/md/2: Need a backup file to complete reshape of this array.
mdadm: Please provided one with "--backup-file=..."
root@rescue ~ # mdadm -V
mdadm - v3.3.2 - 21st August 2014
now.. what is the best thing i can do to recover as much of my array as
possible? the backup file is trapped inside the / partition in the vg0 on
the array. i estimate the --grow had been running for
about 10 minutes when i forced the reboot. how can this be reconstructed
properly? i have broken it enough; i don't want to make any other move
without asking the experts.
please, help. this is my greatest nightmare :(
--
Claudiu
* Re: recovering failed and unrecognizable RAID5 during mdadm --grow without backup
From: Phil Turmel @ 2016-05-12 18:58 UTC (permalink / raw)
To: Claudiu Rad, linux-raid
On 05/12/2016 02:22 AM, Claudiu Rad wrote:
> hello all,
Please show the examine for the individual partitions of the raid5:
mdadm --examine /dev/sd[a-d]3
{ Replace the '3' if appropriate. You don't say what partition numbers
your raid5 is on. }
You will need to manually assemble (not create !) your array with a
backup file outside the raid5, and the --invalid-backup option to
abandon the backup file you can't get to. You will likely have some
unavoidable corruption at the reshape position due to this.
Phil
* Re: recovering failed and unrecognizable RAID5 during mdadm --grow without backup
From: Claudiu Rad-Lohanel @ 2016-05-12 20:09 UTC (permalink / raw)
To: Phil Turmel, linux-raid
On 5/12/2016 9:58 PM, Phil Turmel wrote:
> Please show the examine for the individual partitions of the raid5:
>
> mdadm --examine /dev/sd[a-d]3
>
root@rescue ~ # mdadm --examine /dev/sd[a-d]3
/dev/sda3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : a935894f:be435fc0:589c1c7f:d5454b43
Name : rescue:2 (local to host rescue)
Creation Time : Mon Apr 14 15:22:47 2014
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 7779167887 (3709.40 GiB 3982.93 GB)
Array Size : 11668750848 (11128.19 GiB 11948.80 GB)
Used Dev Size : 7779167232 (3709.40 GiB 3982.93 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=655 sectors
State : active
Device UUID : 9bd5271f:9cb24f1f:f27b2d29:71320066
Reshape pos'n : 49152 (48.01 MiB 50.33 MB)
New Chunksize : 64K
Update Time : Wed May 11 16:19:38 2016
Checksum : 286cd938 - correct
Events : 11526
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdb3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : a935894f:be435fc0:589c1c7f:d5454b43
Name : rescue:2 (local to host rescue)
Creation Time : Mon Apr 14 15:22:47 2014
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 7779167887 (3709.40 GiB 3982.93 GB)
Array Size : 11668750848 (11128.19 GiB 11948.80 GB)
Used Dev Size : 7779167232 (3709.40 GiB 3982.93 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=655 sectors
State : active
Device UUID : fe992c5f:cf125d01:9bb8e3f7:572aef37
Reshape pos'n : 49152 (48.01 MiB 50.33 MB)
New Chunksize : 64K
Update Time : Wed May 11 16:19:38 2016
Checksum : eb24325e - correct
Events : 11526
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : a935894f:be435fc0:589c1c7f:d5454b43
Name : rescue:2 (local to host rescue)
Creation Time : Mon Apr 14 15:22:47 2014
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 7779167887 (3709.40 GiB 3982.93 GB)
Array Size : 11668750848 (11128.19 GiB 11948.80 GB)
Used Dev Size : 7779167232 (3709.40 GiB 3982.93 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=655 sectors
State : active
Device UUID : 0eb93951:876cbbad:46c6004c:0101f3ca
Reshape pos'n : 49152 (48.01 MiB 50.33 MB)
New Chunksize : 64K
Update Time : Wed May 11 16:19:38 2016
Checksum : 70b08f7d - correct
Events : 11526
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x4
Array UUID : a935894f:be435fc0:589c1c7f:d5454b43
Name : rescue:2 (local to host rescue)
Creation Time : Mon Apr 14 15:22:47 2014
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 7779167887 (3709.40 GiB 3982.93 GB)
Array Size : 11668750848 (11128.19 GiB 11948.80 GB)
Used Dev Size : 7779167232 (3709.40 GiB 3982.93 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=655 sectors
State : active
Device UUID : 957d7ddb:dc6de4e7:feb6fb1f:7776adcc
Reshape pos'n : 49152 (48.01 MiB 50.33 MB)
New Chunksize : 64K
Update Time : Wed May 11 16:19:38 2016
Checksum : ad2bb8a - correct
Events : 11526
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
> You will need to manually assemble (not create !) your array with a
> backup file outside the raid5, and the --invalid-backup option to
> abandon the backup file you can't get to. You will likely have some
> unavoidable corruption at the reshape position due to this.
i am waiting for your input on this and how to continue. it seems that i
actually set the new chunk size to 64K, not 128K as i remembered.
clearly i wasn't thinking straight when i did all this..
should i be worried that the reshape position is so close to the beginning
of the volume? could the LVM vg0 metadata be lost? (i am just guessing; i
don't know much about how and where LVM stores info about its volumes).
the backup file is there, inside the array; if i could reach it somehow
i could feed it to mdadm and it would probably go well afterwards.
anyway, if just some data is lost, i don't care; what is really important
are some LVM volumes probably placed much further inside the array.
thank you phil!
--
jazzman
* Re: recovering failed and unrecognizable RAID5 during mdadm --grow without backup
From: Phil Turmel @ 2016-05-12 20:23 UTC (permalink / raw)
To: Claudiu Rad-Lohanel, linux-raid
On 05/12/2016 04:09 PM, Claudiu Rad-Lohanel wrote:
>
>
> On 5/12/2016 9:58 PM, Phil Turmel wrote:
>> Please show the examine for the individual partitions of the raid5:
>>
>> mdadm --examine /dev/sd[a-d]3
>>
>
> root@rescue ~ # mdadm --examine /dev/sd[a-d]3
> /dev/sda3:
Ok. Nothing outlandish.
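[One plausible sanity check on output like this, offered as an assumption rather than something stated in this thread: confirm that all members report the same value for fields such as Events and Reshape pos'n. In the output posted above, all four partitions show Events 11526 and Reshape pos'n 49152. The helper name examine_distinct is hypothetical; it only parses text and never touches the disks.]

```shell
# Hypothetical helper: read `mdadm --examine` output on stdin and print how many
# distinct values appear for one field (e.g. "Events" or "Reshape pos'n").
# Only the first word of the value is compared, so "49152 (48.01 MiB ...)"
# is treated as "49152".
examine_distinct() {
    awk -F' : ' -v key="$1" '
        { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
        name == key { split($2, v, " "); seen[v[1]] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    '
}
```

Usage would look like `mdadm --examine /dev/sd[a-d]3 | examine_distinct Events`; a result of 1 means every member agrees, and anything larger means the members disagree on that field.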
>> You will need to manually assemble (not create !) your array with a
>> backup file outside the raid5, and the --invalid-backup option to
>> abandon the backup file you can't get to. You will likely have some
>> unavoidable corruption at the reshape position due to this.
>
> i am waiting for your input on this and how to continue. it seems that i
> actually set new chunk size to 64K not 128K as i was remembering.
> clearly i wasn't with a clear mind when i did all this..
> should i be worried that reshape position is so at the beginning of the
> volume? maybe LVM vg0 metadata lost? (i am just assuming, don't know
> much about how and where LVM stores info about its volumes).
It just didn't get very far.
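[A small worked calculation of how far the reshape got, using only numbers from the --examine output above. The parenthesised sizes there (48.01 MiB for the reshape position, 11128.19 GiB for the array) imply that both figures are printed in KiB. The variable names are made up for illustration.]

```shell
# Values copied from the --examine output in this thread (both in KiB).
reshape_pos_kib=49152
array_size_kib=11668750848
data_disks=3            # 4-device raid5: one chunk per stripe holds parity
old_chunk_kib=512
new_chunk_kib=64

# Full stripes of array data covered by the reshape position, per layout.
old_stripes=$(( reshape_pos_kib / (data_disks * old_chunk_kib) ))
new_stripes=$(( reshape_pos_kib / (data_disks * new_chunk_kib) ))

# Share of the array already reshaped, in percent (awk for floating point).
percent=$(awk -v p="$reshape_pos_kib" -v a="$array_size_kib" \
    'BEGIN { printf "%.6f", p / a * 100 }')

echo "old-layout stripes: $old_stripes"
echo "new-layout stripes: $new_stripes"
echo "reshaped: ${percent}%"
```

So the reshape had covered roughly the first 48 MiB of an array of about 11128 GiB when the machine was reset.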
> the backup file is there, inside the array, if i could reach it somehow
> i could feed it to mdadm and would probably go well afterwards.
No way to get to it without assembling, and you can't assemble
error-free without it. Sorry.
> anyway, if just data is lost, i don't care, what are really important
> are some LVM volumes probably placed much further inside the array.
They are likely to be fine, then.
> thank you phil!
You're welcome.
You should mount your /boot array somewhere convenient, then:
mdadm -Av /dev/md3 --invalid-backup \
--backup-file=/mount/path/to/boot/newbackupfile \
/dev/sd[a-d]3
If that fails, repeat with the --force option included. If that fails,
show us everything it prints out.
If it succeeds, the reshape will be continuing in the background. While
that is going on, you may mount the array and grab backups of the most
critical content. Just in case :-)
It will probably take a very long time. Look at /proc/mdstat to check
the progress.
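[A minimal sketch for pulling just the progress figure out of /proc/mdstat. While a reshape runs, the kernel adds a progress line containing "reshape = N%" under the array's entry. The function name reshape_percent is hypothetical.]

```shell
# Hypothetical helper: print the reshape percentage found in an mdstat-style
# file, or nothing if no reshape line is present. Defaults to /proc/mdstat.
reshape_percent() {
    sed -n 's/.*reshape *= *\([0-9.]*\)%.*/\1/p' "${1:-/proc/mdstat}"
}
```

Running `watch -n 60 cat /proc/mdstat` gives the same information with the full context, including the kernel's own finish estimate.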
Phil