* recovering failed and unrecognizable RAID5 during mdadm --grow without backup
@ 2016-05-12  6:22 Claudiu Rad
  2016-05-12 18:58 ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Claudiu Rad @ 2016-05-12  6:22 UTC (permalink / raw)
  To: linux-raid

hello all,

i am a desperate guy who 'successfully' made a chain of mistakes
leading to a real personal disaster. i need to recover as much of this
as i can, as total data loss is really not acceptable.
the short story: i had a poorly performing 4x4TB RAID5 (full drives
allocated to RAID5, apart from the small RAID1 partitions for boot) +
LVM. after reading a few articles on the internet, i figured i should
try some chunk size 'optimizations', and read that this can be done
with my version of mdadm and my kernel (machine running debian 7.9).
the mistakes:

 1. no backup of 10TB of data. i am talking about a remote rented
    server, and i didn't have any easy way to do backups
 2. i ran mdadm --grow -c 128 /dev/md2 and it complained about
    --backup-file. i ran the command again with the file placed in
    /root/...txt, which is on a partition inside the vg0 filling
    /dev/md2, thus defeating the purpose. the chunk size had been
    automatically set to 512K before; i was trying to reduce it
 3. the command returned almost immediately; i had no idea that this
    would trigger a background process, although it is now obvious.
    i then tried to see what it had done, but after one ls, a second ls
    in the root partition hung. my web server panel (webmin) hung on
    'waiting for...'; i tried connecting to a new shell and, after
    providing credentials, it hung with no cursor. i thought that my
    always-running monitoring system and some other constant I/O
    processes running with higher priority were clogging the system,
    which now had lower throughput due to the parameter change, and
    that all I/O was saturated because of this and maybe my experiments
    with the scheduler. the nginx webserver actually seemed to be
    working properly, and it had nice -10 attached, which led me to
    this conclusion. another mistake
 4. after a few minutes with an unresponsive machine, i decided to send
    a soft CTRL+ALT+DELETE restart signal from the datacenter control
    panel, but it apparently didn't work. so i decided there was no way
    out of this situation other than a hard restart (system reset), and
    this was my final and biggest mistake, not knowing that the array
    was reshaping. the system won't boot and the datacenter's rescue
    (network boot) system can't see/assemble the /dev/md2 array
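
for reference, a minimal sketch of what the chunk size change in step 2
means for the array geometry. it assumes the usual RAID5 layout, where
each stripe holds one chunk per disk and one of those chunks is parity,
and that the 128 passed to -c is in KiB like the previous 512K. the
variable names are only illustrative.

```shell
# full-stripe data width for an N-disk RAID5: (N - 1) data chunks per stripe
disks=4             # 4x4TB array from the description above
old_chunk_kib=512   # chunk size before the --grow
new_chunk_kib=128   # chunk size requested with -c 128

old_stripe_kib=$(( (disks - 1) * old_chunk_kib ))
new_stripe_kib=$(( (disks - 1) * new_chunk_kib ))

echo "old full stripe: ${old_stripe_kib} KiB of data"
echo "new full stripe: ${new_stripe_kib} KiB of data"
```

because both values change, the reshape has to rewrite the data on every
member disk, so an interrupted run leaves the array partly in the old
layout and partly in the new one.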

i assume i really did my best to destroy a working array (well, apart
from not being satisfied with its performance and its apparent
degradation over time). in the rescue system, this is what i see so far:


root@rescue ~ # mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=63b58acc:19623c52:c1134929:5d592d29
ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=94713b26:3eca44bc:dee330c8:23443240

root@rescue ~ # mdadm --examine --scan
ARRAY /dev/md/0  metadata=1.2 UUID=63b58acc:19623c52:c1134929:5d592d29 name=rescue:0
ARRAY /dev/md/1  metadata=1.2 UUID=94713b26:3eca44bc:dee330c8:23443240 name=rescue:1
ARRAY /dev/md/2  metadata=1.2 UUID=a935894f:be435fc0:589c1c7f:d5454b43 name=rescue:2
(so here the array appears)

root@rescue ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
       523968 blocks super 1.2 [4/4] [UUUU]
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
       16768896 blocks super 1.2 [4/4] [UUUU]

root@rescue ~ # mdadm --assemble --scan
mdadm: /dev/md/0 has been started with 4 drives.
mdadm: /dev/md/1 has been started with 4 drives.
mdadm: Failed to restore critical section for reshape, sorry.
        Possibly you needed to specify the --backup-file
Segmentation fault
(this segmentation fault is weird)

root@rescue ~ # mdadm --assemble --scan --invalid-backup
mdadm: /dev/md/2: Need a backup file to complete reshape of this array.
mdadm: Please provided one with "--backup-file=..."

root@rescue ~ # mdadm -V
mdadm - v3.3.2 - 21st August 2014


now.. what is the best i can do to recover my array? the backup file is
actually trapped inside the / partition in the vg0 on the array. after
starting the --grow, i estimate it had been running for about 10 minutes
when i forced the reboot. how can this be reconstructed properly? i have
broken it enough; i don't want to make any other move without asking
experts.

please, help. this is my greatest nightmare :(

-- 
Claudiu



Thread overview: 9+ messages
2016-05-12  6:22 recovering failed and unrecognizable RAID5 during mdadm --grow without backup Claudiu Rad
2016-05-12 18:58 ` Phil Turmel
2016-05-12 20:09   ` Claudiu Rad-Lohanel
2016-05-12 20:23     ` Phil Turmel
     [not found]       ` <7cf56631-7909-6a92-f0b2-05dd02722ee8@misalpina.net>
2016-05-13 14:04         ` Phil Turmel
2016-05-13 14:11           ` Phil Turmel
2016-05-13 14:26             ` Claudiu Rad-Lohanel
2016-05-13 14:39               ` Andreas Klauer
2016-05-13 15:33                 ` Claudiu Rad-Lohanel
