Re: How to recover after md crash during reshape? - SOLVED/SUMMARY

From: Andras Tantos <andras@tantosonline.com>
To: Phil Turmel <philip@turmel.org>
Cc: Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: How to recover after md crash during reshape? - SOLVED/SUMMARY
Date: Tue, 3 Nov 2015 15:42:34 -0800
Message-ID: <5639466A.9050109@tantosonline.com>
In-Reply-To: <5633B2F0.9010902@turmel.org>

Thank you to everyone who helped me solve my problem, especially Phil
Turmel, to whom I am in debt for the rest of my life. Right now my family
photos - and my marriage - are safe.

For people who might be interested in the future, here's a quick
summary of the events and the recovery:

Trouble:
==========

I was extending a RAID6 array from 7 disks to 10. The reshape crashed
early in the process. After a reboot, the array wouldn't re-assemble and
mdadm printed this message:

     mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
     superblocks.
           If they are really different, please --zero the superblock on one
           If they are the same or overlap, please remove one from the
           DEVICE list in mdadm.conf.

What I SHOULD have done here is remove /dev/sda from the DEVICE list in
mdadm.conf and then run mdadm --grow --continue /dev/md1 --backup-file .....
What I actually did was zero the superblock of /dev/sda1.

The same message appeared for the other two new HDDs in the array as
well. By the time I had zeroed the superblocks of all three new disks,
the array assembled but didn't start because it was missing three drives.

Recovery:
===========
1. Look at the partitions listed in /proc/mdstat for the array.
2. For each member of the array, run mdadm -E <disk name from the array>
3. Note all the parameters, especially these: 'Chunk Size', 'Raid
Level', 'Version'
4. Make sure all remaining disks show the same event count ('Events'),
report a correct checksum, and agree on all the parameters above.
5. Note the order of the disks in the array. You can find each disk's
position in this line of the mdadm -E output:

            Number   Major   Minor   RaidDevice State
      this     6       8       98        6      active sync

6. If all matches, stop the array:
     mdadm --stop /dev/md1

7. Re-create your array as follows:
     mdadm --create --assume-clean --verbose \
         --metadata=1.0 --raid-devices=7 --chunk=64 --level=6 \
         /dev/md1 <list of devices in the exact order from note 5 above>

     Replace the number of devices, chunk size and RAID level with the
     values from note 3 above. I had to specify metadata version 0.9,
     which was my original metadata version (as reported by the 'Version'
     parameter in point 3 above). YMMV.

8. If all goes well, the array will now re-assemble with the original 7 
disks. The data on the array is corrupted up to the point where the 
reshape stopped, so...
9. Run fsck -n /dev/md1 to assess the damage. If it doesn't look
terrible, fix the errors: fsck -y /dev/md1.
10. Mount the array and rejoice in the data that's recovered.
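
Steps 2 to 4 are easy to get wrong by eye when there are seven disks to
compare, so here is a rough sketch of how they could be scripted. The
function names are my own invention (they are not part of mdadm), and the
sketch assumes the field labels look the way they are quoted above
('Version', 'Raid Level', 'Chunk Size', 'Events', 'Checksum' and the
'this' line). Treat it as a starting point, not a tested recovery tool.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of mdadm.

# Read one device's `mdadm -E` output on stdin and print just the lines
# that steps 3-5 ask you to note.
examine_summary() {
    grep -E '^[[:space:]]*(Version|Raid Level|Chunk Size|Events|Checksum|this)[[:space:]]'
}

# Succeed only if every file given (each holding one device's saved
# `mdadm -E` output) contains an 'Events' line and all values are equal.
events_match() {
    events=$(
        for f in "$@"; do
            sed -n 's/^[[:space:]]*Events[[:space:]]*:[[:space:]]*//p' "$f"
        done
    )
    # Count non-empty lines; `|| true` keeps a zero count from aborting
    # the caller when it runs under `set -e`.
    total=$(printf '%s\n' "$events" | grep -c . || true)
    distinct=$(printf '%s\n' "$events" | sort -u | grep -c . || true)
    [ "$total" -eq "$#" ] && [ "$distinct" -eq 1 ]
}
```

To use it, save the output for each member first, for example
mdadm -E /dev/sdX1 > sdX1.txt (the device name is a placeholder), then run
examine_summary < sdX1.txt to read the values and events_match *.txt to
confirm the event counts agree before you even think about --create.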

Final notes:
===============
I still don't know the root cause of the crash. What I did notice is 
that this particular (Core2 duo) system seems to become unstable with 
more than 9 HDDs. It doesn't seem to be a power supply issue as it has 
trouble even if about half of the drives are supplied from a second PSU.

Version 0.9 metadata has some problems, causing the misleading message 
in the first place. Upgrading to version 1.0 metadata is a good idea.

If you use desktop or green drives in your array, fix the short kernel 
timeout on SATA devices (30s). Issue this on every boot:
     for x in /sys/block/*/device/timeout ; do echo 180 > "$x" ; done
If you don't do that, the first unrecoverable read error will degrade 
your array instead of simply relocating the failing sector on the hard 
drive.
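
If you want something a bit more careful than the one-liner, the same
idea can be wrapped in a function. This is only a sketch: the function
name and the SYSFS_ROOT variable are made up for illustration (SYSFS_ROOT
lets you try it against a scratch directory first), and 180 is simply the
value used above.

```shell
#!/bin/sh
# Sketch only: set the command timeout for every block device that has one.
# SYSFS_ROOT is a hypothetical override for dry runs; leave it unset on a
# real system and run as root.
set_disk_timeouts() {
    seconds=${1:-180}
    root=${SYSFS_ROOT:-/sys}
    for t in "$root"/block/*/device/timeout; do
        # Skips devices without a timeout file, and the literal pattern
        # if the glob matched nothing.
        [ -w "$t" ] || continue
        echo "$seconds" > "$t"
    done
}
```

Call set_disk_timeouts (or set_disk_timeouts 180) from whatever boot-time
hook your distribution provides, since the setting does not survive a
reboot.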

To find and fix unrecoverable read errors on your array, regularly issue:
     echo check >/sys/block/md0/md/sync_action
This is a looooong operation on a large RAID6 array, but makes sure that 
bad sectors don't accumulate in seldom-accessed corners and destroy your 
array at the worst possible time.
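
A small sketch of how the scrub could be wrapped so it doesn't trample on
a resync or reshape that is already running. The function names and the
SYSFS_ROOT override are hypothetical; sync_action and mismatch_cnt are the
md sysfs files, and I am assuming the array reports "idle" in sync_action
when nothing is in progress.

```shell
#!/bin/sh
# Sketch only: helper names and SYSFS_ROOT are hypothetical.

# Start a check on the given md array (default md0), but only if it is idle.
start_md_check() {
    md=${1:-md0}
    dir=${SYSFS_ROOT:-/sys}/block/$md/md
    state=$(cat "$dir/sync_action")
    if [ "$state" != "idle" ]; then
        echo "$md is busy ($state), not starting a check" >&2
        return 1
    fi
    echo check > "$dir/sync_action"
}

# Print the mismatch counter; it is only meaningful once the check is done.
md_mismatch_count() {
    cat "${SYSFS_ROOT:-/sys}/block/${1:-md0}/md/mismatch_cnt"
}
```

Run start_md_check md0 from a periodic job, watch progress in
/proc/mdstat, and look at md_mismatch_count md0 after it finishes.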

Andras
