From: Andras Tantos <andras@tantosonline.com>
To: Phil Turmel <philip@turmel.org>
Cc: Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: How to recover after md crash during reshape? - SOLVED/SUMMARY
Date: Tue, 3 Nov 2015 15:42:34 -0800
Message-ID: <5639466A.9050109@tantosonline.com>
In-Reply-To: <5633B2F0.9010902@turmel.org>
Thank you to all who helped me solve my problem, especially Phil Turmel,
to whom I am in debt for the rest of my life. Right now my family photos -
and my marriage - are safe.
For people who might be interested in the future, here's a quick
summary of the events and the recovery:
Trouble:
==========
I was going to extend a RAID6 array from 7 disks to 10. The reshape
crashed early in the process. After reboot, the array wouldn't
re-assemble, and mdadm printed this error message:
mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar
superblocks.
If they are really different, please --zero the superblock on one
If they are the same or overlap, please remove one from the
DEVICE list in mdadm.conf.
What I SHOULD have done here is remove /dev/sda from the DEVICE list in
mdadm.conf, followed by mdadm --grow --continue /dev/md1 --backup-file .....
What I did instead was zero the superblock of /dev/sda1.
The same message appeared for the other two new HDDs in the array as
well. By the time I had zeroed the superblocks of all three new disks, the
array assembled but didn't start because it was missing three drives.
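Before touching any superblock, it may help to look at which devices
mdadm is told to scan. The following is only a sketch: the function name
show_device_lines is made up for illustration, the default config path is
an assumption (it is commonly /etc/mdadm/mdadm.conf or /etc/mdadm.conf,
depending on the distribution), and the DEVICE line in the comment is an
example, not the line from my system.

```shell
#!/bin/sh
# Print the DEVICE lines of an mdadm.conf so you can see what mdadm scans.
# show_device_lines is a hypothetical helper; the default path is an assumption.
show_device_lines() {
    conf="${1:-/etc/mdadm/mdadm.conf}"
    # Keywords start in column 0 and are case-insensitive.
    grep -i '^device' "$conf"
}

# Usage (read-only):
#   show_device_lines /etc/mdadm.conf
# An illustrative DEVICE line that names only partitions, not whole disks:
#   DEVICE /dev/sd?1
```

This only reads the file; editing the DEVICE line is still a manual step.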
Recovery:
===========
1. Look at the partitions listed in /proc/mdstat for the array.
2. For each of the constituents of the array, do mdadm -E <disk name
from the array>
3. Note all the parameters, especially these: 'Chunk Size', 'Raid
Level', 'Version'
4. Make sure all remaining disks show the same event count ('Events'),
that they have a correct checksum, and that all the above parameters match.
5. Note the order of the disks in the array. You can find that in this line:
      Number   Major   Minor   RaidDevice State
this     6       8       98        6      active sync
6. If all matches, stop the array:
mdadm --stop /dev/md1
7. Re-create your array as follows:
mdadm --create --assume-clean --verbose \
--metadata=1.0 --raid-devices=7 --chunk=64 --level=6 \
/dev/md1 <list of devices in the exact order from note 5 above>
Replace the number of devices, chunk size, raid level and metadata
version with the values from note 3 above. For me, I had to specify
metadata version 0.9, which was my original metadata version (as
reported by the 'Version' parameter in point 3 above). YMMV.
8. If all goes well, the array will now re-assemble with the original 7
disks. The data on the array is corrupted up to the point where the
reshape stopped, so...
9. fsck -n /dev/md1 to assess the damage. If it doesn't look terrible, fix
the errors: fsck -y /dev/md1.
10. Mount the array and rejoice in the data that's recovered.
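Steps 2 to 5 are easy to get wrong by eye, so here is a rough sketch of
how the comparison could be scripted. The helper names examine_summary
and all_same are made up for illustration, and the device names in the
usage comments are placeholders; use the members listed in /proc/mdstat.
Nothing here writes to the disks.

```shell
#!/bin/sh
# Keep only the fields from `mdadm -E` output that steps 3-5 ask you to note.
# examine_summary is a hypothetical helper name; it reads the report on stdin.
examine_summary() {
    grep -E '^[[:space:]]*(Version|Raid Level|Chunk Size|Events|Checksum|this)[[:space:]]'
}

# Succeed only if every value read from stdin (one per line) is identical.
all_same() {
    [ "$(sort -u | wc -l)" -eq 1 ]
}

# Usage (read-only; /dev/sdb1 and /dev/sdc1 are placeholder device names):
#   for d in /dev/sdb1 /dev/sdc1; do
#       echo "== $d"; mdadm -E "$d" | examine_summary
#   done
#   for d in /dev/sdb1 /dev/sdc1; do
#       mdadm -E "$d" | awk '$1 == "Events" {print $3}'
#   done | all_same && echo "event counts match"
```

If the event counts differ, stop and ask on the list before re-creating
anything.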
Final notes:
===============
I still don't know the root cause of the crash. What I did notice is
that this particular (Core2 duo) system seems to become unstable with
more than 9 HDDs. It doesn't seem to be a power supply issue as it has
trouble even if about half of the drives are supplied from a second PSU.
Version 0.9 metadata has some problems, causing the misleading message
in the first place. Upgrading to version 1.0 metadata is a good idea.
If you use desktop or green drives in your array, fix the short kernel
timeout on SATA devices (30s). Issue this on every boot:
for x in /sys/block/*/device/timeout ; do echo 180 > "$x" ; done
If you don't do that, the first unrecoverable read error will degrade
your array instead of simply relocating the failing sector on the hard
drive.
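Since the timeout has to be set again on every boot, the one-liner can be
wrapped in a small function and called from a boot-time script. This is a
sketch under assumptions: set_disk_timeouts is a made-up name, and the
optional second argument (the sysfs root) is there only so the function
can be tried on a scratch directory instead of the real /sys.

```shell
#!/bin/sh
# Raise the kernel command timeout for every block device that exposes one.
# set_disk_timeouts is a hypothetical helper; the optional second argument
# (sysfs root) exists only so the function can be tried on a scratch directory.
set_disk_timeouts() {
    secs="${1:-180}"
    root="${2:-/sys}"
    for f in "$root"/block/*/device/timeout; do
        [ -w "$f" ] || continue      # skip devices without a writable timeout
        echo "$secs" > "$f"
    done
}

# Usage (as root, e.g. from a boot-time script):
#   set_disk_timeouts 180
```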
To find and fix unrecoverable read errors on your array, regularly issue:
echo check >/sys/block/md0/md/sync_action
This is a looooong operation on a large RAID6 array, but makes sure that
bad sectors don't accumulate in seldom-accessed corners and destroy your
array at the worst possible time.
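For a scheduled job, the check can be wrapped the same way. Again this is
only a sketch: start_md_check is a made-up name, md0 is just the example
array from above, and the sysfs root argument exists only for trying the
function on a scratch directory.

```shell
#!/bin/sh
# Start a scrub ("check") on one md array and show how to read the result.
# start_md_check is a hypothetical helper; the sysfs root argument is only
# there so the function can be exercised against a scratch directory.
start_md_check() {
    md="${1:-md0}"
    root="${2:-/sys}"
    echo check > "$root/block/$md/md/sync_action"
}

# Usage (as root):
#   start_md_check md0
#   cat /proc/mdstat                      # progress of the running check
#   cat /sys/block/md0/md/mismatch_cnt    # mismatches counted by the check
```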
Andras