From: Bob Brand <brand@wmawater.com.au>
To: <linux-raid@vger.kernel.org>
Subject: Failed mdadm RAID array after aborted Grow operation
Date: Sun, 8 May 2022 23:18:32 +1000 (AEST)
Message-ID: <00ae01d862de$1d336980$579a3c80$@wmawater.com.au>
Hi,
I'm somewhat new to Linux and mdadm, although I've certainly learnt a lot
over the last 24 hours.
I have a SuperMicro server running CentOS 7 (3.10.0-1160.11.1.el7.x86_64)
with mdadm version 4.1 (2018-10-01) that was happily running with
30 x 8 TB disks in a RAID6 configuration. (It also has boot and root on a
RAID1 array; the RAID6 array is solely for data.) It was, however,
starting to run out of space, so I investigated adding more drives to the
array (it can hold a total of 45 drives).
Since this device is no longer under support, obtaining the same drives as
it already contained wasn't an option, and the supplier couldn't guarantee
that they could supply compatible drives. We did come to an arrangement
where I would try one drive and, if it didn't work, I could return any
unopened units.
I spent ages ensuring that the ones he'd suggested were as compatible as
possible, basing the specs of the existing drives on the invoice for the
entire system. This turned out to be a mistake: the invoice stated they
were 512e drives but, as I discovered after the new drives had arrived and
I was doing a final check, the existing drives were actually 4Kn
(4096-byte sector) drives. Of course the new drives were 512e. Bother!
After a lot more reading I found out that it might be possible to reformat
the new drives from 512e to 4Kn using sg_format.
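For anyone following along, the mismatch can be spotted up front by
comparing logical and physical sector sizes. A minimal sketch (the device
names and sizes below are invented; on a live system the two size columns
come from lsblk -d -o NAME,LOG-SEC,PHY-SEC):

```shell
# Classify drives as 512e (512 logical / 4096 physical), 4Kn (4096/4096)
# or 512n (512/512) from lsblk-style output. A captured sample with
# made-up values keeps the sketch self-contained.
sample='sda 4096 4096
sdab 512 4096'
printf '%s\n' "$sample" | awk '{
  print $1, (($2 == 512 && $3 == 4096) ? "512e" : ($2 == 4096 ? "4Kn" : "512n"))
}'
```

Run against real lsblk output, this makes the 512e-vs-4Kn discrepancy
visible before any drive is ordered or formatted.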
I installed the test drive and proceeded to see if it was possible to
format it to 4096-byte sectors using the command
sg_format --format --size=4096 /dev/sd<x>.
All was proceeding smoothly when my ssh session terminated, due to a
faulty docking station killing my Ethernet connection.
So I logged onto the console and restarted the sg_format, which completed
OK, sort of: it did convert the disk to 4096-byte sectors, but it threw an
I/O error or two. They didn't seem too concerning, and I figured that if
there was a problem it would show up in the next couple of steps. I've
since discovered the dmesg log, which indicated that there were
significantly more I/O errors than I thought.
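In hindsight, a quick count of kernel I/O errors against the suspect disk
would have quantified the problem. A sketch (the device name sdab and the
log lines are invented; on the live box you would pipe dmesg in instead of
the sample log):

```shell
# Count I/O errors attributed to one device in a kernel log excerpt.
log='blk_update_request: I/O error, dev sdab, sector 1234
ata5.00: exception Emask 0x0 SAct 0x0
blk_update_request: I/O error, dev sdab, sector 5678'
printf '%s\n' "$log" | grep -c 'I/O error, dev sdab'
```

Anything more than a handful of hits is a strong signal not to add the
drive to an array.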
Anyway, since sg_format appeared to complete OK, I moved on to the next
stage, which was to partition the disk with the following commands:
parted -a optimal /dev/sd<x>
(parted) mklabel msdos
(parted) mkpart primary 2048s 100% (need to check that the start is
correct)
(parted) align-check optimal 1 (verify alignment of partition 1)
(parted) set 1 raid on (set the FLAG to RAID)
(parted) print
Unfortunately, I don't have the results of the print command, as my laptop
unexpectedly shut down overnight (it hasn't been a good weekend), but the
partitioning appeared to complete without incident.
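The interactive parted session above can also be replayed non-interactively
with -s, and practised against a file-backed image first so no real disk
is touched. A sketch; the image path and size are arbitrary:

```shell
# Rehearse the exact partitioning steps on a throwaway image file.
truncate -s 100M /tmp/practice.img
parted -s /tmp/practice.img mklabel msdos
parted -s /tmp/practice.img mkpart primary 2048s 100%
parted -s /tmp/practice.img align-check optimal 1
parted -s /tmp/practice.img set 1 raid on
parted -s /tmp/practice.img print
rm -f /tmp/practice.img
```

Once the print output looks right on the image, the same commands can be
pointed at the real /dev/sd<x>.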
I then added the new disk to the array:
mdadm --add /dev/md125 /dev/sd<x>
And it completed without any problems.
I then proceeded to grow the array:
mdadm --grow --raid-devices=31 --backup-file=/grow_md125.bak /dev/md125
I monitored this with cat /proc/mdstat and it showed that it was
reshaping, but the speed was 0K/sec and the reshape didn't progress
from 0%.
# cat /proc/mdstat produced:
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active raid6 sdab1[30] sdw1[26] sdc1[6] sdm1[16] sdi1[12]
sdz1[29] sdh1[11] sdg1[10] sds1[22] sdf1[9] sdq1[20] sdaa1[1] sdo1[18]
sdu1[24] sdb1[5] sdae1[4] sdl1[15] sdj1[13] sdn1[17] sdp1[19] sdv1[25]
sde1[8] sdd1[7] sdr1[21] sdt1[23] sdx1[27] sdad1[3] sdac1[2] sdy1[28]
sda1[0] sdk1[14]
218789036032 blocks super 1.2 level 6, 512k chunk, algorithm 2
[31/31] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
[>....................] reshape = 0.0% (1/7813894144)
finish=328606806584.3min speed=0K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk
md126 : active raid1 sdaf1[0] sdag1[1]
100554752 blocks super 1.2 [2/2] [UU]
bitmap: 1/1 pages [4KB], 65536KB chunk
md127 : active raid1 sdaf3[0] sdag2[1]
976832 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
# mdadm --detail /dev/md125 produced:
/dev/md125:
Version : 1.2
Creation Time : Wed Sep 13 15:09:40 2017
Raid Level : raid6
Array Size : 218789036032 (203.76 TiB 224.04 TB)
Used Dev Size : 7813894144 (7.28 TiB 8.00 TB)
Raid Devices : 31
Total Devices : 31
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun May 8 00:47:35 2022
State : clean, reshaping
Active Devices : 31
Working Devices : 31
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Reshape Status : 0% complete
Delta Devices : 1, (30->31)
Name : localhost.localdomain:SW-RAID6
UUID : f9b65f55:5f257add:1140ccc0:46ca6c19
Events : 1053617
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 65 161 1 active sync /dev/sdaa1
2 65 193 2 active sync /dev/sdac1
3 65 209 3 active sync /dev/sdad1
4 65 225 4 active sync /dev/sdae1
5 8 17 5 active sync /dev/sdb1
6 8 33 6 active sync /dev/sdc1
7 8 49 7 active sync /dev/sdd1
8 8 65 8 active sync /dev/sde1
9 8 81 9 active sync /dev/sdf1
10 8 97 10 active sync /dev/sdg1
11 8 113 11 active sync /dev/sdh1
12 8 129 12 active sync /dev/sdi1
13 8 145 13 active sync /dev/sdj1
14 8 161 14 active sync /dev/sdk1
15 8 177 15 active sync /dev/sdl1
16 8 193 16 active sync /dev/sdm1
17 8 209 17 active sync /dev/sdn1
18 8 225 18 active sync /dev/sdo1
19 8 241 19 active sync /dev/sdp1
20 65 1 20 active sync /dev/sdq1
21 65 17 21 active sync /dev/sdr1
22 65 33 22 active sync /dev/sds1
23 65 49 23 active sync /dev/sdt1
24 65 65 24 active sync /dev/sdu1
25 65 81 25 active sync /dev/sdv1
26 65 97 26 active sync /dev/sdw1
27 65 113 27 active sync /dev/sdx1
28 65 129 28 active sync /dev/sdy1
29 65 145 29 active sync /dev/sdz1
30 65 177 30 active sync /dev/sdab1
NOTE: the new disk is /dev/sdab
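The mdstat output above also shows why the finish estimate is absurd: with
the speed pinned at 0K/sec, only 1 of 7813894144 blocks had been reshaped.
Recomputing the percentage directly from those two numbers (copied from
the output above) confirms the reshape never really started:

```shell
# Reshape progress from the raw mdstat counters: done/total blocks.
awk 'BEGIN { done = 1; total = 7813894144
  printf "reshape = %.6f%%\n", 100 * done / total }'
```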
About 12 hours later, as the reshape hadn't progressed from 0%, I looked
at ways of aborting it, such as mdadm --stop /dev/md125, which didn't
work, so I ended up rebooting the server, and this is where things really
went pear-shaped.
The server came up in emergency mode, which I found odd given that the
boot and root should have been OK.
I was able to log on as root OK, but the RAID6 array was stuck in the
reshape state.
I then tried:
mdadm --assemble --update=revert-reshape --backup-file=/grow_md125.bak --verbose --uuid=f9b65f55:5f257add:1140ccc0:46ca6c19 /dev/md125
and this produced:
mdadm: No super block found on /dev/sde (Expected magic a92b4efc, got
<varying numbers>)
mdadm: No RAID super block on /dev/sde
.
.
mdadm: /dev/sde1 is identified as a member of /dev/md125, slot 6
.
.
mdadm: /dev/md125 has an active reshape - checking if critical
section needs to be restored
mdadm: No backup metadata on /grow_md125.back
mdadm: Failed to find backup of critical section
mdadm: Failed to restore critical section for reshape, sorry.
I've tried different variations on this, including mdadm --assemble
--invalid-backup --force, but I won't include all the different commands
here because I'm having to type all this by hand, since I can't copy
anything off the server while it's in Emergency Mode.
I have also removed the suspect disk but this hasn't made any difference.
But the closest I've come to fixing this is running mdadm /dev/md125
--assemble --invalid-backup --backup-file=/grow_md125.bak --verbose
/dev/sdc1 /dev/sdd1 ....... /dev/sdaf1 and this produces:
.
.
.
mdadm: /dev/sdaf1 is identified as a member of /dev/md125, slot 4.
mdadm: /dev/md125 has an active reshape - checking if critical
section needs to be restored
mdadm: No backup metadata on /grow_md125.back
mdadm: Failed to find backup of critical section
mdadm: continuing without restoring backup
mdadm: added /dev/sdac1 to /dev/md125 as 1
.
.
.
mdadm: failed to RUN_ARRAY /dev/md125: Invalid argument
dmesg has this information:
md: md125 stopped.
md/raid:md125: reshape_position too early for auto-recovery -
aborting.
md: pers->run() failed ...
md: md125 stopped.
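When assembly keeps failing like this, a standard first check (not shown
in the session above) is whether the member superblocks still agree on the
event count, since --force is only reasonably safe when they do. A sketch
over saved mdadm --examine output; the sample values below are invented,
and on the live system the input would come from something like
mdadm --examine /dev/sd?1 | grep Events:

```shell
# Tally distinct Events counters across member superblocks; more than one
# distinct value means the members have diverged.
sample='Events : 1053617
Events : 1053617
Events : 1053610'
printf '%s\n' "$sample" | awk '{ print $3 }' | sort | uniq -c
```

Capturing the full mdadm --examine output for every member before any
further recovery attempt also preserves the reshape positions, which the
list will likely ask for.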
If you've stuck with me and read all this way, thank you, and I hope you
can help me.
Regards,
Bob Brand