* 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
@ 2021-06-25 12:08 Jason Flood
2021-06-25 13:59 ` Phil Turmel
0 siblings, 1 reply; 7+ messages in thread
From: Jason Flood @ 2021-06-25 12:08 UTC (permalink / raw)
To: linux-raid
I started with a 4x4TB-disk RAID5 array and, over a few years, changed all
the drives to 8TB (WD Red - I hadn't seen the warnings before now, but it
looks like these ones are OK). I then successfully migrated it to RAID6, but
it was left with a non-standard layout, so I ran:
sudo mdadm --grow /dev/md0 --raid-devices=4
--backup-file=/root/raid5backup --layout=normalize
After a few days it reached 99% complete, but then the "hours remaining"
counter started counting up. A few days later I had to power the system down
before I could get a backup of the non-critical data (I couldn't get hold of
enough storage quickly enough, but it wouldn't be catastrophic to lose it),
and now the four drives are in standby, with the array thinking it is RAID0.
Running:
sudo mdadm --assemble /dev/md0 /dev/sd[bcde]
responds with:
mdadm: /dev/md0 assembled from 4 drives - not enough to start the
array while not clean - consider --force.
It appears to be similar to https://marc.info/?t=155492912100004&r=1&w=2,
but before trying --force I was considering using overlay files, as I'm not
sure of the risk of damage. The set-up process documented in the
"Recovering a damaged RAID" wiki article is excellent; however, the latter
part of the process isn't clear to me. If successful, are the overlay files
written back to the disks like a virtual machine snapshot, or is the process
stopped, the overlays removed and the process repeated, knowing that it now
has a low risk of damage?
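(For reference, the set-up half from the wiki boils down to roughly the
following per member disk - the paths, names and size here are only
placeholders for illustration:
truncate -s 8001563222016 /overlays/sdb.ovl    # sparse copy-on-write file
loop=$(losetup --find --show /overlays/sdb.ovl)
size=$(blockdev --getsz /dev/sdb)
echo "0 $size snapshot /dev/sdb $loop P 8" | dmsetup create sdb-overlay
and then assembling against the /dev/mapper/*-overlay devices instead of the
real disks, so all writes land in the sparse files and the disks stay
untouched.)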
System details follow. Thanks for any help.
=================================================================================
user@host:~$ uname -a
Linux conan 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021
x86_64 x86_64 x86_64 GNU/Linux
=================================================================================
user@host:~$ mdadm --version
mdadm - v4.1 - 2018-10-01
=================================================================================
user@host:~$ sudo smartctl -H -i -l scterc /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-74-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO M.2 250GB
Serial Number: S413NX0K707647T
LU WWN Device Id: 5 002538 e40528ae8
Firmware Version: RVT21B6Q
User Capacity: 250,059,350,016 bytes [250 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jun 20 10:44:10 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
=================================================================================
user@host:~$ sudo smartctl -H -i -l scterc /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-74-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD80EFAX-68KNBN0
Serial Number: VGJM3NXK
LU WWN Device Id: 5 000cca 0bee4dfda
Firmware Version: 81.00A81
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jun 20 10:44:10 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
=================================================================================
user@host:~$ sudo smartctl -H -i -l scterc /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-74-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD80EFBX-68AZZN0
Serial Number: VRG5YT4K
LU WWN Device Id: 5 000cca 0c2c2b5a4
Firmware Version: 85.00A85
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jun 20 10:44:11 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
=================================================================================
user@host:~$ sudo smartctl -H -i -l scterc /dev/sdd
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-74-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD80EFAX-68KNBN0
Serial Number: VAGV1WLL
LU WWN Device Id: 5 000cca 099cbd8be
Firmware Version: 81.00A81
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jun 20 10:44:12 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
=================================================================================
user@host:~$ sudo smartctl -H -i -l scterc /dev/sde
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-74-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD80EFAX-68LHPN0
Serial Number: 7SJ5W2KW
LU WWN Device Id: 5 000cca 252deda87
Firmware Version: 83.H0A83
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jun 20 10:44:12 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
=================================================================================
user@host:~$ sudo mdadm --examine /dev/sdb
/dev/sdb:
Magic : a92b4efc
Version : 1.2
Feature Map : 0xd
Array UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
Name : Universe:0
Creation Time : Thu Jul 13 01:11:22 2017
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 15627794096 (7451.91 GiB 8001.43 GB)
Array Size : 15627793408 (14903.83 GiB 16002.86 GB)
Used Dev Size : 15627793408 (7451.91 GiB 8001.43 GB)
Data Offset : 259072 sectors
Super Offset : 8 sectors
Unused Space : before=258992 sectors, after=688 sectors
State : active
Device UUID : eee9201e:d9769906:b6dccda1:b1f35abe
Internal Bitmap : 8 sectors from superblock
Reshape pos'n : 39006208 (37.20 GiB 39.94 GB)
Delta Devices : -1 (5->4)
New Layout : left-symmetric
Update Time : Fri Jun 18 08:56:43 2021
Bad Block Log : 512 entries available at offset 24 sectors - bad blocks present.
Checksum : de1db60e - correct
Events : 184251
Layout : left-symmetric-6
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)
=================================================================================
user@host:~$ sudo mdadm --examine /dev/sdc
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0xd
Array UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
Name : Universe:0
Creation Time : Thu Jul 13 01:11:22 2017
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 15627794096 (7451.91 GiB 8001.43 GB)
Array Size : 15627793408 (14903.83 GiB 16002.86 GB)
Used Dev Size : 15627793408 (7451.91 GiB 8001.43 GB)
Data Offset : 259072 sectors
Super Offset : 8 sectors
Unused Space : before=258992 sectors, after=688 sectors
State : active
Device UUID : 7ed45d83:84db8f79:e3aadf4b:a88212d1
Internal Bitmap : 8 sectors from superblock
Reshape pos'n : 39006208 (37.20 GiB 39.94 GB)
Delta Devices : -1 (5->4)
New Layout : left-symmetric
Update Time : Fri Jun 18 08:56:43 2021
Bad Block Log : 512 entries available at offset 24 sectors - bad blocks present.
Checksum : 731a6e9f - correct
Events : 184251
Layout : left-symmetric-6
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)
=================================================================================
user@host:~$ sudo mdadm --examine /dev/sdd
/dev/sdd:
Magic : a92b4efc
Version : 1.2
Feature Map : 0xd
Array UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
Name : Universe:0
Creation Time : Thu Jul 13 01:11:22 2017
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 15627794096 (7451.91 GiB 8001.43 GB)
Array Size : 15627793408 (14903.83 GiB 16002.86 GB)
Used Dev Size : 15627793408 (7451.91 GiB 8001.43 GB)
Data Offset : 259072 sectors
Super Offset : 8 sectors
Unused Space : before=258992 sectors, after=688 sectors
State : active
Device UUID : 015b3ea0:9b3a38d2:a860f58a:34c19985
Internal Bitmap : 8 sectors from superblock
Reshape pos'n : 39006208 (37.20 GiB 39.94 GB)
Delta Devices : -1 (5->4)
New Layout : left-symmetric
Update Time : Fri Jun 18 08:56:43 2021
Bad Block Log : 512 entries available at offset 24 sectors - bad blocks present.
Checksum : dc4048b8 - correct
Events : 184251
Layout : left-symmetric-6
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)
=================================================================================
user@host:~$ sudo mdadm --examine /dev/sde
/dev/sde:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x5
Array UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
Name : Universe:0
Creation Time : Thu Jul 13 01:11:22 2017
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 15627794096 (7451.91 GiB 8001.43 GB)
Array Size : 15627793408 (14903.83 GiB 16002.86 GB)
Used Dev Size : 15627793408 (7451.91 GiB 8001.43 GB)
Data Offset : 259072 sectors
Super Offset : 8 sectors
Unused Space : before=258992 sectors, after=688 sectors
State : active
Device UUID : bf9e316b:5910c7ca:1fd799e3:41a349b3
Internal Bitmap : 8 sectors from superblock
Reshape pos'n : 39006208 (37.20 GiB 39.94 GB)
Delta Devices : -1 (5->4)
New Layout : left-symmetric
Update Time : Fri Jun 18 08:56:43 2021
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 2616ba80 - correct
Events : 184251
Layout : left-symmetric-6
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)
=================================================================================
user@host:~$ sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 4
Persistence : Superblock is persistent
State : inactive
Working Devices : 4
Delta Devices : -1, (1->0)
New Level : raid6
New Layout : left-symmetric
New Chunksize : 512K
Name : Universe:0
UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
Events : 184251
Number Major Minor RaidDevice
- 8 64 - /dev/sde
- 8 32 - /dev/sdc
- 8 48 - /dev/sdd
- 8 16 - /dev/sdb
=================================================================================
user@host:~$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 9.1M 1 loop /snap/canonical-livepatch/98
loop1 7:1 0 9.1M 1 loop /snap/canonical-livepatch/99
loop2 7:2 0 99.4M 1 loop /snap/core/11187
loop3 7:3 0 99.2M 1 loop /snap/core/11167
loop4 7:4 0 55.4M 1 loop /snap/core18/2066
loop5 7:5 0 70.4M 1 loop /snap/lxd/19647
loop7 7:7 0 217.5M 1 loop /snap/nextcloud/28088
loop8 7:8 0 67.6M 1 loop /snap/lxd/20326
loop9 7:9 0 217.5M 1 loop /snap/nextcloud/27920
loop10 7:10 0 55.5M 1 loop /snap/core18/2074
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 512M 0 part /boot/efi
└─sda2 8:2 0 232.4G 0 part /
sdb 8:16 0 7.3T 0 disk
sdc 8:32 0 7.3T 0 disk
sdd 8:48 0 7.3T 0 disk
sde 8:64 0 7.3T 0 disk
=================================================================================
user@host:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdc[7](S) sdb[6](S) sde[4](S) sdd[5](S)
31255588192 blocks super 1.2
=================================================================================
user@host:~$ sudo mdadm --stop /dev/md0
mdadm: stopped /dev/md0
user@host:~$ sudo mdadm --assemble /dev/md0 /dev/sd[bcde]
mdadm: /dev/md0 assembled from 4 drives - not enough to start the array
while not clean - consider --force.
* Re: 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
2021-06-25 12:08 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare Jason Flood
@ 2021-06-25 13:59 ` Phil Turmel
2021-06-26 11:09 ` Jason Flood
0 siblings, 1 reply; 7+ messages in thread
From: Phil Turmel @ 2021-06-25 13:59 UTC (permalink / raw)
To: Jason Flood, linux-raid
Good morning Jason,
Good report. Comments inline.
On 6/25/21 8:08 AM, Jason Flood wrote:
> I started with a 4x4TB disk RAID5 array and, over a few years changed all
> the drives to 8TB (WD Red - I hadn't seen the warnings before now, but it
> looks like these ones are OK). I then successfully migrated it to RAID6, but
> it then had a non-standard layout, so I ran:
> sudo mdadm --grow /dev/md0 --raid-devices=4
> --backup-file=/root/raid5backup --layout=normalize
Ugh. You don't have to use a backup file unless mdadm tells you to.
Now you are stuck with it.
> After a few days it reached 99% complete, but then the "hours remaining"
> counter started counting up. After a few days I had to power the system down
> before I could get a backup of the non-critical data (Couldn't get hold of
> enough storage quickly enough, but it wouldn't be catastrophic to lose it),
> and now the four drives are in standby, with the array thinking it is RAID0.
> Running:
> sudo mdadm --assemble /dev/md0 /dev/sd[bcde]
> responds with:
> mdadm: /dev/md0 assembled from 4 drives - not enough to start the
> array while not clean - consider --force.
You have to specify the backup file on assembly if a reshape using one
was interrupted.
> It appears to be similar to https://marc.info/?t=155492912100004&r=1&w=2,
> but before trying --force I was considering using overlay files as I'm not
> sure of the risk of damage. The set-up process that is documented in the "
> Recovering a damaged RAID" Wiki article is excellent, however the latter
> part of the process isn't clear to me. If successful, are the overlay files
> written to the disk like a virtual machine snapshot, or is the process
> stopped, the overlays removed and the process repeated, knowing that it now
> has a low risk of damage?
Using --force is very low risk on assembly. I would try it (without
overlays, and with backup file specified) before you do anything else.
Odds of success are high.
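Concretely, something along these lines (adjust the device list and the
backup file path to whatever you actually used for the grow):

sudo mdadm --assemble --verbose --force /dev/md0 /dev/sd[bcde] \
    --backup-file=/root/raid5backup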
Also try the flags to treat the backup file as garbage if its contents
don't match what mdadm expects.
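If memory serves, the flag is --invalid-backup, i.e. roughly the same
command with it appended:

sudo mdadm --assemble --verbose --force /dev/md0 /dev/sd[bcde] \
    --backup-file=/root/raid5backup --invalid-backup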
Report back here after the above.
> System details follow. Thanks for any help.
[details trimmed]
Your report of the details was excellent. Thanks for helping us help you.
Phil
* RE: 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
2021-06-25 13:59 ` Phil Turmel
@ 2021-06-26 11:09 ` Jason Flood
2021-06-26 13:13 ` antlists
0 siblings, 1 reply; 7+ messages in thread
From: Jason Flood @ 2021-06-26 11:09 UTC (permalink / raw)
To: 'Phil Turmel', linux-raid
Thanks for that, Phil - I think I'm starting to piece it all together now. I was going from a 4-disk RAID5 to 4-disk RAID6, so from my reading the backup file was recommended. The non-standard layout meant that the array had over 20TB usable, but standardising the layout reduced that to 16TB. In that case the reshape starts at the end so the critical section (and so the backup file) may have been in progress at the 99% complete point when it failed, hence the need to specify the backup file for the assemble command.
I ran "sudo mdadm --assemble --verbose --force /dev/md0 /dev/sd[bcde] --backup-file=/root/raid5backup":
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 3.
mdadm: Marking array /dev/md0 as 'clean'
mdadm: /dev/md0 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /root/raid5backup
mdadm: added /dev/sdc to /dev/md0 as 1
mdadm: added /dev/sdd to /dev/md0 as 2
mdadm: added /dev/sde to /dev/md0 as 3
mdadm: no uptodate device for slot 4 of /dev/md0
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: Need to backup 3072K of critical section..
mdadm: /dev/md0 has been started with 4 drives (out of 5).
=============================================================
sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Jul 13 01:11:22 2017
Raid Level : raid6
Array Size : 15627793408 (14903.83 GiB 16002.86 GB)
Used Dev Size : 7813896704 (7451.91 GiB 8001.43 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Jun 26 19:40:16 2021
State : clean, reshaping
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric-6
Chunk Size : 512K
Consistency Policy : bitmap
Reshape Status : 99% complete
Delta Devices : -1, (5->4)
New Layout : left-symmetric
Name : Universe:0
UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
Events : 184255
Number Major Minor RaidDevice State
6 8 16 0 active sync /dev/sdb
7 8 32 1 active sync /dev/sdc
5 8 48 2 active sync /dev/sdd
4 8 64 3 active sync /dev/sde
=============================================================
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sdb[6] sde[4] sdd[5] sdc[7]
15627793408 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUUU]
[===================>.] reshape = 99.7% (7794393600/7813896704) finish=52211434.6min speed=0K/sec
bitmap: 14/30 pages [56KB], 131072KB chunk
=============================================================
The array mounts and the files are all intact, but it is still sitting at 99% complete with 52 million minutes to finish and counting up. The "No backup metadata" message made me suspicious that it is stuck because it can't write to /root/raid5backup (and looking at it now, I should have put it somewhere more sensible as I'm using sudo, but I used it in the RAID5 to RAID6 process and it was happy). It does seem to have modified the file, though:
stat raid5backup
File: raid5backup
Size: 3149824 Blocks: 6152 IO Block: 4096 regular file
Device: 802h/2050d Inode: 1572897 Links: 1
Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-06-26 19:39:16.739983712 +1000
Modify: 2021-06-26 19:40:16.778498938 +1000
Change: 2021-06-26 19:40:16.778498938 +1000
Birth: -
=============================================================
But I believe those times are from when I first ran the assemble command - it's 20:30 now. I couldn't find a flag to conditionally treat the backup file as garbage - just the --invalid-backup "I know it's garbage" option. Given that the assemble isn't complaining about needing to restore the critical section, is my next step something like:
sudo mdadm --assemble --verbose --force /dev/md0 /dev/sd[bcde] --backup-file=raidbackup --invalid-backup
Thanks again, Phil. I haven't been using Linux seriously for very long, so this has been a steep learning curve for me.
Jason
* Re: 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
2021-06-26 11:09 ` Jason Flood
@ 2021-06-26 13:13 ` antlists
2021-06-26 14:28 ` Phil Turmel
0 siblings, 1 reply; 7+ messages in thread
From: antlists @ 2021-06-26 13:13 UTC (permalink / raw)
To: Jason Flood, 'Phil Turmel', linux-raid
On 26/06/2021 12:09, Jason Flood wrote:
> Reshape Status : 99% complete
> Delta Devices : -1, (5->4)
> New Layout : left-symmetric
>
> Name : Universe:0
> UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
> Events : 184255
>
> Number Major Minor RaidDevice State
> 6 8 16 0 active sync /dev/sdb
> 7 8 32 1 active sync /dev/sdc
> 5 8 48 2 active sync /dev/sdd
> 4 8 64 3 active sync /dev/sde
Phil will know much more about this than me, but I did notice that the
system thinks there should be FIVE raid drives. Is that an mdadm bug?
That would explain the failure to assemble - it thinks there's a drive
missing. And while I don't think we've had data-eating trouble,
reshaping a parity raid has caused quite a lot of grief for people over
the years ...
However, you're running a recent Ubuntu and mdadm - that should all have
been fixed by now.
Cheers,
Wol
* Re: 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
2021-06-26 13:13 ` antlists
@ 2021-06-26 14:28 ` Phil Turmel
2021-06-27 11:09 ` Jason Flood
0 siblings, 1 reply; 7+ messages in thread
From: Phil Turmel @ 2021-06-26 14:28 UTC (permalink / raw)
To: antlists, Jason Flood, linux-raid
Good morning Jason, Wol,
On 6/26/21 9:13 AM, antlists wrote:
> On 26/06/2021 12:09, Jason Flood wrote:
>> Reshape Status : 99% complete
>> Delta Devices : -1, (5->4)
>> New Layout : left-symmetric
>>
>> Name : Universe:0
>> UUID : 3eee8746:8a3bf425:afb9b538:daa61b29
>> Events : 184255
>>
>> Number Major Minor RaidDevice State
>> 6 8 16 0 active sync /dev/sdb
>> 7 8 32 1 active sync /dev/sdc
>> 5 8 48 2 active sync /dev/sdd
>> 4 8 64 3 active sync /dev/sde
>
> Phil will know much more about this than me, but I did notice that the
> system thinks there should be FIVE raid drives. Is that an mdadm bug?
Not a bug, but a reshape from a degraded array with a reduction in space.
> That would explain the failure to assemble - it thinks there's a drive
> missing. And while I don't think we've had data-eating trouble,
> reshaping a parity raid has caused quite a lot of grief for people over
> the years ...
I've never tried it starting from a degraded array. Might be a corner
case bug not yet exposed.
> However, you're running a recent Ubuntu and mdadm - that should all have
> been fixed by now.
Indeed.
> Cheers,
> Wol
On 6/26/21 7:09 AM, Jason Flood wrote:
> Thanks for that, Phil - I think I'm starting to piece it all together
> now. I was going from a 4-disk RAID5 to 4-disk RAID6, so from my reading
> the backup file was recommended. The non-standard layout meant that the
> array had over 20TB usable, but standardising the layout reduced that to
> 16TB. In that case the reshape starts at the end so the critical section
> (and so the backup file) may have been in progress at the 99% complete
> point when it failed, hence the need to specify the backup file for the
> assemble command.
>
> I ran "sudo mdadm --assemble --verbose --force /dev/md0 /dev/sd[bcde]
--backup-file=/root/raid5backup":
>
> mdadm: looking for devices for /dev/md0
> mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
> mdadm: /dev/sdc is identified as a member of /dev/md0, slot 1.
> mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
> mdadm: /dev/sde is identified as a member of /dev/md0, slot 3.
> mdadm: Marking array /dev/md0 as 'clean'
> mdadm: /dev/md0 has an active reshape - checking if critical section
> needs to be restored
> mdadm: No backup metadata on /root/raid5backup
> mdadm: added /dev/sdc to /dev/md0 as 1
> mdadm: added /dev/sdd to /dev/md0 as 2
> mdadm: added /dev/sde to /dev/md0 as 3
> mdadm: no uptodate device for slot 4 of /dev/md0
> mdadm: added /dev/sdb to /dev/md0 as 0
> mdadm: Need to backup 3072K of critical section..
> mdadm: /dev/md0 has been started with 4 drives (out of 5).
>
So force was sufficient to assemble. But you are still stuck at 99%.
Look at the output of ps to see if mdmon is still running (that is the
background process that actually reshapes stripe by stripe). If not,
look in your logs for clues as to why it died.
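For example, something like this (exact process names vary, so cast a wide
net; the syslog path assumes Ubuntu):

ps aux | grep -E '[m]dmon|[m]dadm|[m]d0'
dmesg | grep -iE 'md0|raid6'
sudo grep -i md0 /var/log/syslog | tail -n 50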
If you can't find anything significant, the next step would be to back up
the currently functioning array to another system/drive collection and
start from scratch. I wouldn't trust anything else with the information
available.
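For the copy itself, assuming the filesystem is mounted somewhere like
/mnt/array and you can scrape together a destination, a plain rsync is
enough (paths here are only examples):

rsync -aHAX --info=progress2 /mnt/array/ /mnt/backup/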
Phil
ps. Convention on kernel.org mailing lists is to NOT top-post, and to
trim unnecessary context.
* RE: 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
2021-06-26 14:28 ` Phil Turmel
@ 2021-06-27 11:09 ` Jason Flood
2021-06-28 14:24 ` Phil Turmel
0 siblings, 1 reply; 7+ messages in thread
From: Jason Flood @ 2021-06-27 11:09 UTC (permalink / raw)
To: 'Phil Turmel', 'antlists', linux-raid
Good morning Phil, Wol,
> So force was sufficient to assemble. But you are still stuck at 99%.
> Look at the output of ps to see if mdmon is still running (that is the background process that actually reshapes stripe by stripe). If not, look in your logs for clues as to why it died.
> If you can't find anything significant, the next step would be to backup the currently functioning array to another system/drive collection and start from scratch. I wouldn't trust anything else with the information available.
> Phil
Will do, thanks. I have a few assignments due next weekend so I may not be able to report back for a week.
> ps. Convention on kernel.org mailing lists is to NOT top-post, and to trim unnecessary context.
Sorry. First time on a mailing list since well before Outlook was invented!
Thanks again, Phil and Wol.
* Re: 4-disk RAID6 (non-standard layout) normalise hung, now all disks spare
2021-06-27 11:09 ` Jason Flood
@ 2021-06-28 14:24 ` Phil Turmel
0 siblings, 0 replies; 7+ messages in thread
From: Phil Turmel @ 2021-06-28 14:24 UTC (permalink / raw)
To: Jason Flood, linux-raid; +Cc: 'antlists'
Good morning Jason,
On 6/27/21 7:09 AM, Jason Flood wrote:
> Good morning Phil, Wol,
>
>> So force was sufficient to assemble. But you are still stuck at 99%.
>
>> Look at the output of ps to see if mdmon is still running (that is the background process that actually reshapes stripe by stripe). If not, look in your logs for clues as to why it died.
>
>> If you can't find anything significant, the next step would be to backup the currently functioning array to another system/drive collection and start from scratch. I wouldn't trust anything else with the information available.
>
>> Phil
>
> Will do, thanks. I have a few assignments due next weekend so I may not be able to report back for a week.
No worries.
>> ps. Convention on kernel.org mailing lists is to NOT top-post, and to trim unnecessary context.
>
> Sorry. First time on a mailing list since well before Outlook was invented!
No worries. Also note that many mailing lists disagree with this. And
with kernel.org's convention to CC: all participants. You almost can't
avoid getting flamed one way or the other. (:
> Thanks again, Phil and Wol.
You're welcome.
Phil