* Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-07 8:59 UTC (permalink / raw)
To: linux-raid
Hello all,
my situation is the following: I have a small 4-disk JBOD that I use
to hold a RAID6 software RAID array managed by mdraid (currently the
Debian package version 3.4-4, on Linux kernel 4.7.8-1).
I've had sporadic resets of the JBOD for a variety of reasons (power
failures or disk failures; the JBOD has the bad habit of resetting
itself when one disk has an I/O error, which causes all of the disks
to go offline temporarily).
When this happens, all the disks get kicked from the RAID, as md
fails to find them until the reset of the JBOD is complete. When the
disks come back online, even if it's just a few seconds later, the
RAID remains in the failed configuration with all 4 disks missing, of
course.
Normally, the way I would proceed in this case is to unmount the
filesystem sitting on top of the RAID, stop the RAID, and then try to
start it again, which works reasonably well (aside from the obvious
filesystem check that is often needed).
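For reference, the recovery dance I normally go through is roughly
the following (a sketch only; the mount point and md device name are
just examples from my setup):

  umount /mnt/oneforall        # whatever sits on top of the array
  mdadm --stop /dev/md127      # tear down the failed array
  mdadm --assemble --scan      # reassemble it from the superblocks
  fsck.ext4 /dev/md127         # the "obvious filesystem check"
  mount /mnt/oneforall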
The same thing happened again a couple of days ago, but this time I
tried re-adding the disks directly when they came back online, using
mdadm -a, confident that since they _had_ recently been part of the
array, it would simply go back to working fine. Except that this is
not the case when ALL the disks have been kicked out of the array!
Instead, all the disks were marked as 'spare' and the RAID would not
assemble anymore.
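Concretely, what I ran was something along these lines (the exact
form is from memory, so take it as a sketch):

  mdadm /dev/md126 --add /dev/sdc    # and likewise for the other disks

and it was this --add against a running but fully-failed array that
turned all the members into spares.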
At this point I stopped everything and made a full copy of the RAID
disks (lucky me, I had just bought a new JBOD for an upgrade, and a
bunch of new disks, although one of them is apparently defective, so
I have only been able to back up 3 of the 4 disks), and I have been
toying around with ways to recover the array by working on the copies
I've made (I've set the original disks to read-only at the kernel
level just to be sure).
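In practice that amounted to something along these lines (device
names are examples; the originals are the four array members, the
copies live on the new disks):

  # originals read-only at the kernel level
  for d in /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
      blockdev --setro "$d"
  done
  # raw copy of each member I could back up
  dd if=/dev/sdc of=/dev/sdh bs=1M status=progress
  # ...and likewise for the other two disks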
So now my situation is this, and I would like to know if there is
something I can try to recover the RAID (I've made a few tests that I
will describe momentarily). I would also like to know if there is any
possibility for md to handle this kind of issue (all disks in a RAID
going temporarily offline) more gracefully, which is likely needed
for a lot of home setups where SATA is used instead of SAS.
So one thing that I've done is to hand-edit the superblocks on the
disk copies to put the device roles back as they were (getting the
information from the pre-failure dmesg output). (By the way, I've
been using Andy's Binary Editor for the superblock editing, so if
anyone is interested in a be.ini for mdraid v1 superblocks, including
checksum verification, I'd be happy to share.) Specifically, I've
left the device number untouched, but I have edited the dev_roles
array so that the slots corresponding to the dev_number of each disk
map to the appropriate device roles.
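For the curious, the fields involved can also be peeked at without a
binary editor. If I read the v1.2 format right, the superblock sits
4096 bytes into the device, dev_number is a 32-bit little-endian
field at offset 160 within it, and dev_roles[] is an array of 16-bit
little-endian entries starting at offset 256 (0xffff meaning spare,
0xfffe faulty). Something like this (device name is a placeholder,
and od prints host byte order, so this assumes a little-endian
machine):

  # dev_number of this member
  dd if=/dev/sdX bs=1 skip=$((4096 + 160)) count=4 2>/dev/null | od -An -tu4
  # first 8 slots of dev_roles[]
  dd if=/dev/sdX bs=1 skip=$((4096 + 256)) count=16 2>/dev/null | od -An -tx2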
I can then assemble the array with only 3 of 4 disks (because I do not
have a copy of the fourth, essentially) and force-run it. However,
when I do this, I get two things:
(1) a complaint about the bitmap being out of date (number of events
too low by 3) and
(2) I/O errors on logical block 0 (and the RAID data thus completely
inaccessible)
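For reference, the assembly attempt itself was essentially (device
names are placeholders for the three copies, and the exact options
are from memory):

  mdadm --assemble --force --run /dev/md127 /dev/sdX /dev/sdY /dev/sdZ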
I'm now wondering what I should try next. Prevent a resync by
matching the event count to that of the bitmap (or vice versa)? Try a
different permutation of the roles (I have triple-checked, but who
knows)? Try a different subset of disks? Try to re-create the array?
Thanks in advance for any suggestion you may have,
--
Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: John Stoffel @ 2016-12-07 14:31 UTC (permalink / raw)
To: Giuseppe Bilotta; +Cc: linux-raid
Giuseppe> my situation is the following: I have a small 4-disk JBOD that I use
Giuseppe> to hold a RAID6 software raid setup controlled by mdraid (currently
Giuseppe> Debian version 3.4-4 on Linux kernel 4.7.8-1
Giuseppe> I've had sporadic resets of the JBOD due to a variety of reasons
Giuseppe> (power failures or disk failures —the JBOD has the bad habit of
Giuseppe> resetting when one disk has an I/O error, which causes all of the
Giuseppe> disks to go offline temporarily).
Please toss that JBOD out the window! *grin*
Giuseppe> When this happens, all the disks get kicked from the RAID,
Giuseppe> as md fails to find them until the reset of the JBOD is
Giuseppe> complete. When the disks come back online, even if it's just
Giuseppe> a few seconds later, the RAID remains in the failed
Giuseppe> configuration with all 4 disks missing, of course.
Giuseppe> Normally, the way I would proceed in this case is to unmount
Giuseppe> the filesystem sitting on top of the RAID, stop the RAID,
Giuseppe> and then try to start it again, which works reasonably well
Giuseppe> (aside from the obvious filesystem check that is often
Giuseppe> needed).
Giuseppe> The thing happened again a couple of days ago, but this time
Giuseppe> I tried re-adding the disks directly when they came back
Giuseppe> online, using mdadm -a and confident that since they _had_
Giuseppe> been recently part of the array, the array would actually go
Giuseppe> back to work fine —except that this is not the case when ALL
Giuseppe> disks were kicked out of the array! Instead, what happened
Giuseppe> was that all the disks were marked as 'spare' and the RAID
Giuseppe> would not assemble anymore.
Can you please send us the full details of each disk using the
command:
mdadm -E /dev/sda1
Where of course the 'a' and the '1' depend on whether you are using
whole disks or partitions for your arrays.
You might be able to just force the three spare disks (assumed in
this case to be sda1, sdb1, sdc1; but you need to be sure first!) to
assemble into a full array with:
mdadm -A /dev/md50 /dev/sda1 /dev/sdb1 /dev/sdc1
And if that works, great. If not, post the error message(s) you get
back.
Basically provide more details on your setup so we can help you.
John
Giuseppe> At this point I stopped everything and made a full copy of
Giuseppe> the RAID disks (lucky me, I had just bought a new JBOD for
Giuseppe> an upgrade, and a bunch of new disks, even if one of them is
Giuseppe> apparently defective so I have only been able to backup 3 of
Giuseppe> the 4 disks) and I have been toying around with ways to
Giuseppe> recover the array by playing on the copies I've made (I've
Giuseppe> set the original disks to readonly at the kernel level just
Giuseppe> to be sure).
Giuseppe> So now my situation is this, and I would like to know if there is
Giuseppe> something I can try to recover the RAID (I've made a few tests that I
Giuseppe> will describe momentarily). (I would like to know if there is any
Giuseppe> possibility for md to handle these kind of issue —all disks in a RAID
Giuseppe> going temporarily offline— more gracefully, which is likely needed for
Giuseppe> a lot of home setup where SATA is used instead of SAS).
Giuseppe> So one thing that I've done is to hack around the superblock in the
Giuseppe> disks (copies) to put back the device roles as they were (getting the
Giuseppe> information from the pre-failure dmesg output). (By the way, I've been
Giuseppe> using Andy's Binary Editor for the superblock editing, so if anyone is
Giuseppe> interested in a be.ini for mdraid v1 superblocks, including checksum
Giuseppe> verification, I'd be happy to share). Specifically, I've left the
Giuseppe> device number untouched, but I have edited the dev_roles array so that
Giuseppe> the slots corresponding to the dev_number from all the disks map to
Giuseppe> appropriate device roles.
Giuseppe> I can then assemble the array with only 3 of 4 disks (because I do not
Giuseppe> have a copy of the fourth, essentially) and force-run it. However,
Giuseppe> when I do this, I get two things:
Giuseppe> (1) a complaint about the bitmap being out of date (number of events
Giuseppe> too low by 3) and
Giuseppe> (2) I/O errors on logical block 0 (and the RAID data thus completely
Giuseppe> inaccessible)
Giuseppe> I'm now wondering about what I should try next. Prevent a resync by
Giuseppe> matching the event count with that of the bitmap (or conversely)? Try
Giuseppe> a different permutation of the roles? (I have triple-checked but who
Giuseppe> knows)? Try a different subset of disks? Try and recreate the array?
Giuseppe> Thanks in advance for any suggestion you may have,
Giuseppe> --
Giuseppe> Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-07 17:21 UTC (permalink / raw)
To: John Stoffel; +Cc: linux-raid
Hello John, and thanks for your time
Giuseppe> I've had sporadic resets of the JBOD due to a variety of reasons
Giuseppe> (power failures or disk failures —the JBOD has the bad habit of
Giuseppe> resetting when one disk has an I/O error, which causes all of the
Giuseppe> disks to go offline temporarily).
John> Please toss that JBOD out the window! *grin*
Well, that's exactly why I bought the new one, which is the one I'm
currently using to host the backup disks I'm experimenting on! 8-)
However, I suspect this is a misfeature common to many if not all
'home' JBODs, which are all SATA-based and only provide eSATA and/or
USB3 connections to the machine.
Giuseppe> The thing happened again a couple of days ago, but this time
Giuseppe> I tried re-adding the disks directly when they came back
Giuseppe> online, using mdadm -a and confident that since they _had_
Giuseppe> been recently part of the array, the array would actually go
Giuseppe> back to work fine —except that this is not the case when ALL
Giuseppe> disks were kicked out of the array! Instead, what happened
Giuseppe> was that all the disks were marked as 'spare' and the RAID
Giuseppe> would not assemble anymore.
John> Can you please send us the full details of each disk using the
John> command:
John>
John> mdadm -E /dev/sda1
John>
Here it is. Notice that this is the result of -E _after_ the attempted
re-add while the RAID was running, which marked all the disks as
spares:
==8<=======
/dev/sdc:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : 1e2f00fc - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : c9dfe033 - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : 15a3975a - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : f7359c4e:c1f04b22:ce7aa32f:ed5bb054
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : 3a5b94a7 - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
==8<=======
I do, however, know the _original_ positions of the respective disks
from the kernel messages.
At assembly time:
[ +0.000638] RAID conf printout:
[ +0.000001] --- level:6 rd:4 wd:4
[ +0.000001] disk 0, o:1, dev:sdf
[ +0.000001] disk 1, o:1, dev:sde
[ +0.000000] disk 2, o:1, dev:sdd
[ +0.000001] disk 3, o:1, dev:sdc
After the JBOD disappeared and right before they all get kicked out:
[ +0.000438] RAID conf printout:
[ +0.000001] --- level:6 rd:4 wd:0
[ +0.000001] disk 0, o:0, dev:sdf
[ +0.000001] disk 1, o:0, dev:sde
[ +0.000000] disk 2, o:0, dev:sdd
[ +0.000001] disk 3, o:0, dev:sdc
John> You might be able to just for the three spare disks (assumed in this
John> case to be sda1, sdb1, sdc1; but you need to be sure first!) to
John> assemble into a full array with:
John>
John> mdadm -A /dev/md50 /dev/sda1 /dev/sdb1 /dev/sdc1
John>
John> And if that works, great. If not, post the error message(s) you get
John> back.
Note that the RAID has no active disks anymore: when I tried
re-adding the formerly active disks that were kicked from the array,
they got marked as spares, and mdraid simply refuses to start a RAID6
array with only spares. The message I get is indeed:
mdadm: /dev/md126 assembled from 0 drives and 3 spares - not enough to
start the array.
This is the point at which I made a copy of 3 of the 4 disks and
started experimenting. Specifically, I dd'ed sdc onto sdh, sdd onto
sdi and sde onto sdj, and played around with sd[hij] rather than the
original disks, as I mentioned:
Giuseppe> So one thing that I've done is to hack around the superblock in the
Giuseppe> disks (copies) to put back the device roles as they were (getting the
Giuseppe> information from the pre-failure dmesg output). (By the way, I've been
Giuseppe> using Andy's Binary Editor for the superblock editing, so if anyone is
Giuseppe> interested in a be.ini for mdraid v1 superblocks, including checksum
Giuseppe> verification, I'd be happy to share). Specifically, I've left the
Giuseppe> device number untouched, but I have edited the dev_roles array so that
Giuseppe> the slots corresponding to the dev_number from all the disks map to
Giuseppe> appropriate device roles.
Specifically, I hand-edited the superblocks to achieve this:
==8<===============
/dev/sdh:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : 1e3300fe - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdi:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : c9e3e035 - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdj:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x9
Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Name : labrador:oneforall (local to host labrador)
Creation Time : Fri Nov 30 19:57:45 2012
Raid Level : raid6
Raid Devices : 4
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262048 sectors, after=944 sectors
State : clean
Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Dec 4 17:11:19 2016
Bad Block Log : 512 entries available at offset 80 sectors - bad
blocks present.
Checksum : 15a7975c - correct
Events : 31196
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
==8<===============
And I _can_ assemble the array, but what I get is this:
[ +0.003574] md: bind<sdi>
[ +0.001823] md: bind<sdh>
[ +0.000978] md: bind<sdj>
[ +0.003971] md/raid:md127: device sdj operational as raid disk 1
[ +0.000125] md/raid:md127: device sdh operational as raid disk 3
[ +0.000105] md/raid:md127: device sdi operational as raid disk 2
[ +0.015017] md/raid:md127: allocated 4374kB
[ +0.000139] md/raid:md127: raid level 6 active with 3 out of 4
devices, algorithm 2
[ +0.000063] RAID conf printout:
[ +0.000002] --- level:6 rd:4 wd:3
[ +0.000003] disk 1, o:1, dev:sdj
[ +0.000002] disk 2, o:1, dev:sdi
[ +0.000001] disk 3, o:1, dev:sdh
[ +0.004187] md127: bitmap file is out of date (31193 < 31196) --
forcing full recovery
[ +0.000065] created bitmap (22 pages) for device md127
[ +0.000072] md127: bitmap file is out of date, doing full recovery
[ +0.100300] md127: bitmap initialized from disk: read 2 pages, set
44711 of 44711 bits
[ +0.039741] md127: detected capacity change from 0 to 6000916561920
[ +0.000085] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000064] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000019] ldm_validate_partition_table(): Disk read failed.
[ +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000026] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000019] Dev md127: unable to read RDB block 0
[ +0.000016] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
[ +0.000030] md127: unable to read partition table
and any attempt to access md127 content gives an I/O error.
--
Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: John Stoffel @ 2016-12-08 19:02 UTC (permalink / raw)
To: Giuseppe Bilotta; +Cc: John Stoffel, linux-raid
Sorry for not getting back to you sooner; I've been under the
weather lately. And I'm NOT an expert on this, but it's good that
you've made copies of the disks.
Giuseppe> Hello John, and thanks for your time
Giuseppe> I've had sporadic resets of the JBOD due to a variety of reasons
Giuseppe> (power failures or disk failures —the JBOD has the bad habit of
Giuseppe> resetting when one disk has an I/O error, which causes all of the
Giuseppe> disks to go offline temporarily).
John> Please toss that JBOD out the window! *grin*
Giuseppe> Well, that's exactly why I bought the new one which is the one I'm
Giuseppe> currently using to host the backup disks I'm experimenting on! 8-)
Giuseppe> However I suspect this is a misfeature common to many if not all
Giuseppe> 'home' JBODS which are all SATA based and only provide eSATA and/or
Giuseppe> USB3 connection to the machine.
Giuseppe> The thing happened again a couple of days ago, but this time
Giuseppe> I tried re-adding the disks directly when they came back
Giuseppe> online, using mdadm -a and confident that since they _had_
Giuseppe> been recently part of the array, the array would actually go
Giuseppe> back to work fine —except that this is not the case when ALL
Giuseppe> disks were kicked out of the array! Instead, what happened
Giuseppe> was that all the disks were marked as 'spare' and the RAID
Giuseppe> would not assemble anymore.
John> Can you please send us the full details of each disk using the
John> command:
John>
John> mdadm -E /dev/sda1
John>
Giuseppe> Here it is. Notice that this is the result of -E _after_ the attempted
Giuseppe> re-add while the RAID was running, which marked all the disks as
Giuseppe> spares:
Yeah, this is probably a bad state. I would suggest you try to just
assemble the disks in various orders using your clones:
mdadm -A /dev/md0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
And then mix up the order until you get a working array. You might
also want to try assembling using the 'missing' flag for the original
disk which dropped out of the array, so that just the three good disks
are used. This might take a while to test all the possible
permutations.
You might also want to look back in the archives of this mailing
list. Phil Turmel has some great advice and howto guides for this.
You can do the test assembles using loopback devices so that you
don't write to the originals, or even to the clones. This should let
you do the testing more quickly.
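Something along these lines (an untested sketch) would give you
read-only loop devices to experiment with:

  for d in /dev/sdh /dev/sdi /dev/sdj; do
      losetup --find --show --read-only "$d"
  done
  # then point mdadm at the resulting /dev/loopN devices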
Here are some other pointers for drive timeout issues that you should
look at as well:
Readings for timeout mismatch issues: (whole threads if possible)
http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2
Giuseppe> ==8<=======
Giuseppe> /dev/sdc:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 1e2f00fc - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdd:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : c9dfe033 - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sde:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 15a3975a - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdf:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : f7359c4e:c1f04b22:ce7aa32f:ed5bb054
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 3a5b94a7 - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : spare
Giuseppe> Array State : .... ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> ==8<=======
Giuseppe> I do however know the _original_ positions of the respective disks
Giuseppe> from the kernel messages
Giuseppe> At assembly time:
Giuseppe> [ +0.000638] RAID conf printout:
Giuseppe> [ +0.000001] --- level:6 rd:4 wd:4
Giuseppe> [ +0.000001] disk 0, o:1, dev:sdf
Giuseppe> [ +0.000001] disk 1, o:1, dev:sde
Giuseppe> [ +0.000000] disk 2, o:1, dev:sdd
Giuseppe> [ +0.000001] disk 3, o:1, dev:sdc
Giuseppe> After the JBOD disappeared and right before they all get kicked out:
Giuseppe> [ +0.000438] RAID conf printout:
Giuseppe> [ +0.000001] --- level:6 rd:4 wd:0
Giuseppe> [ +0.000001] disk 0, o:0, dev:sdf
Giuseppe> [ +0.000001] disk 1, o:0, dev:sde
Giuseppe> [ +0.000000] disk 2, o:0, dev:sdd
Giuseppe> [ +0.000001] disk 3, o:0, dev:sdc
John> You might be able to just for the three spare disks (assumed in this
John> case to be sda1, sdb1, sdc1; but you need to be sure first!) to
John> assemble into a full array with:
John>
John> mdadm -A /dev/md50 /dev/sda1 /dev/sdb1 /dev/sdc1
John>
John> And if that works, great. If not, post the error message(s) you get
John> back.
Giuseppe> Note that the RAID has no active disks anymore, since when I tried
Giuseppe> re-adding the formerly active disks that
Giuseppe> where kicked from the array they got marked as spares, and mdraid
Giuseppe> simply refuses to start a RAID6 setup with only spares. The message I
Giuseppe> get is indeed
Giuseppe> mdadm: /dev/md126 assembled from 0 drives and 3 spares - not enough to
Giuseppe> start the array.
Giuseppe> This is the point at which I made a copy of 3 of the 4 disks and
Giuseppe> started playing around. Specifically, I dd'ed sdc into sdh, sdd into
Giuseppe> sdi and sde into sdj and started playing around with sd[hij] rather
Giuseppe> than the original disks, as I mentioned:
Giuseppe> So one thing that I've done is to hack around the superblock in the
Giuseppe> disks (copies) to put back the device roles as they were (getting the
Giuseppe> information from the pre-failure dmesg output). (By the way, I've been
Giuseppe> using Andy's Binary Editor for the superblock editing, so if anyone is
Giuseppe> interested in a be.ini for mdraid v1 superblocks, including checksum
Giuseppe> verification, I'd be happy to share). Specifically, I've left the
Giuseppe> device number untouched, but I have edited the dev_roles array so that
Giuseppe> the slots corresponding to the dev_number from all the disks map to
Giuseppe> appropriate device roles.
Giuseppe> Specifically, I hand-edited the superblocks to achieve this:
Giuseppe> ==8<===============
Giuseppe> /dev/sdh:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 543f75ac:a1f3cf99:1c6b71d9:52e358b9
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 1e3300fe - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : Active device 3
Giuseppe> Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdi:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : 649d53ad:f909b7a9:cd0f57f2:08a55e3b
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : c9e3e035 - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : Active device 2
Giuseppe> Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> /dev/sdj:
Giuseppe> Magic : a92b4efc
Giuseppe> Version : 1.2
Giuseppe> Feature Map : 0x9
Giuseppe> Array UUID : 943d287e:af28b455:88a047f2:d714b8c6
Giuseppe> Name : labrador:oneforall (local to host labrador)
Giuseppe> Creation Time : Fri Nov 30 19:57:45 2012
Giuseppe> Raid Level : raid6
Giuseppe> Raid Devices : 4
Giuseppe> Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Giuseppe> Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
Giuseppe> Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Giuseppe> Data Offset : 262144 sectors
Giuseppe> Super Offset : 8 sectors
Giuseppe> Unused Space : before=262048 sectors, after=944 sectors
Giuseppe> State : clean
Giuseppe> Device UUID : dd3f90ab:619684c0:942a7d88:f116f2db
Giuseppe> Internal Bitmap : 8 sectors from superblock
Giuseppe> Update Time : Sun Dec 4 17:11:19 2016
Giuseppe> Bad Block Log : 512 entries available at offset 80 sectors - bad
Giuseppe> blocks present.
Giuseppe> Checksum : 15a7975c - correct
Giuseppe> Events : 31196
Giuseppe> Layout : left-symmetric
Giuseppe> Chunk Size : 512K
Giuseppe> Device Role : Active device 1
Giuseppe> Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
Giuseppe> ==8<===============
Giuseppe> And I _can_ assemble the array, but what I get is this:
Giuseppe> [ +0.003574] md: bind<sdi>
Giuseppe> [ +0.001823] md: bind<sdh>
Giuseppe> [ +0.000978] md: bind<sdj>
Giuseppe> [ +0.003971] md/raid:md127: device sdj operational as raid disk 1
Giuseppe> [ +0.000125] md/raid:md127: device sdh operational as raid disk 3
Giuseppe> [ +0.000105] md/raid:md127: device sdi operational as raid disk 2
Giuseppe> [ +0.015017] md/raid:md127: allocated 4374kB
Giuseppe> [ +0.000139] md/raid:md127: raid level 6 active with 3 out of 4
Giuseppe> devices, algorithm 2
Giuseppe> [ +0.000063] RAID conf printout:
Giuseppe> [ +0.000002] --- level:6 rd:4 wd:3
Giuseppe> [ +0.000003] disk 1, o:1, dev:sdj
Giuseppe> [ +0.000002] disk 2, o:1, dev:sdi
Giuseppe> [ +0.000001] disk 3, o:1, dev:sdh
Giuseppe> [ +0.004187] md127: bitmap file is out of date (31193 < 31196) --
Giuseppe> forcing full recovery
Giuseppe> [ +0.000065] created bitmap (22 pages) for device md127
Giuseppe> [ +0.000072] md127: bitmap file is out of date, doing full recovery
Giuseppe> [ +0.100300] md127: bitmap initialized from disk: read 2 pages, set
Giuseppe> 44711 of 44711 bits
Giuseppe> [ +0.039741] md127: detected capacity change from 0 to 6000916561920
Giuseppe> [ +0.000085] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000064] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000019] ldm_validate_partition_table(): Disk read failed.
Giuseppe> [ +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000026] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000021] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000019] Dev md127: unable to read RDB block 0
Giuseppe> [ +0.000016] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000022] Buffer I/O error on dev md127, logical block 0, async page read
Giuseppe> [ +0.000030] md127: unable to read partition table
Giuseppe> and any attempt to access md127 content gives an I/O error.
Giuseppe> --
Giuseppe> Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-22 23:11 UTC (permalink / raw)
To: John Stoffel; +Cc: linux-raid
Hello again,
On Thu, Dec 8, 2016 at 8:02 PM, John Stoffel <john@stoffel.org> wrote:
>
> Sorry for not getting back to you sooner, I've been under the weather
> lately. And I'm NOT an expert on this, but it's good you've made
> copies of the disks.
Don't worry about the timing; I haven't had much time to dedicate to
the recovery of this RAID either. As you can see, it was not that
urgent ;-)
> Giuseppe> Here it is. Notice that this is the result of -E _after_ the attempted
> Giuseppe> re-add while the RAID was running, which marked all the disks as
> Giuseppe> spares:
>
> Yeah, this is probably a bad state. I would suggest you try to just
> assemble the disks in various orders using your clones:
>
> mdadm -A /dev/md0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
>
> And then mix up the order until you get a working array. You might
> also want to try assembling using the 'missing' flag for the original
> disk which dropped out of the array, so that just the three good disks
> are used. This might take a while to test all the possible
> permutations.
>
> You might also want to look back in the archives of this mailing
> list. Phil Turmel has some great advice and howto guides for this.
> You can do the test assembles using loop back devices so that you
> don't write to the originals, or even to the clones.
I've used the instructions on using overlays with dmsetup + sparse
files on the RAID wiki
https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
to experiment with the recovery (and just to be sure, I set the
original disks read-only using blockdev; might be worth adding this to
the wiki).
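Condensed, the overlay setup from that wiki page boils down to
something like this per disk (sizes and names here are just
examples):

  # sparse file to hold the copy-on-write data, exposed as a loop device
  truncate -s 4G overlay-sdd.img
  loop=$(losetup --find --show overlay-sdd.img)
  # snapshot target: reads hit /dev/sdd, writes land in the overlay
  dmsetup create sdd-ov --table \
      "0 $(blockdev --getsz /dev/sdd) snapshot /dev/sdd $loop N 8"

after which the experiments run against /dev/mapper/sdd-ov and
friends instead of the disks themselves.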
I also wrote a small script to test all combinations (nothing smart,
really, just an enumeration of the possible orderings, but I'll
consider putting it up on the wiki as well; a sketch of it is at the
end of this message). To test whether the RAID was being re-created
correctly with each combination, I used `file -s` on the resulting
array and checked whether the output made sense. I was surprised to
find that there are multiple combinations that make sense (note that
the disk names are shifted by one compared to previous emails, due to
a machine lockup that required a reboot and another disk butting in
and changing the order):
trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)
trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)
trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)
trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)
trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)
trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
/dev/md111: Linux rev 1.0 ext4 filesystem data,
UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
(needs journal recovery) (extents) (large files) (huge files)
:
So there are six out of 24 combinations that make sense, at least for
the first block. I know from the pre-fail dmesg that the g-f-e-d order
should be the correct one, but now I'm left wondering if there is a
better way to verify this (other than manually sampling files to see
if they make sense), or if the left-symmetric layout on a RAID6 simply
allows some of the disk positions to be swapped without loss of data.
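For completeness, the enumeration script is essentially the
following (a sketch: the geometry options are copied from the
mdadm -E output above, and the device list should of course point at
the overlay devices or copies rather than at the original disks):

==8<===============
#!/bin/bash
# Brute-force the member order: re-create the array with --assume-clean
# (no resync) over each permutation and look at the first blocks.
# Add --data-offset if mdadm's default does not match the 262144-sector
# offset reported by mdadm -E.
devs=(/dev/sdd /dev/sde /dev/sdf /dev/sdg)
for w in "${devs[@]}"; do
  for x in "${devs[@]}"; do
    [[ $x == "$w" ]] && continue
    for y in "${devs[@]}"; do
      [[ $y == "$w" || $y == "$x" ]] && continue
      for z in "${devs[@]}"; do
        [[ $z == "$w" || $z == "$x" || $z == "$y" ]] && continue
        echo "trying $w $x $y $z"
        mdadm --create /dev/md111 --assume-clean --run \
              --metadata=1.2 --level=6 --raid-devices=4 \
              --chunk=512 --layout=left-symmetric \
              "$w" "$x" "$y" "$z" >/dev/null 2>&1
        file -s /dev/md111
        mdadm --stop /dev/md111 >/dev/null 2>&1
        echo
      done
    done
  done
done
==8<===============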
--
Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: NeilBrown @ 2016-12-22 23:25 UTC (permalink / raw)
To: Giuseppe Bilotta, John Stoffel; +Cc: linux-raid
On Fri, Dec 23 2016, Giuseppe Bilotta wrote:
> Hello again,
>
> On Thu, Dec 8, 2016 at 8:02 PM, John Stoffel <john@stoffel.org> wrote:
>>
>> Sorry for not getting back to you sooner, I've been under the weather
>> lately. And I'm NOT an expert on this, but it's good you've made
>> copies of the disks.
>
> Don't worry about the timing, as you can see I haven't had much time
> to dedicate to the recovery of this RAID either. As you can see, it
> was not that urgent ;-)
>
>
>> Giuseppe> Here it is. Notice that this is the result of -E _after_ the attempted
>> Giuseppe> re-add while the RAID was running, which marked all the disks as
>> Giuseppe> spares:
>>
>> Yeah, this is probably a bad state. I would suggest you try to just
>> assemble the disks in various orders using your clones:
>>
>> mdadm -A /dev/md0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
>>
>> And then mix up the order until you get a working array. You might
>> also want to try assembling using the 'missing' flag for the original
>> disk which dropped out of the array, so that just the three good disks
>> are used. This might take a while to test all the possible
>> permutations.
>>
>> You might also want to look back in the archives of this mailing
>> list. Phil Turmel has some great advice and howto guides for this.
>> You can do the test assembles using loop back devices so that you
>> don't write to the originals, or even to the clones.
>
> I've used the instructions on using overlays with dmsetup + sparse
> files on the RAID wiki
> https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
> to experiment with the recovery (and just to be sure, I set the
> original disks read-only using blockdev; might be worth adding this to
> the wiki).
>
> I also wrote a small script to test all combinations (nothing smart,
> really, simply enumeration of combos, but I'll consider putting it up
> on the wiki as well), and I was actually surprised by the results. To
> test if the RAID was being re-created correctly with each combination,
> I used `file -s` on the RAID, and verified that the results made
> sense. I am surprised to find out that there are multiple combinations
> that make sense (note that the disk names are shifted by one compared
> to previous emails due a machine lockup that required a reboot and
> another disk butting in to a different order):
>
> trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
>
> trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
> /dev/md111: Linux rev 1.0 ext4 filesystem data,
> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
> (needs journal recovery) (extents) (large files) (huge files)
> :
> So there are six out of 24 combinations that make sense, at least for
> the first block. I know from the pre-fail dmesg that the g-f-e-d order
> should be the correct one, but now I'm left wondering if there is a
> better way to verify this (other than manually sampling files to see
> if they make sense), or if the left-symmetric layout on a RAID6 simply
> allows some of the disk positions to be swapped without loss of data.
>
Your script has reported all arrangements with /dev/sdf as the
second device. Presumably that is where the single block you are
reading resides.
To check whether a RAID6 arrangement is credible, you can try the
raid6check program that is included in the mdadm source release.
There is a man page. If the order of devices is not correct,
raid6check will tell you about it.
NeilBrown
* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-23 16:17 UTC (permalink / raw)
To: NeilBrown; +Cc: John Stoffel, linux-raid
On Fri, Dec 23, 2016 at 12:25 AM, NeilBrown <neilb@suse.com> wrote:
> On Fri, Dec 23 2016, Giuseppe Bilotta wrote:
>> I also wrote a small script to test all combinations (nothing smart,
>> really, simply enumeration of combos, but I'll consider putting it up
>> on the wiki as well), and I was actually surprised by the results. To
>> test if the RAID was being re-created correctly with each combination,
>> I used `file -s` on the RAID, and verified that the results made
>> sense. I am surprised to find out that there are multiple combinations
>> that make sense (note that the disk names are shifted by one compared
>> to previous emails due a machine lockup that required a reboot and
>> another disk butting in to a different order):
>>
>> trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>>
>> trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>> (needs journal recovery) (extents) (large files) (huge files)
>> :
>> So there are six out of 24 combinations that make sense, at least for
>> the first block. I know from the pre-fail dmesg that the g-f-e-d order
>> should be the correct one, but now I'm left wondering if there is a
>> better way to verify this (other than manually sampling files to see
>> if they make sense), or if the left-symmetric layout on a RAID6 simply
>> allows some of the disk positions to be swapped without loss of data.
> You script has reported all arrangements with /dev/sdf as the second
> device. Presumably that is where the single block you are reading
> resides.
That makes sense.
> To check if a RAID6 arrangement is credible, you can try the raid6check
> program that is include in the mdadm source release. There is a man
> page.
> If the order of devices is not correct raid6check will tell you about
> it.
That's a wonderful small utility, thanks for making it known to me!
Checking even just a small number of stripes was enough in this case,
as the expected combination (g f e d) was the only one that produced
no errors.
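(For the record, the invocation was along the lines of

  raid6check /dev/md111 0 1024

i.e. device, starting stripe and number of stripes to check, if I'm
reading its usage right.)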
Now I wonder if it would be possible to combine this approach with
something that simply hacked the metadata of each disk to
re-establish the correct disk order, making it possible to reassemble
this particular array without re-creating anything. Are problems such
as mine common enough to warrant making this kind of verified
reassembly from assumed-clean disks easier?
--
Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-23 21:14 UTC (permalink / raw)
To: NeilBrown; +Cc: John Stoffel, linux-raid
On Fri, Dec 23, 2016 at 5:17 PM, Giuseppe Bilotta
<giuseppe.bilotta@gmail.com> wrote:
>
> Now I wonder if it it would be possible to combine this approach with
> something that simply hacked the metadata of each disk to re-establish
> the correct disk order to make it possible to reassemble this
> particular array without recreating anything. Are problems such as
> mine common enough to warrant support for this kind of verified
> reassembly from assumed-clean disks easier?.
Actually, now that the correct order is verified, I would like to know
why re-creating the array using mdadm -C --assume-clean with the disks
in the correct order works (the RAID is then accessible, and I can
read data off of it).
However, if I simply hand-edit the metadata to assign the correct
device order to the disks (I do this by restoring the correct device
roles in the dev_roles table, at the entries corresponding to the
disks' dev_numbers, in the correct order, and then adjusting the
checksum accordingly) and then assemble the array, I get I/O errors
when accessing the array contents, even though raid6check doesn't
report any issues.
In the 'hacked dev role' case, the dmesg reads:
[ +0.002057] md: bind<dm-2>
[ +0.000936] md: bind<dm-1>
[ +0.000932] md: bind<dm-0>
[ +0.000925] md: bind<dm-3>
[ +0.001443] md/raid:md112: device dm-3 operational as raid disk 0
[ +0.000540] md/raid:md112: device dm-0 operational as raid disk 3
[ +0.000710] md/raid:md112: device dm-1 operational as raid disk 2
[ +0.000508] md/raid:md112: device dm-2 operational as raid disk 1
[ +0.009716] md/raid:md112: allocated 4374kB
[ +0.000555] md/raid:md112: raid level 6 active with 4 out of 4
devices, algorithm 2
[ +0.000531] RAID conf printout:
[ +0.000001] --- level:6 rd:4 wd:4
[ +0.000001] disk 0, o:1, dev:dm-3
[ +0.000001] disk 1, o:1, dev:dm-2
[ +0.000000] disk 2, o:1, dev:dm-1
[ +0.000001] disk 3, o:1, dev:dm-0
[ +0.000449] created bitmap (22 pages) for device md112
[ +0.001865] md112: bitmap initialized from disk: read 2 pages, set 5
of 44711 bits
[ +0.533458] md112: detected capacity change from 0 to 6000916561920
[ +0.004194] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.003450] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001953] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001978] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001852] ldm_validate_partition_table(): Disk read failed.
[ +0.001889] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001875] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001834] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001596] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001551] Dev md112: unable to read RDB block 0
[ +0.001293] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001284] Buffer I/O error on dev md112, logical block 0, async page read
[ +0.001307] md112: unable to read partition table
So the array assembles, and raid6check reports no errors, but the
data is actually inaccessible... am I missing other aspects of the
metadata that need to be restored?
--
Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
From: NeilBrown @ 2016-12-23 22:46 UTC (permalink / raw)
To: Giuseppe Bilotta; +Cc: John Stoffel, linux-raid
On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
> On Fri, Dec 23, 2016 at 12:25 AM, NeilBrown <neilb@suse.com> wrote:
>> On Fri, Dec 23 2016, Giuseppe Bilotta wrote:
>>> I also wrote a small script to test all combinations (nothing smart,
>>> really, simply enumeration of combos, but I'll consider putting it up
>>> on the wiki as well), and I was actually surprised by the results. To
>>> test if the RAID was being re-created correctly with each combination,
>>> I used `file -s` on the RAID, and verified that the results made
>>> sense. I am surprised to find out that there are multiple combinations
>>> that make sense (note that the disk names are shifted by one compared
>>> to previous emails due a machine lockup that required a reboot and
>>> another disk butting in to a different order):
>>>
>>> trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>> :
>>> So there are six out of 24 combinations that make sense, at least for
>>> the first block. I know from the pre-fail dmesg that the g-f-e-d order
>>> should be the correct one, but now I'm left wondering if there is a
>>> better way to verify this (other than manually sampling files to see
>>> if they make sense), or if the left-symmetric layout on a RAID6 simply
>>> allows some of the disk positions to be swapped without loss of data.
>
>> Your script has reported all arrangements with /dev/sdf as the second
>> device. Presumably that is where the single block you are reading
>> resides.
>
> That makes sense.
>
>> To check if a RAID6 arrangement is credible, you can try the raid6check
>> program that is included in the mdadm source release. There is a man
>> page.
>> If the order of devices is not correct, raid6check will tell you about
>> it.
>
> That's a wonderful small utility, thanks for making it known to me!
> Checking even just a small number of stripes was enough in this case,
> as the expected combination (g f e d) was the only one that produced
> no errors.
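> (Concretely, the check for each candidate ordering boils down to
> something like
>
>   raid6check /dev/md111 0 256
>
> i.e. md device, starting stripe, and number of stripes to check; the
> 256 is just an arbitrary small sample, and the raid6check man page has
> the exact arguments.)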
>
> Now I wonder if it would be possible to combine this approach with
> something that simply hacked the metadata of each disk to re-establish
> the correct disk order, making it possible to reassemble this
> particular array without recreating anything. Are problems such as
> mine common enough to warrant making this kind of verified reassembly
> from assumed-clean disks easier?
The way I look at this sort of question is to ask "what is the root
cause?", and then "What is the best response to the consequences of that
root cause?".
In your case, I would look at the sequence of events that led to you
needing to re-create your array, and ask "At which point could md or
mdadm have done something differently?".
If you, or someone, can describe precisely how to reproduce your outcome
- so that I can reproduce it myself - then I'll happily have a look and
see at which point something different could have happened.
Until then, I think the best response to these situations is to ask for
help, and to have tools which allow details to be extracted and repairs
to be made.
NeilBrown
* Re: Recovering a RAID6 after all disks were disconnected
2016-12-23 21:14 ` Giuseppe Bilotta
@ 2016-12-23 22:50 ` NeilBrown
2016-12-24 14:47 ` Giuseppe Bilotta
0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2016-12-23 22:50 UTC (permalink / raw)
To: Giuseppe Bilotta; +Cc: John Stoffel, linux-raid
On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>
>
> So the array assembles, and raid6check reports no error, but the data
> is actually inaccessible .. am I missing other aspects of the metadata
> that need to be restored?
Presumably, yes.
If you provide "mdadm --examine" from devices in both the "working" and
the "not working" case, I might be able to point to the difference.
Alternately, use "mdadm --dump" to extract the metadata, then "tar
--sparse" to combine the (sparse) metadata files into a tar-archive, and
send that. Then I would be able to experiment myself.
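Something along these lines should do (device names are just examples,
and the exact --dump syntax is worth checking against the mdadm man
page for your version):
  # plain text report of the superblocks
  mdadm --examine /dev/sd[defg] > examine.txt
  # sparse per-device metadata images, then a compact archive of them
  mkdir /tmp/md-meta
  mdadm --dump=/tmp/md-meta /dev/sd[defg]
  tar --sparse -czf md-meta.tar.gz -C /tmp md-meta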
NeilBrown
* Re: Recovering a RAID6 after all disks were disconnected
2016-12-23 22:46 ` NeilBrown
@ 2016-12-24 14:34 ` Giuseppe Bilotta
0 siblings, 0 replies; 12+ messages in thread
From: Giuseppe Bilotta @ 2016-12-24 14:34 UTC (permalink / raw)
To: NeilBrown; +Cc: John Stoffel, linux-raid
On Fri, Dec 23, 2016 at 11:46 PM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>>
>> Now I wonder if it would be possible to combine this approach with
>> something that simply hacked the metadata of each disk to re-establish
>> the correct disk order, making it possible to reassemble this
>> particular array without recreating anything. Are problems such as
>> mine common enough to warrant making this kind of verified reassembly
>> from assumed-clean disks easier?
>
> The way I look at this sort of question is to ask "what is the root
> cause?", and then "What is the best response to the consequences of that
> root cause?".
>
> In your case, I would look at the sequence of events that led to you
> needing to re-create your array, and ask "At which point could md or
> mdadm have done something differently?".
>
> If you, or someone, can describe precisely how to reproduce your outcome
> - so that I can reproduce it myself - then I'll happily have a look and
> see at which point something different could have happened.
As I mentioned in the first post, the root of the issue is cheap
hardware plus user error. Basically, all disks in this RAID are hosted
on a JBOD that has a tendency to 'disappear' at times. I've seen this
happen generally when one of the disks acts up (in which case Linux
attempting to reset it leads to a reset of the whole JBOD, which makes
all disks disappear until the device recovers). The JBOD is connected
via USB3, but I had the same issues when using an eSATA connection
with a port multiplier, and from what I've read this is a known
limitation of SATA (as opposed to professional setups based on SAS).
When this happens, md ends up removing all devices from the RAID. The
proper way to handle this, I've found, is to unmount the filesystem,
stop the array, and then reassemble and remount it as soon as the JBOD
is back online (a rough sketch of the command sequence is at the end of
this mail). With this approach the RAID recovers in pretty good shape
(aside, possibly, from the disk that is acting up). However, it's a bit
bothersome, and freeing up everything that is using the filesystem so
it can be unmounted may take a while, sometimes to the point of
requiring a reboot. So the last time this happened I tried something
different,
and I made the mistake of trying a re-add of all the disks. This
resulted in the disks being marked as spares because md could not
restore the RAID functionality after having dropped to 0 disks.
I'm not sure how this could be handled differently, unless mdraid could
be made not to kick all the disks out when the whole JBOD disappears,
but rather to wait for it to come back?
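For reference, the "clean" recovery sequence that has worked for me is
roughly the following (the mount point is just a placeholder, and the
fsck is only for when journal replay isn't enough):
  umount /mnt/oneforall                       # may need fuser/lsof to chase down users
  mdadm --stop /dev/md112
  # ... wait for the JBOD and its member disks to reappear ...
  mdadm --assemble /dev/md112 /dev/sd[defg]
  fsck.ext4 /dev/md112
  mount /dev/md112 /mnt/oneforall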
--
Giuseppe "Oblomov" Bilotta
* Re: Recovering a RAID6 after all disks were disconnected
2016-12-23 22:50 ` NeilBrown
@ 2016-12-24 14:47 ` Giuseppe Bilotta
0 siblings, 0 replies; 12+ messages in thread
From: Giuseppe Bilotta @ 2016-12-24 14:47 UTC (permalink / raw)
To: NeilBrown; +Cc: John Stoffel, linux-raid
On Fri, Dec 23, 2016 at 11:50 PM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>>
>>
>> So the array assembles, and raid6check reports no error, but the data
>> is actually inaccessible .. am I missing other aspects of the metadata
>> that need to be restored?
>
> Presumably, yes.
>
> If you provide "mdadm --examine" from devices in both the "working" and
> the "not working" case, I might be able to point to the difference.
I found the culprit. All disks have bad block lists, and the bad block
lists include the initial data sectors (i.e. the sectors pointed at by
the data offset in the superblock). This is quite probably a side
effect of my stupid idea of trying a re-add of all disks after all of
them were kicked out during the JBOD disconnect. One more reason to
just stop the array when this situation arises.
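(In case someone else runs into the same symptoms: the recorded lists
can be inspected per member device with something like
  mdadm --examine-badblocks /dev/sdX
with /dev/sdX being each of the array members; in my case the listed
ranges sit right at the data offset, which is why reads of the array's
first blocks fail even though raid6check finds the parity consistent.)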
--
Giuseppe "Oblomov" Bilotta
end of thread
Thread overview: 12+ messages
2016-12-07 8:59 Recovering a RAID6 after all disks were disconnected Giuseppe Bilotta
2016-12-07 14:31 ` John Stoffel
2016-12-07 17:21 ` Giuseppe Bilotta
2016-12-08 19:02 ` John Stoffel
2016-12-22 23:11 ` Giuseppe Bilotta
2016-12-22 23:25 ` NeilBrown
2016-12-23 16:17 ` Giuseppe Bilotta
2016-12-23 21:14 ` Giuseppe Bilotta
2016-12-23 22:50 ` NeilBrown
2016-12-24 14:47 ` Giuseppe Bilotta
2016-12-23 22:46 ` NeilBrown
2016-12-24 14:34 ` Giuseppe Bilotta