* Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.
@ 2013-07-27 16:46 Richard Michael
2013-07-27 20:43 ` Phil Turmel
From: Richard Michael @ 2013-07-27 16:46 UTC (permalink / raw)
To: linux-raid
Hello everyone,
I have inherited a failed RAID5 and am attempting to recover as much
data as possible. Full mdadm -E output at the bottom.
The RAID is 4 SATA disks, /dev/sd[abcd]3 and EXT4.
One disk is unable to talk to the controller, another is out-of-date,
the remaining two are current and match each other.
sdb spins up but fails to talk, the kernel hard resets the link
several times, then slows the link to 1.5Gb/s and retries, then
eventually gives up entirely (fail; then "EH complete"). I have no
/dev node, etc.
Bad sectors were found while ddrescue-copying sdc. It was actually
kicked from the array back on 14-July-2013 02:26:00, and thus has a
lower event count than the remaining two good disks.
/dev/sdc3:
Update Time : Sun Jul 14 02:26:00 2013
Checksum : 5a16857a - correct
Events : 308375
The remaining, functioning, disks sd[ad]3 are in "sync" with each
other, but 10 days (~70,000 events) ahead of sdc3:
/dev/sd[ad]3:
Update Time : Wed Jul 24 14:01:52 2013
Checksum : d7cff537 - correct
Events : 378389
Questions:
0/ Any thoughts on the best method to proceed with recovery?
1/ What will happen if I --assemble --force? I think the low event
count on sdc3 will be forced up to 378389 and the array will start
degraded. The filesystem will be corrupted (missing "real/updated"
data on sdc3), but I can fsck and check lost+found to find damaged
file names. I'll md5sum all against the latest (but old) "backup" to
find silent corruption.
2/ Could the write intent bitmap on sd[ad]3 go far enough back to
replay the last ~70K events to sdc3? Generally, what are the
limitations of the bitmap -- how many events can be replayed? I'm not
sure I have a clear understanding of the WIBM.
3/ Should the sdc superblock indicate information about it being
kicked? It's listed as "clean" and sees all the drives active
('AAAA').
4/ Perhaps beyond the scope of linux-raid, I'm not sure what to do
about sdb. I've tried different positions on the controller, and
re-orienting the drive (vertical, sideways, etc.). I could send it
alone for recovery, perhaps. I don't know how to get lower-level than
the kernel failing to talk to the device. Perhaps a vendor diagnostic
tool?
Thank you very much in advance for your time and comments. I hope
you're all having a better weekend than I am. :-)
Regards,
Richard
Full mdadm -E output:
-----------------------------------------
/dev/sda3:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 05d6b8b5:ad42cf19:452afe4d:a71d6f7c
Name : system.domain.lan:2
Creation Time : Sat Jul 14 22:31:26 2012
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1871190016 (892.25 GiB 958.05 GB)
Array Size : 2806783488 (2676.76 GiB 2874.15 GB)
Used Dev Size : 1871188992 (892.25 GiB 958.05 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 3f59e1c5:c00d4583:2770f4a8:2e54ac7e
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Jul 24 14:01:52 2013
Checksum : d7cff537 - correct
Events : 378389
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : A..A ('A' == active, '.' == missing)
/dev/sdc3:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 05d6b8b5:ad42cf19:452afe4d:a71d6f7c
Name : system.domain.lan:2
Creation Time : Sat Jul 14 22:31:26 2012
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1871190016 (892.25 GiB 958.05 GB)
Array Size : 2806783488 (2676.76 GiB 2874.15 GB)
Used Dev Size : 1871188992 (892.25 GiB 958.05 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
State : clean
Device UUID : b6ceedcc:9bbe475c:a683e0f1:308e04d8
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Jul 14 02:26:00 2013
Checksum : 5a16857a - correct
Events : 308375
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing)
/dev/sdd3:
Magic : a92b4efc
Version : 1.1
Feature Map : 0x1
Array UUID : 05d6b8b5:ad42cf19:452afe4d:a71d6f7c
Name : system.domain.lan:2
Creation Time : Sat Jul 14 22:31:26 2012
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1871190016 (892.25 GiB 958.05 GB)
Array Size : 2806783488 (2676.76 GiB 2874.15 GB)
Used Dev Size : 1871188992 (892.25 GiB 958.05 GB)
Data Offset : 2048 sectors
Super Offset : 0 sectors
State : clean
Device UUID : 2e0b87b5:f22c5571:fb5dc447:307eec7f
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Jul 24 14:01:52 2013
Checksum : 5599c482 - correct
Events : 378389
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : A..A ('A' == active, '.' == missing)
* Re: Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.
From: Phil Turmel @ 2013-07-27 20:43 UTC (permalink / raw)
To: Richard Michael; +Cc: linux-raid
Hi Richard,
On 07/27/2013 12:46 PM, Richard Michael wrote:
> Hello everyone,
>
> I have inherited a failed RAID5 and am attempting to recover as much
> data as possible. Full mdadm -E output at the bottom.
Please also supply "smartctl -x /dev/sdX" for each of your drives.
> The RAID is 4 SATA disks, /dev/sd[abcd]3 and EXT4.
>
> One disk is unable to talk to the controller, another is out-of-date,
> the remaining two are current and match each other.
>
> sdb spins up but fails to talk, the kernel hard resets the link
> several times, then slows the link to 1.5Gb/s and retries, then
> eventually gives up entirely (fail; then "EH complete"). I have no
> /dev node, etc.
Is this still true if you plug it into a different computer?
> Bad sectors were found while ddrescue-copying sdc. It was actually
> kicked from the array back on 14-July-2013 02:26:00, and thus has a
> lower event count than the remaining two good disks.
>
> /dev/sdc3:
> Update Time : Sun Jul 14 02:26:00 2013
> Checksum : 5a16857a - correct
> Events : 308375
>
>
> The remaining, functioning, disks sd[ad]3 are in "sync" with each
> other, but 10 days (~70,000 events) ahead of sdc3:
>
> /dev/sd[ad]3:
> Update Time : Wed Jul 24 14:01:52 2013
> Checksum : d7cff537 - correct
> Events : 378389
Ok. This all makes sense.
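As a rough illustration, the event-count comparison described above can be
sketched in Python. The excerpts are copied from the mdadm -E output in
this thread; the names event_count and stale_members are hypothetical
helpers, not part of mdadm.

```python
import re

# Excerpts copied from the `mdadm -E` output posted in this thread.
EXAMINE = {
    "/dev/sda3": "Update Time : Wed Jul 24 14:01:52 2013\nEvents : 378389\n",
    "/dev/sdc3": "Update Time : Sun Jul 14 02:26:00 2013\nEvents : 308375\n",
    "/dev/sdd3": "Update Time : Wed Jul 24 14:01:52 2013\nEvents : 378389\n",
}


def event_count(examine_text):
    """Return the integer after 'Events :' in one device's mdadm -E text."""
    match = re.search(r"^\s*Events\s*:\s*(\d+)\s*$", examine_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Events line found")
    return int(match.group(1))


def stale_members(examine_by_device):
    """Map each device that is behind the newest event count to its lag."""
    counts = {dev: event_count(text) for dev, text in examine_by_device.items()}
    newest = max(counts.values())
    return {dev: newest - n for dev, n in counts.items() if n < newest}


if __name__ == "__main__":
    print(stale_members(EXAMINE))
```

With the figures from this thread, only /dev/sdc3 is reported as stale,
with a lag of 378389 - 308375 = 70014 events.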
> Questions:
>
> 0/ Any thoughts on the best method to proceed with recovery?
First, determine if the problem with /dev/sdb is a failed drive, failed
cabling, or failed controller. If either of the latter, attempt to
force assembly with /dev/sd[abd]3 in a working controller/cabling
environment.
> 1/ What will happen if I --assemble --force? I think the low event
> count on sdc3 will be forced up to 378389 and the array will start
> degraded. The filesystem will be corrupted (missing "real/updated"
> data on sdc3), but I can fsck and check lost+found to find damaged
> file names. I'll md5sum all against the latest (but old) "backup" to
> find silent corruption.
You are correct. If /dev/sdb is truly dead, this is the best you can do.
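A toy Python model of RAID5 parity may help show which stripes get
damaged when a stale member is forced back in. It assumes a single stripe
with data chunks on sda3, sdb3, and sdc3 and parity on sdd3 (the real
left-symmetric layout rotates parity), and 4-byte chunks instead of 512K.
The helper names are hypothetical.

```python
def xor_chunks(*chunks):
    """Byte-wise XOR of equal-length chunks (RAID5 parity arithmetic)."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def rebuild_missing(*surviving_chunks):
    """A missing chunk is the XOR of every surviving chunk in the stripe."""
    return xor_chunks(*surviving_chunks)


# Toy stripe contents at the moment sdc3 was kicked.
a, b, c_at_kick = b"AAAA", b"BBBB", b"CCCC"

# Case 1: the stripe was never written afterwards, so sdc3 is still accurate.
parity = xor_chunks(a, b, c_at_kick)
untouched_b = rebuild_missing(a, c_at_kick, parity)

# Case 2: the chunk that lives on sdc3 was rewritten while sdc3 was absent.
# Only the parity on sdd3 could be updated; sdc3 itself kept the old bytes.
c_current = b"cccc"
parity_after_write = xor_chunks(a, b, c_current)
damaged_b = rebuild_missing(a, c_at_kick, parity_after_write)

if __name__ == "__main__":
    print("untouched stripe rebuilds sdb3 correctly:", untouched_b == b)
    print("rewritten stripe rebuilds sdb3 correctly:", damaged_b == b)
```

In this model, stripes untouched since the kick still rebuild the missing
sdb3 chunk correctly. Stripes whose sdc3 chunk changed afterwards return
stale data from sdc3 and a wrong reconstruction for sdb3.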
> 2/ Could the write intent bitmap on sd[ad]3 go far enough back to
> replay the last ~70K events to sdc3? Generally, what are the
> limitations of the bitmap -- how many events can be replayed? I'm not
> sure I have a clear understanding of the WIBM.
Write-intent bitmaps do not contain events, just markers for blocks of
sectors that have been written to while an array is degraded. The
bitmap is an optimization useful when re-adding a failed drive to an
otherwise working array: only the marked blocks need to be resynced, so
it cannot replay individual events.
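The idea can be sketched as a toy Python class. This is a simplified
model, not md's on-disk format: the class name, the sector counts, and
the one-bit-per-region granularity are all assumptions for illustration.

```python
class WriteIntentBitmap:
    """Toy model: one dirty flag per fixed-size region of the array."""

    def __init__(self, array_sectors, sectors_per_bit):
        self.sectors_per_bit = sectors_per_bit
        nbits = -(-array_sectors // sectors_per_bit)  # ceiling division
        self.dirty = [False] * nbits

    def note_write(self, start_sector, sector_count):
        """Mark every region touched by a write; no history is kept."""
        first = start_sector // self.sectors_per_bit
        last = (start_sector + sector_count - 1) // self.sectors_per_bit
        for bit in range(first, last + 1):
            self.dirty[bit] = True

    def regions_to_resync(self):
        """Regions a re-added drive would need copied, in index order."""
        return [i for i, is_dirty in enumerate(self.dirty) if is_dirty]


if __name__ == "__main__":
    bitmap = WriteIntentBitmap(array_sectors=1024, sectors_per_bit=128)
    bitmap.note_write(100, 60)   # spans regions 0 and 1
    bitmap.note_write(100, 8)    # same region again: nothing new recorded
    bitmap.note_write(900, 1)    # region 7
    print(bitmap.regions_to_resync())
```

Repeated writes to the same region leave a single flag set, which is why
the bitmap can say where a re-added drive is out of date but not what
sequence of events happened there.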
> 3/ Should the sdc superblock indicate information about it being
> kicked? It's listed as "clean" and sees all the drives active
> ('AAAA').
Drives are generally kicked out of an array when MD fails to write to
them. If MD cannot write to a drive, how do you expect it to update
that drive's superblock? Detecting this phenomenon (re-appearance of a
failed drive) is precisely why each drive maintains an event count and a
list of the other drives' status.
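A toy Python model of this bookkeeping, assuming superblock updates simply
skip any member that can no longer be written. The Member class and
update_superblocks function are hypothetical, and the intermediate event
count 308376 and the state recorded for sdb3 are illustrative, since the
thread has no superblock from sdb3.

```python
class Member:
    """One array member with the two superblock fields discussed here."""

    def __init__(self, name):
        self.name = name
        self.writable = True
        self.events = 0
        self.array_state = ""


def update_superblocks(members, events):
    """Record the event count and array state on every reachable member."""
    state = "".join("A" if m.writable else "." for m in members)
    for m in members:
        if m.writable:
            m.events = events
            m.array_state = state


sda3, sdb3, sdc3, sdd3 = (Member(n) for n in ("sda3", "sdb3", "sdc3", "sdd3"))
members = [sda3, sdb3, sdc3, sdd3]

update_superblocks(members, 308375)   # last update that reached sdc3
sdc3.writable = False                 # sdc3 is kicked
update_superblocks(members, 308376)   # illustrative intermediate update
sdb3.writable = False                 # sdb3 stops responding
update_superblocks(members, 378389)   # final update seen on sda3 and sdd3

if __name__ == "__main__":
    for m in members:
        print(m.name, m.events, m.array_state)
```

In this model sdc3 keeps 308375 and 'AAAA' while sda3 and sdd3 end at
378389 and 'A..A', matching the mdadm -E output in this thread.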
> 4/ Perhaps beyond the scope of linux-raid, I'm not sure what to do
> about sdb. I've tried different positions on the controller, and
> re-orienting the drive (vertical, sideways, etc.). I could send it
> alone for recovery, perhaps. I don't know how to get lower-level than
> the kernel failing to talk to the device. Perhaps a vendor diagnostic
> tool?
Try different controllers, different cables (power and data), and if all
else fails, a different computer. If you do get it talking, include its
"smartctl -x" report too.
> Thank you very much in advance for your time and comments. I hope
> you're all having a better weekend than I am. :-)
Hope this helps,
Phil