Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.

* Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.
@ 2013-07-27 16:46 Richard Michael
  2013-07-27 20:43 ` Phil Turmel
  0 siblings, 1 reply; 2+ messages in thread
From: Richard Michael @ 2013-07-27 16:46 UTC
  To: linux-raid

Hello everyone,

I have inherited a failed RAID5 and am attempting to recover as much
data as possible.   Full mdadm -E output at the bottom.

The RAID is 4 SATA disks, /dev/sd[abcd]3 and EXT4.

One disk is unable to talk to the controller, another is out-of-date,
the remaining two are current and match each other.

sdb spins up but fails to talk, the kernel hard resets the link
several times, then slows the link to 1.5Gb/s and retries, then
eventually gives up entirely (fail; then "EH complete").  I have no
/dev node, etc.

Bad sectors were found while ddrescue-copying sdc.  It was actually
kicked from the array back on 14-July-2013 02:26:00, and thus has a
lower event count than the remaining two good disks.

/dev/sdc3:
  Update Time : Sun Jul 14 02:26:00 2013
  Checksum : 5a16857a - correct
  Events : 308375

The remaining, functioning, disks sd[ad]3 are in "sync" with each
other, but 10 days (~70,000 events) ahead of sdc3:

/dev/sd[ad]3:
  Update Time : Wed Jul 24 14:01:52 2013
  Checksum : d7cff537 - correct
  Events : 378389

Questions:

0/ Any thoughts on the best method to proceed with recovery?

1/ What will happen if I --assemble --force?  I think the low event
count on sdc3 will be forced up to 378389 and the array will start
degraded.  The filesystem will be corrupted (missing "real/updated"
data on sdc3), but I can fsck and check lost+found to find damaged
file names.  I'll md5sum all against the latest (but old) "backup" to
find silent corruption.

2/ Could the write intent bitmap on sd[ad]3 go far enough back to
replay the last ~70K events to sdc3?  Generally, what are the
limitations of the bitmap -- how many events can be replayed?  I'm not
sure I have a clear understanding of the WIBM.

3/ Should the sdc superblock indicate information about it being
kicked?  It's listed as "clean" and sees all the drives active
('AAAA').

4/ Perhaps beyond the scope of linux-raid, I'm not sure what to do
about sdb.  I've tried different positions on the controller, and
re-orienting the drive (vertical, sideways, etc.).  I could send it
alone for recovery, perhaps.  I don't know how to get lower-level than
the kernel failing to talk to the device.  Perhaps a vendor diagnostic
tool?

Thank you very much in advance for your time and comments.  I hope
you're all having a better weekend than I am. :-)

Regards,
Richard

Full mdadm -E output:
-----------------------------------------

/dev/sda3:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x1
     Array UUID : 05d6b8b5:ad42cf19:452afe4d:a71d6f7c
           Name : system.domain.lan:2
  Creation Time : Sat Jul 14 22:31:26 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1871190016 (892.25 GiB 958.05 GB)
     Array Size : 2806783488 (2676.76 GiB 2874.15 GB)
  Used Dev Size : 1871188992 (892.25 GiB 958.05 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 3f59e1c5:c00d4583:2770f4a8:2e54ac7e

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jul 24 14:01:52 2013
       Checksum : d7cff537 - correct
         Events : 378389

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : A..A ('A' == active, '.' == missing)

/dev/sdc3:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x1
     Array UUID : 05d6b8b5:ad42cf19:452afe4d:a71d6f7c
           Name : system.domain.lan:2
  Creation Time : Sat Jul 14 22:31:26 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1871190016 (892.25 GiB 958.05 GB)
     Array Size : 2806783488 (2676.76 GiB 2874.15 GB)
  Used Dev Size : 1871188992 (892.25 GiB 958.05 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : b6ceedcc:9bbe475c:a683e0f1:308e04d8

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Jul 14 02:26:00 2013
       Checksum : 5a16857a - correct
         Events : 308375

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing)

/dev/sdd3:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x1
     Array UUID : 05d6b8b5:ad42cf19:452afe4d:a71d6f7c
           Name : system.domain.lan:2
  Creation Time : Sat Jul 14 22:31:26 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1871190016 (892.25 GiB 958.05 GB)
     Array Size : 2806783488 (2676.76 GiB 2874.15 GB)
  Used Dev Size : 1871188992 (892.25 GiB 958.05 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 2e0b87b5:f22c5571:fb5dc447:307eec7f

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jul 24 14:01:52 2013
       Checksum : 5599c482 - correct
         Events : 378389

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : A..A ('A' == active, '.' == missing)
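
The size fields in the output above can be cross-checked with a little
arithmetic.  The following is only a sketch: it assumes "Used Dev Size" is
counted in 512-byte sectors and "Array Size" in KiB, which is inferred from
the GiB/GB figures mdadm prints in parentheses rather than stated anywhere
in this thread.  The function name is made up for illustration.

```python
# Values copied from the mdadm -E output above.
RAID_DEVICES = 4
USED_DEV_SIZE_SECTORS = 1871188992   # "Used Dev Size" (assumed 512-byte sectors)
ARRAY_SIZE_KIB = 2806783488          # "Array Size" (assumed KiB)

SECTOR_BYTES = 512


def raid5_usable_bytes(devices: int, used_dev_sectors: int) -> int:
    """RAID5 spends one device's worth of space on parity, so n-1 hold data."""
    return (devices - 1) * used_dev_sectors * SECTOR_BYTES


usable = raid5_usable_bytes(RAID_DEVICES, USED_DEV_SIZE_SECTORS)
sizes_agree = usable == ARRAY_SIZE_KIB * 1024
human = f"{usable / 2**30:.2f} GiB {usable / 1e9:.2f} GB"

print(sizes_agree)
print(human)
```

If the assumed units are right, three data-bearing members account for the
whole array size, which is consistent with a 4-device RAID5.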


* Re: Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.
  2013-07-27 16:46 Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch Richard Michael
@ 2013-07-27 20:43 ` Phil Turmel
  0 siblings, 0 replies; 2+ messages in thread
From: Phil Turmel @ 2013-07-27 20:43 UTC
  To: Richard Michael; +Cc: linux-raid

Hi Richard,

On 07/27/2013 12:46 PM, Richard Michael wrote:
> Hello everyone,
> 
> I have inherited a failed RAID5 and am attempting to recover as much
> data as possible.   Full mdadm -E output at the bottom.

Please also supply "smartctl -x /dev/sdX" for each of your drives.

> The RAID is 4 SATA disks, /dev/sd[abcd]3 and EXT4.
> 
> One disk is unable to talk to the controller, another is out-of-date,
> the remaining two are current and match each other.
> 
> sdb spins up but fails to talk, the kernel hard resets the link
> several times, then slows the link to 1.5Gb/s and retries, then
> eventually gives up entirely (fail; then "EH complete").  I have no
> /dev node, etc.

Is this still true if you plug it into a different computer?

> Bad sectors were found while ddrescue-copying sdc.  It was actually
> kicked from the array back on 14-July-2013 02:26:00, and thus has a
> lower event count than the remaining two good disks.
> 
> /dev/sdc3:
>   Update Time : Sun Jul 14 02:26:00 2013
>   Checksum : 5a16857a - correct
>   Events : 308375
> 
> 
> The remaining, functioning, disks sd[ad]3 are in "sync" with each
> other, but 10 days (~70,000 events) ahead of sdc3:
> 
> /dev/sd[ad]3:
>   Update Time : Wed Jul 24 14:01:52 2013
>   Checksum : d7cff537 - correct
>   Events : 378389

Ok.  This all makes sense.

> Questions:
> 
> 0/ Any thoughts on the best method to proceed with recovery?

First, determine if the problem with /dev/sdb is a failed drive, failed
cabling, or failed controller.  If either of the latter, attempt to
force assembly with /dev/sd[abd]3 in a working controller/cabling
environment.

> 1/ What will happen if I --assemble --force?  I think the low event
> count on sdc3 will be forced up to 378389 and the array will start
> degraded.  The filesystem will be corrupted (missing "real/updated"
> data on sdc3), but I can fsck and check lost+found to find damaged
> file names.  I'll md5sum all against the latest (but old) "backup" to
> find silent corruption.

You are correct.  If /dev/sdb is truly dead, this is the best you can do.
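
To see why forcing the stale sdc3 into the array corrupts data, here is a toy
model of RAID5 parity on a single stripe.  It is a sketch under simplifying
assumptions: parity always sits on sdd (the real left-symmetric layout
rotates it), blocks are four bytes long, and the helper name xor_blocks is
invented for this example.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (the RAID5 parity operation)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Stripe 0: never written after sdc3 was kicked, so sdc's copy is still valid.
s0 = {"sda": b"AAAA", "sdb": b"BBBB", "sdc": b"CCCC"}
s0["sdd"] = xor_blocks(s0["sda"], s0["sdb"], s0["sdc"])  # parity block

# Stripe 1: the block that lives on sdc was rewritten while sdc3 was out.
# The degraded array could only fold the new data into the parity block.
s1 = {"sda": b"AAAA", "sdb": b"BBBB", "sdc": b"CCCC"}  # sdc still holds OLD data
new_sdc_data = b"cccc"
s1["sdd"] = xor_blocks(s1["sda"], s1["sdb"], new_sdc_data)

# Forced assembly of sda + stale sdc + sdd: the missing sdb block is rebuilt
# by XOR-ing the three members that are present.
rebuilt_s0 = xor_blocks(s0["sda"], s0["sdc"], s0["sdd"])
rebuilt_s1 = xor_blocks(s1["sda"], s1["sdc"], s1["sdd"])

print(rebuilt_s0 == b"BBBB")  # stripe untouched since the kick
print(rebuilt_s1 == b"BBBB")  # stripe written after the kick
```

In this model, stripes untouched since 14 July rebuild correctly, while any
stripe written afterwards returns old data from sdc and a wrong
reconstruction for sdb.  That is the damage fsck and the md5sum comparison
would then have to find.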

> 2/ Could the write intent bitmap on sd[ad]3 go far enough back to
> replay the last ~70K events to sdc3?  Generally, what are the
> limitations of the bitmap -- how many events can be replayed?  I'm not
> sure I have a clear understanding of the WIBM.

Write-intent bitmaps do not contain events.  Just markers for blocks of
sectors that have been written to while an array is degraded.  The
bitmap is an optimization useful when re-adding a failed drive to an
otherwise working array.
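
A toy model may make the distinction clearer.  This is not md's on-disk
bitmap format; the class, its method names, and the region count are all
hypothetical.  It only illustrates the idea described above: one flag per
region, with no record of when or how often a region was written.

```python
from typing import List


class ToyWriteIntentBitmap:
    """Toy model: one dirty flag per region of the array, nothing else."""

    def __init__(self, regions: int) -> None:
        self.dirty = [False] * regions

    def record_write(self, region: int) -> None:
        # Repeated writes to the same region collapse into a single flag,
        # so there is no history of individual writes to "replay".
        self.dirty[region] = True

    def regions_to_resync(self) -> List[int]:
        # On re-add, only flagged regions need copying to the returning drive.
        return [i for i, flag in enumerate(self.dirty) if flag]


bitmap = ToyWriteIntentBitmap(regions=8)
for region in (2, 5, 2, 2):  # four writes while degraded, two distinct regions
    bitmap.record_write(region)

to_resync = bitmap.regions_to_resync()
print(to_resync)
```

Four writes leave only two flags set, so the model can say where a returning
drive is out of date but not what was written there.  The current data still
has to come from the rest of the array.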

> 3/ Should the sdc superblock indicate information about it being
> kicked?  It's listed as "clean" and sees all the drives active
> ('AAAA').

Drives are generally kicked out of an array when MD fails to write to
them.  If MD cannot write to a drive, how do you expect it to update
that drive's superblock?  Detecting this phenomenon (re-appearance of a
failed drive) is precisely why each drive maintains an event count and a
list of the other drives' status.
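
As a rough sketch of that detection, using the event counts from the
superblocks in this thread: the member with the lower count is the one that
stopped being updated.  The dictionary layout and the classify function are
made up for illustration, and mdadm's real assembly logic is more involved
than a simple maximum.

```python
# Event counts and array states as reported by mdadm -E earlier in the thread.
superblocks = {
    "/dev/sda3": {"events": 378389, "array_state": "A..A"},
    "/dev/sdc3": {"events": 308375, "array_state": "AAAA"},
    "/dev/sdd3": {"events": 378389, "array_state": "A..A"},
}


def classify(members: dict) -> dict:
    """Label each member 'current' or stale relative to the newest event count."""
    newest = max(sb["events"] for sb in members.values())
    result = {}
    for dev, sb in members.items():
        behind = newest - sb["events"]
        result[dev] = "current" if behind == 0 else f"stale by {behind} events"
    return result


status = classify(superblocks)
for dev, label in sorted(status.items()):
    print(dev, label)
```

sdc3 still reports 'AAAA' because its superblock was last written before it
was kicked.  Only the current members record the slots that went missing
afterwards.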

> 4/ Perhaps beyond the scope of linux-raid, I'm not sure what to do
> about sdb.  I've tried different positions on the controller, and
> re-orienting the drive (vertical, sideways, etc.).  I could send it
> alone for recovery, perhaps.  I don't know how to get lower-level than
> the kernel failing to talk to the device.  Perhaps a vendor diagnostic
> tool?

Try different controllers, different cables (power and data), and if all
else fails, a different computer.  If you do get it talking, include its
"smartctl -x" report too.

> Thank you very much in advance for your time and comments.  I hope
> you're all having a better weekend than I am. :-)

Hope this helps,

Phil
