Re: Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.


From: Phil Turmel <philip@turmel.org>
To: Richard Michael <rmichael@edgeofthenet.org>
Cc: linux-raid@vger.kernel.org
Subject: Re: Recovery help? 4-disk RAID5 double-failure, but good disks have event count mismatch.
Date: Sat, 27 Jul 2013 16:43:55 -0400
Message-ID: <51F4310B.3050709@turmel.org>
In-Reply-To: <CABR0jES0HHeyM6_7-whqLQgF6a+u4PCDAdyWPJ5Bp8g1umi2jA@mail.gmail.com>

Hi Richard,

On 07/27/2013 12:46 PM, Richard Michael wrote:
> Hello everyone,
> 
> I have inherited a failed RAID5 and am attempting to recover as much
> data as possible.   Full mdadm -E output at the bottom.

Please also supply "smartctl -x /dev/sdX" for each of your drives.

> The RAID is 4 SATA disks, /dev/sd[abcd]3 and EXT4.
> 
> One disk is unable to talk to the controller, another is out-of-date,
> the remaining two are current and match each other.
> 
> sdb spins up but fails to talk, the kernel hard resets the link
> several times, then slows the link to 1.5Gb/s and retries, then
> eventually gives up entirely (fail; then "EH complete").  I have no
> /dev node, etc..

Is this still true if you plug it into a different computer?

> Bad sectors were found while ddrescue-copying sdc.  It was actually
> kicked from the array back on 14-July-2013 02:26:00, and thus has a
> lower event count than the remaining two good disks.
> 
> /dev/sdc3:
>   Update Time : Sun Jul 14 02:26:00 2013
>   Checksum : 5a16857a - correct
>   Events : 308375
> 
> 
> The remaining, functioning, disks sd[ad]3 are in "sync" with each
> other, but 10 days (~70,000 events) ahead of sdc3:
> 
> /dev/sd[ad]3:
>   Update Time : Wed Jul 24 14:01:52 2013
>   Checksum : d7cff537 - correct
>   Events : 378389

Ok.  This all makes sense.
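
To make the comparison concrete, here is a minimal Python sketch that
pulls the "Events" value out of each report and shows how far behind
each stale member is.  The excerpts are the lines you quoted; the
function names (event_count, stale_members) are made up for
illustration and are not part of mdadm.

```python
import re

# Excerpts of the "mdadm -E" output quoted above, one entry per member.
# sda3 and sdd3 were reported together, so they carry the same values here.
EXAMINE = {
    "/dev/sdc3": """
  Update Time : Sun Jul 14 02:26:00 2013
  Events : 308375
""",
    "/dev/sda3": """
  Update Time : Wed Jul 24 14:01:52 2013
  Events : 378389
""",
    "/dev/sdd3": """
  Update Time : Wed Jul 24 14:01:52 2013
  Events : 378389
""",
}


def event_count(examine_text):
    """Return the integer that follows 'Events :' in one report."""
    match = re.search(r"^\s*Events\s*:\s*(\d+)", examine_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Events line found")
    return int(match.group(1))


def stale_members(reports):
    """Map each member behind the newest event count to its lag."""
    counts = {dev: event_count(text) for dev, text in reports.items()}
    newest = max(counts.values())
    return {dev: newest - n for dev, n in counts.items() if n < newest}


print(stale_members(EXAMINE))  # {'/dev/sdc3': 70014}
```

Members that do not appear in the result share the newest event count.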

> Questions:
> 
> 0/ Any thoughts on the best method to proceed with recovery?

First, determine whether the problem with /dev/sdb is a failed drive,
failed cabling, or a failed controller.  If it is the cabling or the
controller, attempt to force assembly with /dev/sd[abd]3 in a working
controller/cabling environment.

> 1/ What will happen if I --assemble --force?  I think the low event
> count on sdc3 will be forced up to 378389 and the array will start
> degraded.  The filesystem will be corrupted (missing "real/updated"
> data on sdc3), but I can fsck and check lost+found to find damaged
> file names.  I'll md5sum all against the latest (but old) "backup" to
> find silent corruption.

You are correct.  If /dev/sdb is truly dead, this is the best you can do.
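
For the checksum pass you describe, something along these lines might
do.  It is only a sketch: compare_trees and md5_of are hypothetical
helper names, and the two directory arguments stand for wherever you
mount the recovered filesystem and the old backup.  A mismatch only
means the two copies differ; files legitimately changed since the
backup will show up as well, so the list is a starting point for
manual review.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(recovered_root, backup_root):
    """Return (differing, missing_from_backup) lists of relative paths."""
    recovered_root = Path(recovered_root)
    backup_root = Path(backup_root)
    differing, missing = [], []
    for path in sorted(recovered_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(recovered_root)
        twin = backup_root / rel
        if not twin.is_file():
            # Created after the backup was taken, or lost from it.
            missing.append(str(rel))
        elif md5_of(path) != md5_of(twin):
            # Either modified since the backup or damaged by the recovery.
            differing.append(str(rel))
    return differing, missing
```

Run it against read-only mounts so the comparison itself cannot change
anything.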

> 2/ Could the write intent bitmap on sd[ad]3 go far enough back to
> replay the last ~70K events to sdc3?  Generally, what are the
> limitations of the bitmap -- how many events can be replayed?  I'm not
> sure I have a clear understanding of the WIBM.

Write-intent bitmaps do not contain events.  Just markers for blocks of
sectors that have been written to while an array is degraded.  The
bitmap is an optimization useful when re-adding a failed drive to an
otherwise working array.
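
A toy model may help here.  This is not md's on-disk format, just an
illustration with made-up names (ToyBitmap, note_write) and made-up
sizes: each bit covers a fixed-size chunk of sectors, and a write only
sets the bit for the chunks it touches.

```python
class ToyBitmap:
    """One dirty flag per chunk of sectors; no history, no data."""

    def __init__(self, array_sectors, chunk_sectors):
        self.chunk_sectors = chunk_sectors
        n_chunks = -(-array_sectors // chunk_sectors)  # ceiling division
        self.dirty = [False] * n_chunks

    def note_write(self, start_sector, n_sectors):
        """Mark every chunk overlapped by this write as dirty."""
        first = start_sector // self.chunk_sectors
        last = (start_sector + n_sectors - 1) // self.chunk_sectors
        for chunk in range(first, last + 1):
            self.dirty[chunk] = True

    def chunks_to_resync(self):
        """Chunks that would need copying when the missing member returns."""
        return [i for i, is_dirty in enumerate(self.dirty) if is_dirty]


# Sizes are arbitrary illustration values, not taken from the real array.
bitmap = ToyBitmap(array_sectors=1_000_000, chunk_sectors=131_072)

# Many writes to the same region still set only one bit.
for _ in range(70_000):
    bitmap.note_write(start_sector=200_000, n_sectors=8)
bitmap.note_write(start_sector=900_000, n_sectors=8)

print(bitmap.chunks_to_resync())  # [1, 6]
```

The bitmap records where writes landed, not how many there were or
what they contained, so there is nothing in it to "replay".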

> 3/ Should the sdc superblock indicate information about it being
> kicked?  It's listed as "clean" and sees all the drives active
> ('AAAA').

Drives are generally kicked out of an array when MD fails to write to
them.  If MD cannot write to a drive, how do you expect it to update
that drive's superblock?  Detecting this phenomenon (re-appearance of a
failed drive) is precisely why each drive maintains an event count and a
list of the other drives' status.
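
As a rough sketch of that bookkeeping (a toy model only, with invented
class names, not how md actually stores its metadata): superblock
updates reach only the members that are still active, so a kicked
member keeps its last view of the array.

```python
class Member:
    def __init__(self, name, n_members):
        self.name = name
        self.events = 0
        # This member's recorded view of its peers: 'A' active, '.' missing.
        self.array_state = "A" * n_members


class ToyArray:
    def __init__(self, names):
        self.members = [Member(name, len(names)) for name in names]
        self.active = list(self.members)

    def metadata_update(self):
        """Bump the event count and rewrite superblocks on active members."""
        state = "".join("A" if m in self.active else "." for m in self.members)
        newest = max(m.events for m in self.active) + 1
        for m in self.active:
            m.events = newest
            m.array_state = state

    def kick(self, name):
        """Drop a member; its own superblock is never written again."""
        self.active = [m for m in self.active if m.name != name]
        self.metadata_update()


array = ToyArray(["sda3", "sdb3", "sdc3", "sdd3"])
for _ in range(3):
    array.metadata_update()
array.kick("sdc3")

for m in array.members:
    print(m.name, m.events, m.array_state)
# sda3 4 AA.A
# sdb3 4 AA.A
# sdc3 3 AAAA
# sdd3 4 AA.A
```

The kicked member still claims all four drives are active; only its
lower event count gives it away.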

> 4/ Perhaps beyond the scope of linux-raid, I'm not sure what to do
> about sdb.  I've tried different positions on the controller, and
> re-orienting the drive (vertical, sideways, etc.).  I could send it
> alone for recovery, perhaps.  I don't know how to get lower-level than
> the kernel failing to talk to the device.  Perhaps a vendor diagnostic
> tool?

Try different controllers, different cables (power and data), and if all
else fails, a different computer.  If you do get it talking, include its
"smartctl -x" report too.

> Thank you very much in advance for your time and comments.  I hope
> you're all having a better weekend than I am. :-)

Hope this helps,

Phil
