Re: Reconstruct a RAID 6 that has failed in a non typical manner

From: Phil Turmel <philip@turmel.org>
To: Marc Pinhede <marc.pinhede@inria.fr>
Cc: Clement Parisot <clement.parisot@inria.fr>, linux-raid@vger.kernel.org
Subject: Re: Reconstruct a RAID 6 that has failed in a non typical manner
Date: Tue, 17 Nov 2015 08:25:04 -0500
Message-ID: <564B2AB0.5030707@turmel.org>
In-Reply-To: <402863738.19875205.1447763445794.JavaMail.zimbra@inria.fr>

Good morning Marc, Clément,

On 11/17/2015 07:30 AM, Marc Pinhede wrote:
> Hello,
> 
> Thanks for your answer. An update since our last mail: we saved much
> of our data thanks to long and boring rsyncs, with countless reboots.
> During rsync, a drive would sometimes suddenly be marked 'failed' by
> the array. The array was still active (with 12 or 13 of 16 disks), but
> 100% of file copies failed with I/O errors after that. We were then
> forced to reboot, reassemble the array and restart rsync.

Yes, a miserable task on a large array.  Good to know you saved most (?)
of your data.

> During those long operations, we were advised to re-tighten our
> storage bay's screws (carri bay). And this is where the magic
> happened. After tightening them, we had no more problems with drives
> being marked failed. We only had 4 file copy failures with I/O
> errors, and they didn't correspond to a drive failing in the array
> (still working with 14/16 drives).

> We can't guarantee that the problem is fixed, but we went from about
> 10 reboots a day to 5 days of work without problems.

Very good news.  Finding a root cause for a problem greatly raises the
odds future efforts will succeed.

> We now plan to reset and re-introduce, one by one, the two drives that
> were not recognized by the array, and let the array synchronize,
> rewriting data on those drives. Does that sound like a good idea to
> you, or do you think it may fail due to some errors?

Since you've identified a real hardware issue that impacted the entire
array, I wouldn't trust it until every drive is thoroughly wiped and
retested.  Use "badblocks -w -p 2" or similar.  Then construct a new
array and restore your saved data.
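
As a rough illustration of the wipe-and-retest step, here is a minimal
Python sketch that only builds the badblocks command lines, one per
drive, without running them.  The device names are placeholders, and the
added -s and -v (progress and verbose) flags are a convenience assumed
here, not part of the advice above.  Note that -w is destructive: it
overwrites the whole device.

```python
import shlex


def badblocks_commands(devices, passes=2):
    """Build (but do not run) one destructive badblocks command per device."""
    commands = []
    for dev in devices:
        # -w: destructive write-mode test (erases the device)
        # -p N: keep scanning until N consecutive passes find no new bad blocks
        # -s / -v: show progress and be verbose (assumed convenience flags)
        commands.append(["badblocks", "-w", "-s", "-v", "-p", str(passes), dev])
    return commands


if __name__ == "__main__":
    # /dev/sdX and /dev/sdY are placeholders; substitute the real drives.
    for cmd in badblocks_commands(["/dev/sdX", "/dev/sdY"]):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to double-check the device
list before anything is erased.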

[trim /]

>> It's very important that we get a map of drive serial numbers to
>> current device names and the "Device Role" from "mdadm --examine".
>> As an alternative, post the output of "ls -l /dev/disk/by-id/".
>> This is critical information for any future re-create attempts.

If you look closely at the lsdrv output, you'll see it successfully
acquired drive serial numbers for all drives.  However, they are
reported as Adaptec Logical drives -- these might be generated by the
Adaptec firmware, not the real serial numbers.
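
For the by-id alternative mentioned in the quoted request, a small
Python sketch along these lines could record which identifiers currently
point at which device node.  The function name and the choice to skip
"-part" partition links are assumptions made for illustration; with a
RAID controller in the path, the identifiers may still be
firmware-generated rather than the physical drive serials.

```python
import os


def by_id_map(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device node to the by-id names that point at it."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue
        if "-part" in name:
            # Skip partition links; only whole-device identifiers are kept.
            continue
        target = os.path.realpath(link)
        mapping.setdefault(target, []).append(name)
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for device, names in sorted(by_id_map().items()):
            print(device, " ".join(names))
```

Saving this output alongside the "Device Role" lines from
"mdadm --examine" gives a record that survives device renaming.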

> It seems that the mapping changes at each reboot (the two drives that
> host the operating system had different names across reboots). We
> haven't rebooted since we re-tightened the screws, though.

Device names are dependent on device discovery order, which can change
somewhat randomly.  What I've seen with lsdrv is that order doesn't
change within a single controller -- the SCSI addresses
{host:bus:target:lun} have consistent bus:target:lun for a given port on
a controller.  I don't have much experience with Adaptec devices, so I'd
be curious if it holds true for them.
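
One way to check whether that holds on a given controller is to record
the serial-to-address mapping before and after a reboot and compare only
the bus:target:lun part.  The sketch below assumes the addresses have
already been collected by hand (for example from lsdrv output) into
dictionaries keyed by serial number; the function names are
hypothetical.

```python
def parse_scsi_address(addr):
    """Split 'host:bus:target:lun' (braces or brackets optional) into four ints."""
    host, bus, target, lun = (int(part) for part in addr.strip("{}[]").split(":"))
    return host, bus, target, lun


def port_key(addr):
    """Return only (bus, target, lun), ignoring the host number."""
    return parse_scsi_address(addr)[1:]


def moved_ports(before, after):
    """List serials whose bus:target:lun differs between two snapshots."""
    return sorted(
        serial
        for serial in before
        if serial in after and port_key(before[serial]) != port_key(after[serial])
    )


if __name__ == "__main__":
    # Made-up serials and addresses, purely for illustration.
    before = {"SERIAL_A": "0:0:1:0", "SERIAL_B": "0:0:2:0"}
    after = {"SERIAL_A": "1:0:1:0", "SERIAL_B": "1:0:3:0"}
    print(moved_ports(before, after))
```

An empty result would suggest the port positions are stable even though
the host number and the /dev/sdX names changed.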

>> The rest of the information from smartctl is important, and you
>> should upgrade your system to a level that supports it, but it can
>> wait for later.

Consider compiling a local copy of the latest smartctl instead of using
a chroot.  Supply the SCSI address shown in lsdrv to the
"-d aacraid,H,L,ID" option.
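
A minimal sketch of turning an lsdrv-style SCSI address into that
smartctl argument follows.  It assumes, per my reading of the smartctl
man page, that the three numbers after "aacraid," are host, LUN and ID
in that order, and that the SCSI target number is the ID; the device
node and the -a flag are placeholders chosen for illustration.  Verify
against the man page of the smartctl version you build.

```python
def smartctl_aacraid_args(scsi_address, device="/dev/sda"):
    """Build a smartctl argument list for a disk behind an aacraid controller."""
    host, _bus, target, lun = (int(part) for part in scsi_address.split(":"))
    # Assumed order for the aacraid device type: host number, LUN, ID.
    device_type = f"aacraid,{host},{lun},{target}"
    return ["smartctl", "-a", "-d", device_type, device]


if __name__ == "__main__":
    # "0:1:5:0" is a made-up host:bus:target:lun address.
    print(" ".join(smartctl_aacraid_args("0:1:5:0")))
```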

Regards,

Phil
