Re: Reconstruct a RAID 6 that has failed in a non typical manner

From: Phil Turmel <philip@turmel.org>
To: Marc Pinhede <marc.pinhede@inria.fr>
Cc: Clement Parisot <clement.parisot@inria.fr>, linux-raid@vger.kernel.org
Subject: Re: Reconstruct a RAID 6 that has failed in a non typical manner
Date: Tue, 17 Nov 2015 08:25:04 -0500
Message-ID: <564B2AB0.5030707@turmel.org>
In-Reply-To: <402863738.19875205.1447763445794.JavaMail.zimbra@inria.fr>

Good morning Marc, Clément,

On 11/17/2015 07:30 AM, Marc Pinhede wrote:
> Hello,
> 
> Thanks for your answer. Update since our last mail: we saved much of
> our data thanks to long and boring rsyncs, with countless reboots.
> During rsync, a drive was sometimes suddenly considered 'failed' by
> the array. The array was still active (with 13 or 12 / 16 disks) but
> 100% of files failed with I/O errors after that. We were then forced
> to reboot, reassemble the array and restart rsync.

Yes, a miserable task on a large array.  Good to know you saved most (?)
of your data.

> During those long operations, we were advised to re-tighten our
> storage bay's screws (carri bay). And this is where the magic
> happened. After screwing them back in, we had no more problems with
> drives being considered failed. We only had 4 file copy failures with
> I/O errors, but they didn't correspond to a drive failing in the
> array (still working with 14/16 drives).

> We can't guarantee that the problem is fixed, but we moved from about
> 10 reboots a day to 5 days of work without problems.

Very good news.  Finding a root cause for a problem greatly raises the
odds future efforts will succeed.

> We now plan to reset and re-introduce, one by one, the two drives
> that were not recognized by the array, and let the array synchronize,
> rewriting data on those drives. Does that sound like a good idea to
> you, or do you think it may fail due to some errors?

Since you've identified a real hardware issue that impacted the entire
array, I wouldn't trust it until every drive is thoroughly wiped and
retested.  Use "badblocks -w -p 2" or similar.  Then construct a new
array and restore your saved data.
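
To make the wipe step concrete, here is a minimal dry-run sketch in
Python.  It only builds and prints the badblocks command for each drive
and never executes anything, since "-w" destroys all data on the target.
The device names and the helper name are placeholders of mine, not taken
from your system.

```python
# Dry-run sketch: build (but do not run) a destructive badblocks
# command for each drive to be wiped and retested.
# Device names below are placeholders, not the real array members.

def badblocks_cmd(device, passes=2):
    """Return the argv list for 'badblocks -w -p <passes> <device>'."""
    # -w: destructive write-mode test; -p: keep scanning until this many
    # consecutive passes find no new bad blocks.
    return ["badblocks", "-w", "-p", str(passes), device]

if __name__ == "__main__":
    for dev in ["/dev/sdX", "/dev/sdY"]:  # hypothetical device names
        print(" ".join(badblocks_cmd(dev)))
```

Printing the commands first gives you a chance to double-check each
device against the drive you actually intend to wipe.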

[trim /]

>> It's very important that we get a map of drive serial numbers to
>> current device names and the "Device Role" from "mdadm --examine".
>> As an alternative, post the output of "ls -l /dev/disk/by-id/".
>> This is critical information for any future re-create attempts.

If you look closely at the lsdrv output, you'll see it successfully
acquired drive serial numbers for all drives.  However, they are
reported as Adaptec Logical drives -- these might be generated by the
Adaptec firmware, not the real serial numbers.
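
If it helps when collecting that map, below is a rough Python sketch
that turns the text output of "ls -l /dev/disk/by-id/" into a dictionary
from by-id name to kernel device name.  The function name and the sample
lines are made up for illustration; the real names on your box will
differ, and with Adaptec logical drives the IDs may still not be the
physical serial numbers.

```python
import os

def parse_by_id_listing(listing):
    """Map each /dev/disk/by-id name to the kernel device it links to."""
    mapping = {}
    for line in listing.splitlines():
        # Symlink lines from 'ls -l' look like: "... NAME -> ../../sdX"
        if " -> " not in line:
            continue  # skips the "total N" header and blank lines
        left, target = line.rsplit(" -> ", 1)
        name = left.split()[-1]
        mapping[name] = os.path.basename(target.strip())
    return mapping

# Hypothetical sample input, not real output from the affected system.
SAMPLE = """total 0
lrwxrwxrwx 1 root root  9 Nov 17 08:00 scsi-EXAMPLE_SERIAL_A -> ../../sda
lrwxrwxrwx 1 root root 10 Nov 17 08:00 scsi-EXAMPLE_SERIAL_A-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Nov 17 08:00 scsi-EXAMPLE_SERIAL_B -> ../../sdb
"""

if __name__ == "__main__":
    for name, dev in sorted(parse_by_id_listing(SAMPLE).items()):
        print(dev, name)
```

Saving one such map per boot, next to the "Device Role" from
"mdadm --examine", would show which entries move between reboots.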

> It seems that the mapping changes at each reboot (the two drives that
> host the operating system had different names across reboots). Since
> we re-tightened the screws, we haven't rebooted though.

Device names depend on device discovery order, which can change
somewhat randomly.  What I've seen with lsdrv is that order doesn't
change within a single controller -- the SCSI addresses
{host:bus:target:lun} have consistent bus:target:lun for a given port on
a controller.  I don't have much experience with Adaptec devices, so I'd
be curious if it holds true for them.
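
To check this across reboots, one could record the SCSI address of each
drive on every boot and compare only the bus:target:lun part.  A small
Python sketch of that comparison follows; the addresses in the usage
note are invented examples, and the helper names are mine.

```python
from collections import namedtuple

ScsiAddress = namedtuple("ScsiAddress", "host bus target lun")

def parse_scsi_address(text):
    """Parse 'host:bus:target:lun' (e.g. '6:0:3:0') into integers."""
    parts = text.strip().strip("{}").split(":")
    if len(parts) != 4:
        raise ValueError("expected host:bus:target:lun, got %r" % text)
    return ScsiAddress(*(int(p) for p in parts))

def port_key(addr):
    """The part expected to stay fixed for a given controller port."""
    return (addr.bus, addr.target, addr.lun)

def same_port(a, b):
    """True if two address strings share the same bus:target:lun."""
    return port_key(parse_scsi_address(a)) == port_key(parse_scsi_address(b))
```

For example, same_port("6:0:3:0", "7:0:3:0") is true because only the
host number differs.  This only makes sense within one controller; with
several controllers the key would also need something that identifies
the controller itself.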

>> The rest of the information from smartctl is important, and you
>> should upgrade your system to a level that supports it, but it can
>> wait for later.

Consider compiling a local copy of the latest smartctl instead of using
a chroot.  Supply the SCSI address shown in lsdrv to the
"-d aacraid,H,L,ID" option.
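
As a rough illustration, the Python sketch below builds such a smartctl
command line from an lsdrv-style host:bus:target:lun address.  smartctl
documents the aacraid arguments as host number, lun and id, in that
order; treating the SCSI target as the id is my assumption and should be
checked against your controller.  The helper name and the example
address are hypothetical.

```python
def smartctl_aacraid_cmd(scsi_address, device):
    """Build a smartctl argv for a drive behind an aacraid controller."""
    host, _bus, target, lun = (int(p) for p in scsi_address.split(":"))
    # smartctl documents the argument order as H,L,ID (host, lun, id).
    # Using the SCSI target as ID is an assumption to verify on real
    # hardware.
    dev_type = "aacraid,%d,%d,%d" % (host, lun, target)
    return ["smartctl", "-a", "-d", dev_type, device]

if __name__ == "__main__":
    # Hypothetical address and device name, for illustration only.
    print(" ".join(smartctl_aacraid_cmd("6:0:3:0", "/dev/sda")))
```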

Regards,

Phil
