Re: RAID5 with 2 drive failure at the same time

From: Phil Turmel <philip@turmel.org>
To: Christoph Nelles <evilazrael@evilazrael.de>
Cc: linux-raid@vger.kernel.org
Subject: Re: RAID5 with 2 drive failure at the same time
Date: Fri, 01 Feb 2013 20:24:06 -0500
Message-ID: <510C6AB6.8040900@turmel.org>
In-Reply-To: <510C5E3F.8090807@evilazrael.de>

On 02/01/2013 07:30 PM, Christoph Nelles wrote:
[trim /]

>> If you're using standard desktop drives then you may be running into
>> issues with the drive timeout being longer than the kernel's. You need
>> to reset one or the other to ensure that the drive times out (and is
>> available for subsequent commands) before the kernel does. Most current
>> consumer drives don't allow resetting the timeout, but it's worth trying
>> that first before changing the kernel timeout. For each
>> drive, do:
>>     smartctl -l scterc,70,70 /dev/sdX
>>         || echo 180 > /sys/block/sdX/device/timeout
>>
> 
> Only the WDC Red supports that. The drives on the Marvell Controller all
> report
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled

First, the command should have ended its first line with a backslash, so
that a failure to set SCTERC falls back to setting a 180-second timeout
in the driver.
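
For reference, a minimal sketch of that corrected form, wrapped in a
function so it can be applied per drive.  Like the original one-liner, it
assumes smartctl exits non-zero when the SCTERC command fails.  The
function name and the SYSFS_ROOT variable are hypothetical; the latter
exists only so the function can be dry-run against a scratch directory.

```shell
# Try to set a 7.0 s error-recovery limit on the drive; if that fails,
# fall back to a 180 s command timeout in the kernel driver.
set_timeout() {
    dev=$1                          # bare name, e.g. sda
    sysfs=${SYSFS_ROOT:-/sys}       # hypothetical override for dry runs
    smartctl -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1 \
        || echo 180 > "$sysfs/block/$dev/device/timeout"
}
```

Run as root for each member, e.g. `for d in sda sdb sdc; do set_timeout "$d"; done`
(the drive names here are placeholders).  Neither setting is expected to
survive a reboot, so it would need to be reapplied at boot.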

Second, you list three Hitachi Deskstar 7k3000 drives as being on that
controller.  These have supported SCTERC in the past (I have some of
them) and this is the first I've seen where they don't.  Could you
repeat your smart logs, but with "-x" to get a full report?

> To be honest, I don't trust SMART much and prefer a write/read badblocks
> over SMART tests. But of course I won't do that on a disk which has data
> on it.

I've never found badblocks to be of use, but SMART monitoring for
relocated sectors is vital.

Neither SMART nor badblocks will save you if you have a timeout
mismatch.  Enterprise drives work "out-of-the-box" as they have a
default error recovery timeout of 7.0 seconds.  Any other drive must
either have its timeout set or have the driver timeout raised.  Linux
drivers default to 30 seconds, which is not enough.
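
One way to spot a mismatch is to print each drive's kernel timeout next
to its SCT ERC state.  The sketch below is only an illustration: it
assumes the "Read:" line format shown in the smartctl output quoted
above, treats "Disabled" or a missing report as "no limit", and borrows
the 180-second figure from the fallback command as its threshold.  The
function name and SYSFS_ROOT are hypothetical.

```shell
# Print a drive's kernel command timeout next to its SCT ERC state and
# flag drives with no ERC limit whose kernel timeout is below 180 s.
check_timeout() {
    dev=$1                          # bare name, e.g. sda
    sysfs=${SYSFS_ROOT:-/sys}       # hypothetical override for dry runs
    ktimeout=$(cat "$sysfs/block/$dev/device/timeout")
    # Second field of the "Read:" line, e.g. "70" or "Disabled".
    erc=$(smartctl -l scterc "/dev/$dev" 2>/dev/null \
        | awk '$1 == "Read:" { print $2 }')
    case $erc in
        ''|Disabled) state=unlimited ;;
        *)           state=limited ;;
    esac
    if [ "$state" = unlimited ] && [ "$ktimeout" -lt 180 ]; then
        echo "$dev: kernel=${ktimeout}s erc=${erc:-unsupported} MISMATCH"
    else
        echo "$dev: kernel=${ktimeout}s erc=${erc:-unsupported} ok"
    fi
}
```

A drive reported as MISMATCH is one that can still be in error recovery
when the kernel gives up on the command.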

[trim /]

> I think I don't like this part of the  discussion ("That won't work").

I've gone back through your data, and part of the story is muddled by
the timeout mismatch.  Your kernel logs show "DRDY" status problems
before the drives are kicked out.  That suggests a drive still in error
recovery when the kernel driver times out, then not being able to talk
to the drive to reset the link.  Classic no-win situation with desktop
drives.

> I hope no question is left open

I didn't see anywhere in your reports whether you've tried "--assemble
--force".  That is always the first tool to revive an array that has
kicked out drives on such problems.
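
A rough sketch of such an attempt follows.  The array and member names
are placeholders for whatever the real array uses, and the function name
is made up for illustration.  It stops any inactive, half-assembled
remnant first, since mdadm will not assemble members that are still
claimed by another md device.

```shell
# Stop any half-assembled remnant, then force assembly from the given
# members.
force_assemble() {
    md=$1; shift                    # e.g. /dev/md0, then member devices
    mdadm --stop "$md" 2>/dev/null || true
    mdadm --assemble --force "$md" "$@"
}
```

For example: `force_assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1`,
followed by a look at /proc/mdstat to see what was accepted.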

When you ran badblocks for 2 days, what mode did you use?

Your descriptions and kernel logs suggest that drive is /dev/sdg, but the
"mdadm --examine" reports show /dev/sdg was in the array longer than
/dev/sdj.  Please elaborate.

If you didn't destroy its contents, you should include it in the
"--assemble --force" attempt.  Then, with proper drive timeouts, run a
"check" scrub.  That should fix your UREs.

If you did destroy that drive's contents, you need to clean up the UREs
on the other drives with dd_rescue, then "--assemble --force" with the
remaining drives.
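
The "check" scrub mentioned above for the intact-drive case is started
through the array's sysfs sync_action file.  While it runs, md rebuilds
any sector it cannot read from the remaining members and writes it back,
which is why it clears UREs.  This is a minimal sketch; the array name
md0, the function names, and SYSFS_ROOT are assumptions for illustration.

```shell
# Ask md to read every stripe of the array; unreadable sectors found
# along the way are rebuilt from the other members and rewritten.
start_check() {
    md=$1                           # bare name, e.g. md0
    sysfs=${SYSFS_ROOT:-/sys}       # hypothetical override for dry runs
    echo check > "$sysfs/block/$md/md/sync_action"
}

# Mismatch count reported by the most recent check.
mismatch_count() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/md/mismatch_cnt"
}
```

Progress shows up in /proc/mdstat; `mismatch_count md0` can be read once
the check has finished.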

> Kind regards and thanks for all the help so far

I think it would be useful to provide a fresh set of "mdadm --examine"
reports for all member disks, along with a partial listing of
/dev/disk/by-id/ that shows what serial numbers are assigned to what
device names.
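
Something along these lines would produce the by-id part of that
listing.  It is only a sketch: it assumes the drives appear under the
usual "ata-" prefix, skips the "-partN" entries, and uses a made-up
function name plus a hypothetical BYID_DIR override for dry runs.

```shell
# List whole-disk entries under /dev/disk/by-id with the device node each
# one resolves to, so serial numbers can be matched to sdX names.
list_by_id() {
    dir=${BYID_DIR:-/dev/disk/by-id}    # hypothetical override for dry runs
    for link in "$dir"/ata-*; do
        [ -e "$link" ] || continue      # no match, or a dangling link
        case $link in *-part*) continue ;; esac   # skip partition entries
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```

The output of `list_by_id` can then be posted alongside the
"mdadm --examine" report for each member device.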

I don't think your situation is hopeless.

Phil
