Re: recovering RAID5 from multiple disk failures

From: Phil Turmel <philip@turmel.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Michael Ritzert <ksciplot@gmx.net>,
	"linux-raid@vger.kernel.org list" <linux-raid@vger.kernel.org>
Subject: Re: recovering RAID5 from multiple disk failures
Date: Sat, 02 Feb 2013 19:23:32 -0500
Message-ID: <510DAE04.5070709@turmel.org>
In-Reply-To: <D0556FFA-7710-47DC-873C-38820439F054@colorremedies.com>

On 02/02/2013 06:08 PM, Chris Murphy wrote:
> 
> On Feb 2, 2013, at 2:56 PM, Michael Ritzert <ksciplot@gmx.net> 
> wrote:
>> 
>> Chris Murphy <lists@colorremedies.com> wrote:
>>> 
>>> Nevertheless, over an hour and a half is a long time if the file 
>>> system were being updated at all. There'd definitely be 
>>> data/parity mismatches for disk1.
>> 
>> After disk1 failed, the only write access should have been
>> metadata update when the filesystem was mounted.

This is significant.

> Was it mounted ro?
> 
>> I only read data from the filesystem thereafter. So only atime 
>> changes are to be expected, there, and only for a small number of 
>> files that I could capture before disk3 failed. I know which files 
>> are affected, and could leave them alone.
> 
> Even for a small number of files there could be dozens or hundreds
> of chunks altered. I think conservatively you have to consider disk
> 1 out and mount in degraded mode.
> 
>> 
>>> If disk 1 is assumed to be useless, meaning force assemble the 
>>> array in degraded mode; a URE or linux SCSI layer time out is to 
>>> be avoided or the array as a whole fails. Every sector is
>>> needed. So what do you think about raising the linux scsi layer
>>> time out to maybe 2 minutes, and leaving the remaining drive's
>>> SCT ERC alone so that they don't time out sooner, but rather go
>>> into whatever deep recovery they have to in the hopes those bad 
>>> sectors can be read?
>>> 
>>> echo 120 >/sys/block/sdX/device/timeout
>> 
>> I just tried that, but I couldn't see any effect. The error rate 
>> coming in is much higher than 1 every two minutes.
> 
> This timeout is not about error rate. And what the value should be 
> depends on context. In normal operation you want the disk error
> recovery to be short, so that the disk produces a bona fide URE, not a
> SCSI layer timeout error. That way md will correct the bad sector.
> That's what probably wasn't happening in your case, which allowed
> bad sectors to accumulate until it was a real problem.

If you try to recover from the degraded array, this is the correct approach.

> But now, for the cloning process, you want the disk error timeout to 
> be long (or disabled) so that the disk has as long as possible to do 
> ECC to recover each of these problematic sectors. But this also
> means getting the SCSI layer timeout set to at least 1 second longer
> than the longest recovery time for the drives, so that the SCSI layer
> time out doesn't stop sector recovery during cloning. Now maybe the
> disk still won't be able to recover all data from these bad sectors,
> but it's your best shot IMO.

That applies to the array assembled degraded (disk1 left out).
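
For the long-timeout case Chris describes, a rough sketch of the two
settings might look like the following. It only prints the commands for
review and runs nothing. The device names are placeholders, the 120 is
the value from the earlier echo line, and it assumes the drives support
SCT ERC at all and finish their internal recovery in under two minutes.

```shell
# Dry-run sketch: prints the commands instead of running them.
# Device names (sdb, sdc, ...) are placeholders, not taken from the thread.
show_clone_timeout_cmds() {
    for dev in "$@"; do
        # scterc,0,0 disables the drive's error recovery time limit
        # (read,write), so the drive may retry for as long as it wants.
        echo "smartctl -l scterc,0,0 /dev/$dev"
        # The kernel's command timeout must outlast the drive's recovery.
        echo "echo 120 > /sys/block/$dev/device/timeout"
    done
}

show_clone_timeout_cmds sdb sdc
```

Neither setting survives a power cycle of the drive or a reboot, so
both have to be re-applied before each cloning run.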

>> When I assemble the array, I will have all new disks (with good 
>> smart selftests...), so I wouldn't expect timeouts. Instead, junk 
>> data will be returned from the sectors in question¹. How will md 
>> react to that?
> 
> Well yeah, with the new drives, they won't report UREs. So there's
> an ambiguity with any mismatch between data and parity chunks as to 
> which is correct. Without a URE, md doesn't know that the data chunk 
> is right or wrong with RAID 5.

Bingo.  Working from the copies guarantees you won't have correct data
where the UREs are.  (The copies are very good to have, of course.)

> Phil may disagree, and I have to defer to his experience in this,
> but I think the most conservative and best shot you have at getting
> the 20GB you want off the array is this:

I do disagree.

The above, combined with:

> I do know where the bad sectors are from the ddrescue report. We are
> talking about less than 50kB bad data on disk1. Unfortunately, disk3
> is worse, but there is no sector that is bad on both disks.

Leads me to recommend "mdadm --create --assume-clean" using the original
drives, taking care to specify the devices in the proper order (per
their "Raid Device" number in the --examine reports).  I still haven't
seen any data that definitively links specific serial numbers to
specific raid device numbers.  Please do that.
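
As a sketch of what that looks like in practice, the following prints
(but does not run) the commands to pair each drive's serial number with
its --examine report, and then the re-create command itself. The device
names are placeholders, and the create line is deliberately incomplete:
metadata version, chunk size, layout and data offset are not stated in
this thread and must be taken from the saved --examine output and added
by hand.

```shell
# Dry-run sketch: prints commands for review; nothing touches the disks.
# sdb..sde are placeholder names, not taken from the thread.
show_identity_cmds() {
    for dev in "$@"; do
        echo "smartctl -i /dev/$dev | grep -i 'serial number'"
        echo "mdadm --examine /dev/$dev"
    done
}

# Arguments: member devices, already sorted by their "Raid Device" number.
# Assumption: options such as --metadata, --chunk and --layout still have
# to be added so that they match the original --examine reports.
show_create_cmd() {
    echo "mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=$# $*"
}

show_identity_cmds sdb sdc sdd sde
show_create_cmd /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Saving the --examine output to a file before running any --create is
worthwhile, since the re-create overwrites the old superblocks.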

After re-creating the array, and setting all the drive timeouts to 7.0
seconds, issue a "check" scrub:

echo "check" >/sys/block/md0/md/sync_action

This should clean up the few pending sectors on disk #1 by
reconstruction from the others, and may very well do the same for disk #3.
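
A sketch of that sequence, again printing the commands rather than
running them. It assumes "drive timeouts" means the SCT ERC setting
(smartctl takes it in tenths of a second, so 70 is 7.0 seconds) and
that the drives support it. Device and array names are placeholders.

```shell
# Dry-run sketch: prints the commands for review instead of running them.
# md0 and sdb..sde are placeholder names.
show_scrub_cmds() {
    md=$1
    shift
    for dev in "$@"; do
        # 70 tenths of a second = 7.0 s for reads and for writes
        echo "smartctl -l scterc,70,70 /dev/$dev"
    done
    echo "echo check > /sys/block/$md/md/sync_action"
    # Progress of the scrub, and the mismatch count once it finishes
    echo "cat /proc/mdstat"
    echo "cat /sys/block/$md/md/mismatch_cnt"
}

show_scrub_cmds md0 sdb sdc sdd sde
```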

If disk #3 gets kicked out at this point, assemble in degraded mode with
disk #2, #4, and a fresh copy of disk #1 (picking up the new superblock
and any fixes during the partial scrub).  Then "--add" a spare (wiped)
disk and let the array rebuild.

And grab your data.

Phil.