Re: Read errors on raid5 ignored, array still clean .. then disaster !!

linux-raid.vger.kernel.org archive mirror

From: Asdo <asdo@shiftmail.org>
To: Giovanni Tessore <giotex@texsoft.it>
Cc: linux-raid@vger.kernel.org, Neil Brown <neilb@suse.de>
Subject: Re: Read errors on raid5 ignored, array still clean .. then disaster !!
Date: Sun, 31 Jan 2010 15:31:22 +0100
Message-ID: <4B65943A.4040800@shiftmail.org>
In-Reply-To: <4B64A779.6070809@shiftmail.org>

Asdo wrote:
> Giovanni Tessore wrote:
>> Hm funny ... I just read now from md's man:
>>
>> "In kernels prior to about 2.6.15, a read error would cause the
>> same effect as a write error. In later kernels, a read-error will
>> instead cause md to attempt a recovery by overwriting the bad block.
>> .... "
>>
>> So things have changed since 2.6.15 ... I was not so wrong to expect 
>> "the old behaviour" and to be disappointed.
>> [CUT]
>
> I have the feeling the current behaviour is the correct one at least 
> for RAID-6.
>
> [CUT]
>
> This is with RAID-6.
>
> RAID-5 unfortunately is inherently insecure, here is why:
> If one drive gets kicked, MD starts recovering to a spare.
> At that point any single read error during the regeneration (which is 
> effectively a scrub) will fail the array.
> This is a problem that cannot be overcome in theory.
> Even with the old algorithm, any sector that went bad after the last 
> scrub will take the array down once one disk is kicked (the array will 
> go down during recovery).
> So you would need to scrub continuously, or you would need 
> hyper-reliable disks.
>
> Yes, kicking a drive as soon as it presents the first unreadable 
> sector can be a strategy for trying to select hyper-reliable disks...
>
> OK, after all I might agree that this can be a reasonable strategy for 
> raid1,4,5...
>
> I'd also agree that with the 1.x superblock it would be desirable to be 
> able to set the maximum number of corrected read errors before a drive 
> is kicked, which could default to 0 for raid 1,4,5 and to... 
> I don't know... 20 (50? 100?) for raid-6.
>
> Actually I believe drives should be kicked for exceeding this threshold 
> only AFTER the end of the scrub, so that they are still used for parity 
> computation until the scrub finishes. I would suggest checking this 
> threshold at the end of each scrub, not before, and during normal 
> array operation only if no scrub/resync is in progress (it will be 
> checked at the end anyway).
>
> Thank you

I would add that this situation with raid 1,4,5,10 would be greatly 
ameliorated once the hot-device-replace feature is implemented.
The failures of raid 1,4,5,10 are due to the zero redundancy you have in 
the time window from when a drive is kicked to the end of the regeneration.
However, if the hot-device-replace feature is added and linked to the 
drive-kicking process, the problem would largely disappear.
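As a rough illustration of why the zero-redundancy window matters, here is a back-of-the-envelope sketch estimating the chance that a classic fail-then-rebuild hits at least one unrecoverable read error (URE). It assumes independent UREs at a constant per-bit rate; the 1e-14 rate and the 4 x 1 TB array are illustrative assumptions, not figures from this thread, and `p_rebuild_hits_ure` is a hypothetical helper.

```python
import math


def p_rebuild_hits_ure(surviving_disks: int, disk_bytes: float, ure_per_bit: float) -> float:
    """Probability that a fail-then-rebuild reads at least one URE.

    With zero redundancy left, every sector of every surviving disk must be
    read successfully, so a single URE anywhere fails the rebuild.
    Assumes independent UREs with a constant per-bit probability.
    """
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, computed stably via log1p/expm1 for tiny p and huge n
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Assumed example: 4-disk RAID-5 of 1 TB drives, one drive kicked, 3 survive.
p = p_rebuild_hits_ure(surviving_disks=3, disk_bytes=1e12, ure_per_bit=1e-14)
print(f"{p:.2f}")
```

Under those assumptions it prints 0.21, roughly a one-in-five chance that the rebuild runs into a bad sector. With hot-device-replace the same URE would be recoverable from parity, so losing data would require two unreadable sectors in the same stripe (or the source drive dying outright).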

Ideally, instead of kicking (= failing) a drive directly, the 
hot-device-replace feature would be triggered, so the new drive would be 
replicated from the one being kicked. A few damaged blocks can be 
reconstructed from parity when reading them from the disk being replaced 
fails, but the drive should not be "failed" during the replace process 
just for this. In this way you keep 1 level of redundancy instead of zero 
during the rebuild, and the chances of the array going down during the 
rebuild process are practically nullified.
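A toy model of this replace-instead-of-fail idea, assuming single-parity RAID-5 where the parity block is the XOR of the data blocks in its stripe. Disks are plain Python lists of blocks, with `None` standing for an unreadable sector; `hot_replace` and `xor_blocks` are hypothetical names, and this is not how md actually lays out stripes or rotates parity.

```python
from functools import reduce
from typing import List, Optional

Block = bytes
Disk = List[Optional[Block]]  # None models a sector that returns a read error


def xor_blocks(blocks: List[Block]) -> Block:
    """Byte-wise XOR. In RAID-5 any member block is the XOR of all the others."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def hot_replace(disks: List[Disk], source: int) -> Disk:
    """Copy disks[source] onto a new disk while the source stays in the array.

    Each block is read from the source itself; only when that read fails is
    the block rebuilt from the other members (data + parity). The source is
    not failed for such errors, so the array keeps one level of redundancy
    for the whole copy.
    """
    target: Disk = []
    for stripe, block in enumerate(disks[source]):
        if block is None:
            others = [d[stripe] for i, d in enumerate(disks) if i != source]
            readable = [b for b in others if b is not None]
            if len(readable) != len(others):
                # Two unreadable members in one stripe: RAID-5 cannot recover it.
                raise OSError(f"stripe {stripe}: unreadable on two members")
            block = xor_blocks(readable)
        target.append(block)
    return target


# Usage: a 3-disk RAID-5 with two stripes; the source has one bad sector.
d0 = [b"\x01\x02", b"\x10\x20"]
d1 = [b"\x04\x08", b"\x40\x80"]
parity = [xor_blocks([a, b]) for a, b in zip(d0, d1)]
failing_d0 = [d0[0], None]
assert hot_replace([failing_d0, d1, parity], source=0) == d0
```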

I think the "hot-device-replace" action can replace the "fail" action in 
the most common scenarios, namely a drive being kicked due to:
1 - an unrecoverable read error (no spare sectors left for relocation)
2 - surpassing the threshold for max corrected read errors (see above, 
if/when this gets implemented on the 1.x superblock)
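The two triggers above, together with the suggestion quoted earlier to evaluate the threshold only when no scrub or resync is running, can be sketched as a small decision function. Everything here (`MemberState`, `decide`, `max_corrected`) is hypothetical and models the proposal in this thread, not an existing md interface.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    HOT_REPLACE = "hot-device-replace"


@dataclass
class MemberState:
    corrected_read_errors: int = 0
    # Set when a bad block could not be rewritten, e.g. the drive has no
    # spare sectors left to relocate it to (case 1 above).
    unrecoverable_error: bool = False


def decide(member: MemberState, max_corrected: int, scrub_in_progress: bool) -> Action:
    """Hypothetical policy: trigger a hot replace where md would fail the drive."""
    if member.unrecoverable_error:
        # Case 1: start the replace immediately.
        return Action.HOT_REPLACE
    if member.corrected_read_errors > max_corrected and not scrub_in_progress:
        # Case 2: threshold exceeded. The check is deferred while a scrub or
        # resync runs, so the drive keeps contributing to parity until it ends.
        return Action.HOT_REPLACE
    return Action.KEEP
```

With `max_corrected=0` (the default suggested above for raid 1,4,5) the first corrected error triggers a replace as soon as no scrub is running; a RAID-6 array could use a larger value such as 20.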

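#2 is trivially feasible: when the threshold trips, every read error so 
far has been corrected and the drive is still working, so it can serve 
as the source for the replacement.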

#1 is more difficult (and it would be unnecessary to implement if the 
threshold for max corrected read errors gets implemented, because such a 
threshold would trigger before the first unrecoverable read error 
happens), but I think it's still feasible. The algorithm would be: you 
don't kick the drive; you ignore the write error on the bad disk (the 
correct data for that block can still be reconstructed from parity). Then 
you immediately trigger the hot-device-replace. When the replace process 
reaches the damaged sector on the bad disk, that sector will be 
unreadable (hopefully it will not return the old data), but its data can 
be reconstructed from parity, so the regeneration process can continue. 
So it should work, I think.

One case where you cannot replace "fail" with "hot-device-replace" is 
when a disk dies suddenly (e.g. its electronics die). Maybe 
"hot-device-replace" could still be triggered first, but if the bad 
drive then turns out to be completely unresponsive (a timeout? a number 
of commands without response?), you fall back to "fail".
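The fallback just described can be sketched as a small watchdog on the drive being replaced. Counting consecutive command timeouts is one of the two options floated above; the class name, the `max_consecutive_timeouts` parameter and its default of 3 are hypothetical. A read error still counts as an answer here: it is handled through parity by the replace and does not push the drive toward "fail".

```python
from enum import Enum


class State(Enum):
    REPLACING = "replacing"
    FAILED = "failed"


class SourceMonitor:
    """Hypothetical watchdog for the drive being hot-replaced.

    The replace is tried first; if the drive stops answering (here: a run of
    consecutive command timeouts), fall back to the classic "fail" action.
    """

    def __init__(self, max_consecutive_timeouts: int = 3) -> None:
        self.max_consecutive_timeouts = max_consecutive_timeouts
        self.consecutive_timeouts = 0
        self.state = State.REPLACING

    def on_command_result(self, timed_out: bool) -> State:
        if self.state is State.FAILED:
            return self.state  # "fail" is final
        if timed_out:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= self.max_consecutive_timeouts:
                self.state = State.FAILED
        else:
            # Any answer, even a read error, shows the drive is still alive.
            self.consecutive_timeouts = 0
        return self.state
```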

Thank you
Asdo
