From: Carlos Knowlton <cknowlton@science.edu>
To: Michael Tokarev <mjt@tls.msk.ru>
Cc: linux-raid@vger.kernel.org
Subject: Re: Is there a drive error "retry" parameter?
Date: Wed, 15 Jun 2005 16:40:52 -0500
Message-ID: <42B0A064.20406@science.edu>
In-Reply-To: <42AF5E2B.3010908@tls.msk.ru>
Hello Michael,
Michael Tokarev wrote:
...
> (For completeness: there's another reallocation feature supported
> by most drives - write-error relocation, when a drive relocates
> a bad block on *write* error, because it knows which data should be
> there. A block that was unreadable may become good again after
> re-write, either "just because", after refreshing its pieces,
> it is now in cleaner state, or because the write-error relocation
> mechanism in the drive did its work. That's why re-writing
> a drive with bad blocks often results in a good drive, and often
> that good state persists; it's more or less normal for a drive
> to develop one or two bad blocks during its lifetime and reallocate
> them.)
Thanks! This is useful info.
I did some googling on sector relocation, and it appears that SpinRite
6.0 (per its features page <http://www.grc.com/srfeatures.htm>)
claims to be able to turn off sector relocation, re-read and analyze
the "bad" sector in different ways until it gets a good read (or
deduces the correct data from the statistical outcome of multiple
failed reads), then turn relocation back on and map around the sector.
Is there any reason this couldn't be done in the block device driver
(or some other, more appropriate layer)? It seems that this kind of
transparent data recovery would be a real plus! Do you know if any
thought has gone into this kind of thing?
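To make the "statistical outcome of multiple failed reads" idea
concrete, here is a rough user-space sketch in Python. It is only my
guess at the general technique (a per-byte majority vote), not
SpinRite's actual method. It also assumes each failed read still hands
back a full-length buffer of possibly wrong data, which an ordinary
block-layer read does not do; the function name vote_sector is made up
for illustration.

```python
from collections import Counter


def vote_sector(reads):
    """Combine several equal-length reads of one sector.

    Each offset of the result holds the byte value that appeared most
    often at that offset across all reads (simple majority vote).
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("all reads must have the same length")

    out = bytearray(length)
    for i in range(length):
        # Count the byte values seen at offset i and keep the most common.
        out[i] = Counter(r[i] for r in reads).most_common(1)[0][0]
    return bytes(out)


if __name__ == "__main__":
    # Three noisy reads of the same hypothetical 4-byte "sector".
    print(vote_sector([b"ABCD", b"ABXD", b"YBCD"]))
```

With an odd number of reads and independent errors, a byte is only
wrong in the result if more than half of the reads got it wrong at
that offset.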
>
>>>> Is there a "retry" parameter that can be set in the kernel parameters,
>>>> or else in the code itself to prolong the existence of a drive in an
>>>> array before it is considered dirty?
>>>
>>>
>>> There's no such parameter currently. But there were several discussions
>>> about how to make raid code more robust - in particular, in case of
>>> read error, raid code may keep the errored drive in the array and mark
>>> it dirty only in case of write error.
>>>
>> That would be nice. Do you know if anyone has done any work toward
>> such a fix?
>
>
> Looks like this is a "FAQ #1" candidate for linux softraid ;)
> I tried to do just that myself, with help from Peter T. Breuer.
> The code even worked here on a test machine for some time.
> But it's umm.. quite a bit ugly, and Neil is going in a slightly
> different direction (which I for one don't like much - the
> persistent bitmaps stuff, -- I think a simpler approach is better).
Is that the journal stuff mentioned here
<http://lwn.net/2002/0523/a/jbd-md.php3> between Neil and Stephen
Tweedie? What is the status of it? (A complex approach to a solution
is better than nothing, as long as it solves the problem, right?)
> If memory serves me right, you mentioned *several* drives going off
> all at once. This is not a bad sector on one drive, it's something
> else like bad cabling or power supplies, whatever.
I've looked into cable and power issues, and if they are the culprit,
the problem is terribly intermittent, and my setup is generally within
spec. (On some servers we have mounted two drives on a single 40-pin
ATA cable, but we've rarely seen two drives that share a cable fail
together.) After a reboot, the drives that had these errors are happily
restored back into the array as if nothing had happened. If these issues
can occur with a standard setup, that is all the more reason to want RAID
to be a little more lenient about the isolated read error.
I've been looking into the IDE code to see if I can get it to give me a
few more read retries before declaring a read error. The "ERROR_MAX"
constant in ".../linux-x.x.x/include/linux/ide.h" looks like it might
afford me some extra time. Is there a better place to find this kind of
relief?
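What I am after amounts to a bounded retry loop around the read. A
minimal Python sketch of that behaviour follows. It is a user-space
illustration only, not the IDE driver's logic; read_block,
read_with_retries and the default of 4 retries are all hypothetical
names and values I picked for the example.

```python
import time


def read_with_retries(read_block, lba, max_retries=4, delay=0.0):
    """Call read_block(lba), retrying on OSError.

    read_block is any callable that returns the block's data or raises
    OSError on a read error. Up to max_retries additional attempts are
    made after the first failure; the last error is re-raised if they
    all fail.
    """
    attempt = 0
    while True:
        try:
            return read_block(lba)
        except OSError:
            if attempt >= max_retries:
                raise  # out of retries: report the read error
            attempt += 1
            time.sleep(delay)  # optional pause before trying again
```

Only after the retries are exhausted would the error be passed up to
the RAID layer, which is the extra leniency I am hoping for.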
> Speaking of drives and bad sectors -- see above. On SCSI drives
> there's a way to see all the relocations (scsiinfo utility for
> example).
Is there anything similar to this for S-ATA or P-ATA drives?
> And yes indeed, it'd be nice to keep the drive in the array in case
> of read error, and only kick it off on write errors - huge step in
> the right direction.
I appreciate your effort toward this end. Thanks again for your help!
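For the record, here is how I picture the policy we have been
discussing, as a toy Python model. This is not how the md code works
today, and every name in it (Member, mirror_read) is invented for the
sketch. On a read error the member stays in the array, the data is
fetched from another mirror and written back to the member that failed
the read, and a member is only marked failed if that write fails.

```python
class Member:
    """Hypothetical in-memory stand-in for one RAID1 member disk."""

    def __init__(self, blocks, bad_reads=(), bad_writes=()):
        self.blocks = dict(blocks)
        self.bad_reads = set(bad_reads)    # LBAs that fail on read
        self.bad_writes = set(bad_writes)  # LBAs that fail on write
        self.failed = False

    def read(self, lba):
        if lba in self.bad_reads:
            raise OSError("read error")
        return self.blocks[lba]

    def write(self, lba, data):
        if lba in self.bad_writes:
            raise OSError("write error")
        self.blocks[lba] = data
        # Model assumption: a successful rewrite makes the block readable.
        self.bad_reads.discard(lba)


def mirror_read(members, lba):
    """Read lba from a mirror set; fail a member only on a write error."""
    errored = []
    for m in members:
        if m.failed:
            continue
        try:
            data = m.read(lba)
        except OSError:
            errored.append(m)  # remember it, but do not kick it yet
            continue
        # Good data found: try to repair the members that failed the read.
        for bad in errored:
            try:
                bad.write(lba, data)
            except OSError:
                bad.failed = True  # write error: now drop it from the array
        return data
    raise OSError("no member could read block %d" % lba)
```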
Regards,
Carlos