Re: Should we be trying re-write on write errors?

From: Ric Wheeler <ricwheeler@gmail.com>
To: Greg Freemyer <greg.freemyer@gmail.com>
Cc: "Keld Jørn Simonsen" <keld@dkuug.dk>,
	"Neil Brown" <neilb@suse.de>,
	greg@enjellic.com, linux-raid@vger.kernel.org
Subject: Re: Should we be trying re-write on write errors?
Date: Sun, 16 Nov 2008 23:31:09 -0500
Message-ID: <4920F38D.8040808@gmail.com>
In-Reply-To: <87f94c370811141655g38558a82ue57938860a4df3d@mail.gmail.com>

Greg Freemyer wrote:
>> On Sat, Nov 15, 2008 at 08:58:46AM +1100, Neil Brown wrote:
>>     
>>> On Friday November 14, greg@enjellic.com wrote:
>>>       
>>>> Hi Neil, hope the week is ending well for you and the rest of the
>>>> denizens on the linux-raid list.
>>>>
>>>> Somewhat of a Gedanken question for you.
>>>>
>>>> We currently attempt a re-write on read error for volumes which have
>>>> redundancy, ie. RAID[156] etc, on the bet that we can force a bad
>>>> sector remap.  Should we be attempting that (or do we) on a write
>>>> error as well?
>>>>         
>>> I don't think so.
>>> By the time md/raid gets an error status, lower levels (whether driver
>>> or firmware) should have retried as much as is appropriate.  Doing
>>> further retries at the md level should be pointless.
>>>
>>> For reads, we do retry.  But the purpose is to find out exactly which
>>> block failed so that we can just re-write that block.  There is no
>>> expectation that a block which previously failed a read will now
>>> succeed.
>>>
>>> Similarly there is no reason to expect that a block which previously
>>> failed a write will now succeed.
>>>
>>> I suggest that you might like to discuss your particular case with the
>>> author of the driver for the device.  Maybe the driver should be
>>> retrying.  Maybe the firmware is doing the wrong thing.
>>>
>>> After all, you wouldn't expect every different filesystem to retry all
>>> failed writes, would you?
>>>
>>>
>>>       
>>>> BTW much thanks for the existing re-write code.  Countless mornings
>>>> I have said 'gee that Neil Brown was clever' when I see that one of
>>>> our machines cleaned up a potential problem before it became a bigger
>>>> one.
>>>>         
>>> :-)
>>> To be honest, that code was largely because people kept complaining
>>> about read errors being too fatal and wanted something done.  The only
>>> way to stop the flood of complaints was to fix something :-)
>>>
>>>       
>>>> Best wishes for a pleasant weekend.
>>>>         
>>> And for you!
>>>
>>> NeilBrown
>>>       
>
> <<Moved from the top post to a bottom post>>
>
> On Fri, Nov 14, 2008 at 7:47 PM, Keld Jørn Simonsen <keld@dkuug.dk> wrote:
>   
>> I would like to write something about this for the wiki.
>> What exactly is done, and is it general for all of linux md raid?
>>
>> best regards
>> keld
>>
>>     
>
> If you are going to document this in a wiki, please document when a
> write error can occur because I totally don't understand how this one
> occurred.
>
> I thought they could only occur:
>
> 1) With bad media on the platter and the reallocatable sectors section
> was already 100% utilized
>
> 2) Due to a CRC error on the comm path.  (flaky cable / power / etc.)
>
> As I read the below errors, neither of those occurred.  And as Neil
> said, I believe the retries related to CRC errors should be handled
> below the MD level.
>
> Greg
>   

Most of the common write errors you see should not be retried, but some
writes can fail due to transient conditions.

One such condition is vibration, for example when you wheel a rack
around in your data center or bang into the computer.

If you are using a SAN, you might also see transient link errors that
go away once the switch rights itself or someone plugs in a new
cable...

Ric
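
To make the transient-versus-permanent distinction above concrete, here
is a minimal sketch of a bounded retry that fires only for errors
classified as transient. It is purely illustrative and is not how md, any
driver, or any firmware behaves. The name write_with_retry, the
TRANSIENT_ERRNOS set, and the choice of errno values are all assumptions
made for the example. Which layer should own such a retry is the open
question in this thread.

```python
import errno
import time

# Hypothetical classification, chosen only for this example: errno values
# treated as "may clear up on their own" (e.g. a link that comes back).
TRANSIENT_ERRNOS = {errno.EAGAIN, errno.EBUSY, errno.ETIMEDOUT}


def write_with_retry(write_fn, data, attempts=3, delay=0.0):
    """Call write_fn(data), retrying only on errors deemed transient.

    write_fn is any callable that raises OSError on failure. Errors
    outside TRANSIENT_ERRNOS (such as EIO from bad media) are raised
    immediately, since retrying them is not expected to help.
    """
    for attempt in range(1, attempts + 1):
        try:
            return write_fn(data)
        except OSError as exc:
            # Give up on non-transient errors, or once attempts run out.
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise
            time.sleep(delay)
```

With this sketch, a writer that times out twice and then succeeds is
called three times in total, while a writer that fails with EIO is called
once and the error is passed straight up.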
