From: Ric Wheeler <ric@emc.com>
To: NeilBrown <neilb@suse.de>
Cc: Theodore Tso <tytso@MIT.EDU>, Andre Noll <maan@systemlinux.org>,
Bas van Schaik <bas@tuxes.nl>,
linux-raid@vger.kernel.org
Subject: Re: Redundancy check using "echo check > sync_action": error reporting?
Date: Fri, 21 Mar 2008 16:45:39 -0400
Message-ID: <47E41E73.1000702@emc.com>
In-Reply-To: <42039.192.168.1.70.1206130781.squirrel@neil.brown.name>
NeilBrown wrote:
> On Fri, March 21, 2008 5:02 am, Theodore Tso wrote:
>> On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote:
>>> On 12:35, Theodore Tso wrote:
>>>
>>>> If a mismatch is detected in a RAID-6 configuration, it should be
>>>> possible to figure out what should be fixed
>>> It can be figured out under the assumption that exactly one drive has
>>> bad data and all other ones have good data. But that seems to be an
>>> assumption that is hard to verify in reality.
>> True, but it's what ECC memory does. :-) And most people agree that
>> it's a useful thing to do with memory.
>>
>> If you do ECC syndrome checking on every read, and follow that up with
>> periodic scrubbing so that you catch (and correct) errors quickly, it
>> is a reasonable assumption to make.
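To make the single-bad-block case concrete, here is a minimal sketch in Python. It is not the md implementation. It assumes the usual RAID-6 construction: P is the XOR of the data blocks and Q is the sum of g^i * D_i over GF(2^8), with polynomial 0x11d and g = 2. The names `syndromes`, `locate_single_error` and `repair_from_p` are made up for illustration.

```python
# GF(2^8) log/antilog tables for polynomial 0x11d, generator g = 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data):
    """Return (P, Q) for a list of equal-length data blocks."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data):
        coeff = EXP[i]  # g**i
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def locate_single_error(data, p, q):
    """Classify a stripe, ASSUMING at most one block is bad.

    Returns None (consistent), 'P', 'Q', a data-block index, or 'unknown'.
    """
    p2, q2 = syndromes(data)
    dp = bytes(a ^ b for a, b in zip(p, p2))
    dq = bytes(a ^ b for a, b in zip(q, q2))
    if not any(dp) and not any(dq):
        return None
    if not any(dp):
        return "Q"  # only Q disagrees: the Q block itself is suspect
    if not any(dq):
        return "P"  # only P disagrees: the P block itself is suspect
    # One bad data block z implies dq[j] == g**z * dp[j] for every byte j.
    candidates = set()
    for a, b in zip(dp, dq):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            return "unknown"
        candidates.add((LOG[b] - LOG[a]) % 255)
    if len(candidates) == 1:
        z = candidates.pop()
        if z < len(data):
            return z
    return "unknown"


def repair_from_p(data, p, z):
    """Rebuild data block z from P and the remaining data blocks."""
    fixed = bytearray(p)
    for i, block in enumerate(data):
        if i != z:
            for j, byte in enumerate(block):
                fixed[j] ^= byte
    return bytes(fixed)
```

If two blocks are corrupted, the same arithmetic can return 'unknown' or even point at an innocent block, which is the weakness of the single-bad-drive assumption raised above.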
>
> My problem with this is that I don't have a good model for what might
> cause the error, so I cannot reason about what responses are justifiable.
>
> The analogy with ECC memory is, I think, poor. With ECC memory there are
> electro/physical processes which can cause a bit to change independently
> of any other bit with very low probability, so treating an ECC error as
> a single bit error is reasonable.
>
> The analogy with a disk drive would be a media error. However disk drives
> record CRC (or similar) checks so that media errors get reported as errors,
> not as incorrect data. So the analogy doesn't hold.
The challenge arises only when the IO completes without reporting an error.
Bad hardware anywhere off the platter can corrupt data silently.
For that case, Martin's presentation on DIF points to a way forward: it
would give us something a check could leverage on a per-sector basis for
software RAID.
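As a rough illustration of what DIF gives you per sector, here is a sketch in Python. It assumes the T10 DIF Type 1 layout: 8 bytes of protection information per 512-byte sector, made up of a 16-bit guard tag (CRC-16 with polynomial 0x8BB7), a 16-bit application tag, and a 32-bit reference tag holding the low 32 bits of the LBA. The helper names `make_pi` and `verify_pi` are hypothetical and not part of any kernel interface.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_pi(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte protection tuple: guard, application tag, reference tag."""
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify_pi(sector: bytes, pi: bytes, lba: int) -> str:
    """Check a sector read from `lba` against its stored protection tuple."""
    guard, _app_tag, ref = struct.unpack(">HHI", pi)
    if guard != crc16_t10dif(sector):
        return "guard mismatch"      # the data changed somewhere in the path
    if ref != (lba & 0xFFFFFFFF):
        return "reference mismatch"  # the data belongs to a different LBA
    return "ok"
```

The guard tag catches corruption of the sector contents, and the reference tag catches a sector that was written to, or read from, the wrong place. Neither of those shows up as an IO error today.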
>
> Where else could the error come from? Presumably a bit-flip on some
> transfer bus between main memory and the media. There are several
> of these busses (mem to controller, controller to device, internal to
> device). The corruption could happen on the write or on the read.
> When you write to a RAID6 you often write several blocks to different
> devices at the same time. Are these really likely to be independent
> events wrt whatever is causing the corruption?
>
> I don't know. But without a clear model, it isn't clear to me that
> any particular action will be certain to improve the situation in
> all cases.
It can come from a lot of things (see the recent papers from FAST and
NetApp for example).
>
> And how often does silent corruption happen on modern hard drives?
> How often do you write something and later successfully read something
> else when it isn't due to a major hardware problem that is causing
> much more than just occasional errors?
>
> The ZFS people seem to say that their checksumming of all data shows
> up a lot of these cases. If that is true, how come people who
> don't use ZFS aren't reporting lots of data corruption?
>
> So yes: there are lots of things that *could* be done. But without
> a model for the "threat", an analysis of how the remedy would actually
> affect every different possible scenario, and some idea of the
> probability of the remedy being needed, it is very hard to
> justify a change of this sort.
> And there are plenty of other things to be coded that are genuinely
> useful - like converting a RAID5 to a RAID6 while online...
>
> NeilBrown
I really think that we might be able to leverage the DIF standard if and
when it rolls out.
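Here is a sketch of how a check pass might use per-block integrity tags, in Python. This is an assumption about how it could be wired up, not how md works. It uses a single XOR parity block for brevity, `zlib.crc32` stands in for the DIF guard tag, and `scrub_stripe` is a made-up name.

```python
import zlib


def xor_blocks(blocks):
    """XOR a list of equal-length blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for j, byte in enumerate(block):
            out[j] ^= byte
    return bytes(out)


def scrub_stripe(blocks, tags):
    """Check one stripe: data blocks followed by their XOR parity block.

    `tags` holds the checksum stored alongside each block when it was written.
    Returns (status, bad_index, blocks).
    """
    blocks = list(blocks)
    if not any(xor_blocks(blocks)):
        return "clean", None, blocks
    # Parity mismatch: ask the tags which block is wrong instead of guessing.
    bad = [i for i, (b, t) in enumerate(zip(blocks, tags)) if zlib.crc32(b) != t]
    if len(bad) != 1:
        return "unrecoverable", None, blocks
    i = bad[0]
    rebuilt = xor_blocks(blocks[:i] + blocks[i + 1:])
    if zlib.crc32(rebuilt) != tags[i]:
        return "unrecoverable", None, blocks
    blocks[i] = rebuilt
    return "repaired", i, blocks
```

The point of the sketch is that the tag, not an assumption about which drive is at fault, decides which block gets rewritten. When zero blocks or more than one block fail their tag, the stripe is reported rather than "fixed".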
ric