* Re: raid6 check/repair
From: Thiemo Nagel @ 2007-11-21 13:25 UTC
To: neilb; +Cc: linux-raid
Dear Neil,
>> I have been looking a bit at the check/repair functionality in the
>> raid6 personality.
>>
>> It seems that if an inconsistent stripe is found during repair, md
>> does not try to determine which block is corrupt (using e.g. the
>> method in section 4 of HPA's raid6 paper), but just recomputes the
>> parity blocks - i.e. the same way as inconsistent raid5 stripes are
>> handled.
>>
>> Correct?
>
> Correct!
>
> The most likely cause of parity being incorrect is if a write to
> data + P + Q was interrupted when one or two of those had been
> written, but the other had not.
>
> No matter which was or was not written, correcting P and Q will produce
> a 'correct' result, and it is simple. I really don't see any
> justification for being more clever.
My opinion about that is quite different. Speaking just for myself:
a) When I put my data on a RAID running on Linux, I'd expect the
software to do everything which is possible to protect and when
necessary to restore data integrity. (This expectation was one of the
reasons why I chose software RAID with Linux.)
b) As a consequence of a): When I'm using a RAID level that has extra
redundancy, I'd expect Linux to make use of that extra redundancy during
a 'repair'. (Otherwise I'd consider repair a misnomer and rather call
it 'recalc parity'.)
c) Why should 'repair' be implemented in a way that only works in most
cases when there exists a solution that works in all cases? (After all,
possibilities for corruption are many, e.g. bad RAM, bad cables, chipset
bugs, driver bugs, last but not least human mistake. From all these
errors I'd like to be able to recover gracefully without putting the
array at risk by removing and readding a component device.)
Bottom line: So far I was talking about *my* expectations; is it
reasonable to assume that they are shared by others? Are there any
arguments I'm not aware of that speak against an improved
implementation of 'repair'?
BTW: I just checked, it's the same for RAID 1: When I intentionally
corrupt a sector in the first device of a set of 16, 'repair' copies the
corrupted data to the 15 remaining devices instead of restoring the
correct sector from one of the other fifteen devices to the first.
Thank you for your time.
Kind regards,
Thiemo Nagel
* Re: raid6 check/repair
From: Neil Brown @ 2007-11-22 3:55 UTC
To: thiemo.nagel; +Cc: linux-raid
On Wednesday November 21, thiemo.nagel@ph.tum.de wrote:
> Dear Neil,
>
> >> I have been looking a bit at the check/repair functionality in the
> >> raid6 personality.
> >>
> >> It seems that if an inconsistent stripe is found during repair, md
> >> does not try to determine which block is corrupt (using e.g. the
> >> method in section 4 of HPA's raid6 paper), but just recomputes the
> >> parity blocks - i.e. the same way as inconsistent raid5 stripes are
> >> handled.
> >>
> >> Correct?
> >
> > Correct!
> >
> > The most likely cause of parity being incorrect is if a write to
> > data + P + Q was interrupted when one or two of those had been
> > written, but the other had not.
> >
> > No matter which was or was not written, correcting P and Q will produce
> > a 'correct' result, and it is simple. I really don't see any
> > justification for being more clever.
>
> My opinion about that is quite different. Speaking just for myself:
>
> a) When I put my data on a RAID running on Linux, I'd expect the
> software to do everything which is possible to protect and when
> necessary to restore data integrity. (This expectation was one of the
> reasons why I chose software RAID with Linux.)
Yes, of course. "possible" is an important aspect of this.
>
> b) As a consequence of a): When I'm using a RAID level that has extra
> redundancy, I'd expect Linux to make use of that extra redundancy during
> a 'repair'. (Otherwise I'd consider repair a misnomer and rather call
> it 'recalc parity'.)
The extra redundancy in RAID6 is there to enable you to survive two
drive failures. Nothing more.
While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 datablocks is
wrong, it is *not* possible to deduce which block or blocks are wrong
if it is possible that more than 1 data block is wrong.
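(For reference, the deduction in section 4 of HPA's raid6 paper rests
on the syndrome differences: recomputing the parities P' and Q' from the
data blocks, a single corrupted data block Dz carrying an error value E
shows up as
  P xor P' = E
  Q xor Q' = g^z * E
with the arithmetic done in GF(2^8) and g the generator, so
z = log_g((Q xor Q') / (P xor P')) names the bad block and xoring E back
into Dz repairs it. The notation P', Q', E, g and z here follows the
paper; the relation only identifies the right block under the assumption
of exactly one bad data block.)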
As it is quite possible for a write to be aborted in the middle
(during unexpected power down) with an unknown number of blocks in a
given stripe updated but others not, we do not know how many blocks
might be "wrong" so we cannot try to recover some wrong block. Doing
so would quite possibly corrupt a block that is not wrong.
The "repair" process "repairs" the parity (redundancy information).
It does not repair the data. It cannot.
The only possible scenario that md/raid recognises for the parity
information being wrong is the case of an unexpected shutdown in the
middle of a stripe write, where some blocks have been written and some
have not.
Further (for raid 4/5/6), it only supports this case when your array
is not degraded. If you have a degraded array, then an unexpected
shutdown is potentially fatal to your data (the chances of it actually
being fatal are quite small, but the potential is still there).
There is nothing RAID can do about this. It is not designed to
protect against power failure. It is designed to protect against drive
failure. It does that quite well.
If you have wrong data appearing on your device for some other reason,
then you have a serious hardware problem and RAID cannot help you.
The best approach to dealing with data on drives getting spontaneously
corrupted is for the filesystem to perform strong checksums on the
data block, and store the checksums in the indexing information. This
provides detection, not recovery of course.
>
> c) Why should 'repair' be implemented in a way that only works in most
> cases when there exists a solution that works in all cases? (After all,
> possibilities for corruption are many, e.g. bad RAM, bad cables, chipset
> bugs, driver bugs, last but not least human mistake. From all these
> errors I'd like to be able to recover gracefully without putting the
> array at risk by removing and readding a component device.)
As I said above - there is no solution that works in all cases. If
more than one block is corrupt, and you don't know which ones, then
you lose and there is no way around that.
RAID is not designed to protect against bad RAM, bad cables, chipset
bugs, driver bugs, etc. It is only designed to protect against drive
failure, where the drive failure is apparent. i.e. a read must return
either the same data that was last written, or a failure indication.
Anything else is beyond the design parameters for RAID.
It might be possible to design a data storage system that was
resilient to these sorts of errors. It would be much more
sophisticated than RAID though.
NeilBrown
>
> Bottom line: So far I was talking about *my* expectations; is it
> reasonable to assume that they are shared by others? Are there any
> arguments I'm not aware of that speak against an improved
> implementation of 'repair'?
>
> BTW: I just checked, it's the same for RAID 1: When I intentionally
> corrupt a sector in the first device of a set of 16, 'repair' copies the
> corrupted data to the 15 remaining devices instead of restoring the
> correct sector from one of the other fifteen devices to the first.
>
> Thank you for your time.
>
* Re: raid6 check/repair
From: Thiemo Nagel @ 2007-11-22 16:51 UTC
To: Neil Brown; +Cc: linux-raid
Dear Neil,
thank you very much for your detailed answer.
Neil Brown wrote:
> While it is possible to use the RAID6 P+Q information to deduce which
> data block is wrong if it is known that either 0 or 1 datablocks is
> wrong, it is *not* possible to deduce which block or blocks are wrong
> if it is possible that more than 1 data block is wrong.
If I'm not mistaken, this is only partly correct. Using P+Q redundancy,
it *is* possible, to distinguish three cases:
a) exactly zero bad blocks
b) exactly one bad block
c) more than one bad block
Of course, it is only possible to recover from b), but one *can* tell,
whether the situation is a) or b) or c) and act accordingly.
> As it is quite possible for a write to be aborted in the middle
> (during unexpected power down) with an unknown number of blocks in a
> given stripe updated but others not, we do not know how many blocks
> might be "wrong" so we cannot try to recover some wrong block.
As already mentioned, in my opinion, one can distinguish between 0, 1
and >1 bad blocks, and that is sufficient.
> Doing so would quite possibly corrupt a block that is not wrong.
I don't think additional corruption could be introduced, since recovery
would only be done for the case of exactly one bad block.
>
> [...]
>
> As I said above - there is no solution that works in all cases.
I fully agree. When more than one block is corrupted, and you don't
know which are the corrupted blocks, you're lost.
> If more than one block is corrupt, and you don't know which ones,
> then you lose and there is no way around that.
Sure.
The point that I'm trying to make is, that there does exist a specific
case, in which recovery is possible, and that implementing recovery for
that case will not hurt in any way.
> RAID is not designed to protect against bad RAM, bad cables, chipset
> bugs, driver bugs, etc. It is only designed to protect against drive
> failure, where the drive failure is apparent. i.e. a read must
> return either the same data that was last written, or a failure
> indication. Anything else is beyond the design parameters for RAID.
I'm taking a more pragmatic approach here. In my opinion, RAID should
"just protect my data", against drive failure, yes, of course, but if it
can help me in case of occasional data corruption, I'd happily take
that, too, especially if it doesn't cost extra... ;-)
Kind regards,
Thiemo
* Re: raid6 check/repair
From: Bill Davidsen @ 2007-11-27 5:08 UTC
To: thiemo.nagel; +Cc: Neil Brown, linux-raid
Thiemo Nagel wrote:
> Dear Neil,
>
> thank you very much for your detailed answer.
>
> Neil Brown wrote:
>> While it is possible to use the RAID6 P+Q information to deduce which
>> data block is wrong if it is known that either 0 or 1 datablocks is
>> wrong, it is *not* possible to deduce which block or blocks are wrong
>> if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct. Using P+Q redundancy,
> it *is* possible, to distinguish three cases:
> a) exactly zero bad blocks
> b) exactly one bad block
> c) more than one bad block
>
> Of course, it is only possible to recover from b), but one *can* tell,
> whether the situation is a) or b) or c) and act accordingly.
I was waiting for a response before saying "me too," but that's exactly
the case: there is a class of failures, other than power failure or total
device failure, which result in just the "one identifiable bad sector"
result. Given that the data needs to be read to realize that it is bad,
why not go the extra inch and fix it properly instead of redoing the P+Q,
which just makes the problem invisible rather than fixing it?
Obviously this is a subset of all the things which can go wrong, but I
suspect it's a sizable subset.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: raid6 check/repair
From: Neil Brown @ 2007-11-29 6:04 UTC
To: Bill Davidsen; +Cc: thiemo.nagel, linux-raid
On Tuesday November 27, davidsen@tmr.com wrote:
> Thiemo Nagel wrote:
> > Dear Neil,
> >
> > thank you very much for your detailed answer.
> >
> > Neil Brown wrote:
> >> While it is possible to use the RAID6 P+Q information to deduce which
> >> data block is wrong if it is known that either 0 or 1 datablocks is
> >> wrong, it is *not* possible to deduce which block or blocks are wrong
> >> if it is possible that more than 1 data block is wrong.
> >
> > If I'm not mistaken, this is only partly correct. Using P+Q redundancy,
> > it *is* possible, to distinguish three cases:
> > a) exactly zero bad blocks
> > b) exactly one bad block
> > c) more than one bad block
> >
> > Of course, it is only possible to recover from b), but one *can* tell,
> > whether the situation is a) or b) or c) and act accordingly.
> I was waiting for a response before saying "me too," but that's exactly
> the case: there is a class of failures, other than power failure or total
> device failure, which result in just the "one identifiable bad sector"
> result. Given that the data needs to be read to realize that it is bad,
> why not go the extra inch and fix it properly instead of redoing the P+Q,
> which just makes the problem invisible rather than fixing it?
>
> Obviously this is a subset of all the things which can go wrong, but I
> suspect it's a sizable subset.
Why do you think that it is a sizable subset? Disk drives have internal
checksums which are designed to prevent corrupted data being returned.
If the data is getting corrupted on some bus between the CPU and the
media, then I suspect that your problem is big enough that RAID cannot
meaningfully solve it, and "New hardware plus possibly restore from
backup" would be the only credible option.
NeilBrown
* Re: raid6 check/repair
From: Neil Brown @ 2007-11-29 6:01 UTC
To: thiemo.nagel; +Cc: linux-raid
On Thursday November 22, thiemo.nagel@ph.tum.de wrote:
> Dear Neil,
>
> thank you very much for your detailed answer.
>
> Neil Brown wrote:
> > While it is possible to use the RAID6 P+Q information to deduce which
> > data block is wrong if it is known that either 0 or 1 datablocks is
> > wrong, it is *not* possible to deduce which block or blocks are wrong
> > if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct. Using P+Q redundancy,
> it *is* possible, to distinguish three cases:
> a) exactly zero bad blocks
> b) exactly one bad block
> c) more than one bad block
>
> Of course, it is only possible to recover from b), but one *can* tell,
> whether the situation is a) or b) or c) and act accordingly.
It would seem that either you or Peter Anvin is mistaken.
On page 9 of
http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
at the end of section 4 it says:
Finally, as a word of caution it should be noted that RAID-6 by
itself cannot even detect, never mind recover from, dual-disk
corruption. If two disks are corrupt in the same byte positions,
the above algorithm will in general introduce additional data
corruption by corrupting a third drive.
>
> The point that I'm trying to make is, that there does exist a specific
> case, in which recovery is possible, and that implementing recovery for
> that case will not hurt in any way.
Assuming that is true (maybe hpa got it wrong), what specific
conditions would lead to one drive having corrupt data, and would
correcting it on an occasional 'repair' pass be an appropriate
response?
Does the value justify the cost of extra code complexity?
>
> > RAID is not designed to protect again bad RAM, bad cables, chipset
> > bugs drivers bugs etc. It is only designed to protect against drive
> > failure, where the drive failure is apparent. i.e. a read must
> > return either the same data that was last written, or a failure
> > indication. Anything else is beyond the design parameters for RAID.
>
> I'm taking a more pragmatic approach here. In my opinion, RAID should
> "just protect my data", against drive failure, yes, of course, but if it
> can help me in case of occasional data corruption, I'd happily take
> that, too, especially if it doesn't cost extra... ;-)
Everything costs extra. Code uses bytes of memory, requires
maintenance, and possibly introduces new bugs. I'm not convinced the
failure mode that you are considering actually happens with a
meaningful frequency.
NeilBrown
* Re: raid6 check/repair
From: Bill Davidsen @ 2007-11-29 19:30 UTC
To: Neil Brown; +Cc: thiemo.nagel, linux-raid
Neil Brown wrote:
> On Thursday November 22, thiemo.nagel@ph.tum.de wrote:
>
>> Dear Neil,
>>
>> thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>
>>> While it is possible to use the RAID6 P+Q information to deduce which
>>> data block is wrong if it is known that either 0 or 1 datablocks is
>>> wrong, it is *not* possible to deduce which block or blocks are wrong
>>> if it is possible that more than 1 data block is wrong.
>>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy,
>> it *is* possible, to distinguish three cases:
>> a) exactly zero bad blocks
>> b) exactly one bad block
>> c) more than one bad block
>>
>> Of course, it is only possible to recover from b), but one *can* tell,
>> whether the situation is a) or b) or c) and act accordingly.
>>
>
> It would seem that either you or Peter Anvin is mistaken.
>
> On page 9 of
> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
> at the end of section 4 it says:
>
> Finally, as a word of caution it should be noted that RAID-6 by
> itself cannot even detect, never mind recover from, dual-disk
> corruption. If two disks are corrupt in the same byte positions,
> the above algorithm will in general introduce additional data
> corruption by corrupting a third drive.
>
>
>> The point that I'm trying to make is, that there does exist a specific
>> case, in which recovery is possible, and that implementing recovery for
>> that case will not hurt in any way.
>>
>
> Assuming that is true (maybe hpa got it wrong), what specific
> conditions would lead to one drive having corrupt data, and would
> correcting it on an occasional 'repair' pass be an appropriate
> response?
>
> Does the value justify the cost of extra code complexity?
>
>
>>> RAID is not designed to protect against bad RAM, bad cables, chipset
>>> bugs, driver bugs, etc. It is only designed to protect against drive
>>> failure, where the drive failure is apparent. i.e. a read must
>>> return either the same data that was last written, or a failure
>>> indication. Anything else is beyond the design parameters for RAID.
>>>
>> I'm taking a more pragmatic approach here. In my opinion, RAID should
>> "just protect my data", against drive failure, yes, of course, but if it
>> can help me in case of occasional data corruption, I'd happily take
>> that, too, especially if it doesn't cost extra... ;-)
>>
>
> Everything costs extra. Code uses bytes of memory, requires
> maintenance, and possibly introduces new bugs. I'm not convinced the
> failure mode that you are considering actually happens with a
> meaningful frequency.
>
People accept the hardware and performance costs of raid-6 in return for
the better security of their data. If I run a check and find that I have
an error, right now I have to treat that the same way as an
unrecoverable failure, because the "repair" function doesn't fix the
data; it just makes the symptom go away by redoing the P and Q values.
This makes the naive user think the problem is solved, when in fact
it's now worse: he has corrupt data with no indication of a problem. The
fact that (most) people who read this list are advanced enough to
understand the issue does not protect the majority of users from their
ignorance. If that sounds elitist, many of the people on this list are
the elite, and even knowing that you need to learn and understand more
is a big plus in my book. It's the people who run repair and assume the
problem is fixed who get hurt by the current behavior.
If you won't fix the recoverable case by recovering, then maybe for
raid-6 you could print an error message like
can't recover data, fix parity and hide the problem (y/N)?
or require a --force flag, and at least give a heads-up to the people
who just picked the "most reliable raid level" because they're trying to
do it right, but need a clue that they have a real and serious problem
that just a "repair" can't fix.
Recovering a filesystem full of "just files" is pretty easy; that's what
backups with CRCs are for. But a large database recovery often takes
hours of restoring and running journal files. I personally consider it
the job of the kernel to do recovery when it is possible; absent that, I
would like the tools to tell me clearly that I have a problem and what
it is.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: raid6 check/repair
From: Eyal Lebedinsky @ 2007-11-29 23:17 UTC
Cc: linux-raid
Neil Brown wrote:
> On Thursday November 22, thiemo.nagel@ph.tum.de wrote:
>> Dear Neil,
>>
>> thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which
>>> data block is wrong if it is known that either 0 or 1 datablocks is
>>> wrong, it is *not* possible to deduce which block or blocks are wrong
>>> if it is possible that more than 1 data block is wrong.
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy,
>> it *is* possible, to distinguish three cases:
>> a) exactly zero bad blocks
>> b) exactly one bad block
>> c) more than one bad block
>>
>> Of course, it is only possible to recover from b), but one *can* tell,
>> whether the situation is a) or b) or c) and act accordingly.
>
> It would seem that either you or Peter Anvin is mistaken.
>
> On page 9 of
> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
> at the end of section 4 it says:
>
> Finally, as a word of caution it should be noted that RAID-6 by
> itself cannot even detect, never mind recover from, dual-disk
> corruption. If two disks are corrupt in the same byte positions,
> the above algorithm will in general introduce additional data
> corruption by corrupting a third drive.
The above a/b/c cases are not correct for raid6. While we can detect
0, 1 or 2 errors, any higher number of errors will be misidentified as
one of these.
The cases we will always see are:
a) no errors - nothing to do
b) one error - correct it
c) two errors - report? take the raid down? recalc syndromes?
and any other case will always appear as *one* of these (not as [c]).
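(A concrete example of such a misidentification, assuming the usual P/Q
syndrome arithmetic: if two data blocks in a stripe are corrupted by the
same error byte, their contributions to the XOR parity cancel, so the
recomputed P matches the stored P while Q does not. The stripe is then
indistinguishable from one whose only problem is a stale Q block, and a
repair would rewrite Q and leave both corrupt data blocks in place.)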
Case [c] is where different users will want to do different things. If my data
is highly critical (would I really use raid6 here and not a higher redundancy
level?) I could consider doing some investigation, e.g. pick each pair
of disks in turn as the faulty ones, correct them, and check that my
data looks good
(fsck? inspect the data visually?) until one pair choice gives good data.
<may be OT>
The quote, saying two errors may not be detected, is not how I understand
ECC schemes to work. Does anyone have other papers that point this out?
Also, is it the case that the raid6 alg detects a failed disk (strip)
or is it actually detecting failed bits and as such the correction is
done to the whole stripe? In other words, values in all failed locations
are fixed (when only 1-error cases are present) and not in just one
strip. This means that we do not necessarily identify the bad disk, and
neither do we need to.
--
Eyal Lebedinsky (eyal@eyal.emu.id.au)
* Re: raid6 check/repair
From: Thiemo Nagel @ 2007-11-30 14:42 UTC
To: Eyal Lebedinsky, Neil Brown; +Cc: linux-raid
Dear Neil and Eyal,
Eyal Lebedinsky wrote:
> Neil Brown wrote:
>> It would seem that either you or Peter Anvin is mistaken.
>>
>> On page 9 of
>> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
>> at the end of section 4 it says:
>>
>> Finally, as a word of caution it should be noted that RAID-6 by
>> itself cannot even detect, never mind recover from, dual-disk
>> corruption. If two disks are corrupt in the same byte positions,
>> the above algorithm will in general introduce additional data
>> corruption by corrupting a third drive.
>
> The above a/b/c cases are not correct for raid6. While we can detect
> 0, 1 or 2 errors, any higher number of errors will be misidentified as
> one of these.
>
> The cases we will always see are:
> a) no errors - nothing to do
> b) one error - correct it
> c) two errors -report? take the raid down? recalc syndromes?
> and any other case will always appear as *one* of these (not as [c]).
I still don't agree. I'll explain the algorithm for error handling that
I have in mind; maybe you can point out if I'm mistaken at some point.
We have n data blocks D1...Dn and two parities P (XOR) and Q
(Reed-Solomon). I assume the existence of two functions to calculate
the parities
P = calc_P(D1, ..., Dn)
Q = calc_Q(D1, ..., Dn)
and two functions to recover a missing data block Dx using either parity
Dx = recover_P(x, D1, ..., Dx-1, Dx+1, ..., Dn, P)
Dx = recover_Q(x, D1, ..., Dx-1, Dx+1, ..., Dn, Q)
This pseudo-code should distinguish between a), b) and c) and properly
repair case b):
P' = calc_P(D1, ..., Dn);
Q' = calc_Q(D1, ..., Dn);
if (P' == P && Q' == Q) {
/* case a): zero errors */
return;
}
if (P' == P && Q' != Q) {
/* case b1): Q is bad, can be fixed */
Q = Q';
return;
}
if (P' != P && Q' == Q) {
/* case b2): P is bad, can be fixed */
P = P';
return;
}
/* both parities are bad, so we try whether the problem can
be fixed by repairing data blocks */
for (i = 1; i <= n; i++) {
/* assume only Di is bad, use P parity to repair */
D' = recover_P(i, D1, ..., Di-1, Di+1, ..., Dn, P);
/* use Q parity to check assumption */
Q' = calc_Q(D1, ..., Di-1, D', Di+1, ..., Dn);
if (Q == Q') {
/* case b3): Q parity is ok, that means the assumption was
correct and we can fix the problem */
Di = D';
return;
}
}
/* case c): when we get here, we have excluded cases a) and b),
so now we really have a problem */
report_unrecoverable_error();
return;
Concerning misidentification: a situation can be imagined in which two
or more simultaneous corruptions have occurred in a very special way, so
that case b3) is diagnosed accidentally. While that is not impossible,
I'd assume the probability of it to be negligible, comparable to that of
undetectable corruption in a RAID 5 setup.
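To make this concrete, below is a small self-contained toy in C that
applies the same case analysis to a single byte position of one stripe.
It uses GF(2^8) with the 0x11d polynomial and generator 2 as in HPA's
paper; the six-disk layout, the function names and the sample values are
made up for illustration, and it is a sketch of the algorithm described
above, not the md code.

#include <stdio.h>

#define NDATA 6  /* number of data disks in this toy stripe */

/* Multiply in GF(2^8) with the 0x11d polynomial. */
static unsigned char gfmul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;

    while (b) {
        if (b & 1)
            p ^= a;
        a = (a & 0x80) ? (unsigned char)((a << 1) ^ 0x1d)
                       : (unsigned char)(a << 1);
        b >>= 1;
    }
    return p;
}

/* P = XOR of all data bytes, Q = sum of g^i * d[i] with g = 2 (Horner form). */
static void syndromes(const unsigned char *d, unsigned char *p, unsigned char *q)
{
    int i;

    *p = *q = 0;
    for (i = NDATA - 1; i >= 0; i--) {
        *p ^= d[i];
        *q = (unsigned char)(gfmul(*q, 2) ^ d[i]);
    }
}

/* Returns 0 if the stripe is consistent, 1 if exactly one block (data, P
 * or Q) was repaired, -1 if more than one block appears bad. */
static int check_repair(unsigned char *d, unsigned char *p, unsigned char *q)
{
    unsigned char pc, qc, pt, qt, saved;
    int i;

    syndromes(d, &pc, &qc);
    if (pc == *p && qc == *q)
        return 0;                        /* case a): consistent */
    if (pc == *p) { *q = qc; return 1; } /* case b1): only Q is bad */
    if (qc == *q) { *p = pc; return 1; } /* case b2): only P is bad */

    /* Both parities mismatch: try each data block as the single culprit.
     * "Repair" it from P, then see whether Q becomes consistent as well. */
    for (i = 0; i < NDATA; i++) {
        saved = d[i];
        d[i] ^= pc ^ *p;                 /* recover_P: force P to match */
        syndromes(d, &pt, &qt);
        if (pt == *p && qt == *q)
            return 1;                    /* case b3): block i was the bad one */
        d[i] = saved;                    /* wrong guess, undo it */
    }
    return -1;                           /* case c): unrecoverable */
}

int main(void)
{
    unsigned char d[NDATA] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66 };
    unsigned char p, q;
    int r;

    syndromes(d, &p, &q);                /* parity of the clean stripe */
    d[3] ^= 0xa5;                        /* corrupt one data block */
    r = check_repair(d, &p, &q);
    printf("result %d, d[3] = 0x%02x\n", r, d[3]);  /* expect 1 and 0x44 */
    return 0;
}

Corrupting a single data byte, as main() does, gets repaired in place;
corrupting only the stored P or Q gets them recalculated; corrupting two
positions usually, though as discussed above not always, ends in the
unrecoverable case.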
Kind regards,
Thiemo
* Re: raid6 check/repair
From: Thiemo Nagel @ 2007-11-30 18:34 UTC
To: Neil Brown; +Cc: linux-raid
Dear Neil,
>> The point that I'm trying to make is, that there does exist a specific
>> case, in which recovery is possible, and that implementing recovery for
>> that case will not hurt in any way.
>
> Assuming that is true (maybe hpa got it wrong), what specific
> conditions would lead to one drive having corrupt data, and would
> correcting it on an occasional 'repair' pass be an appropriate
> response?
The use case for the proposed 'repair' would be occasional,
low-frequency corruption, for which many sources can be imagined:
Any piece of hardware has a certain failure rate, which may depend on
things like age, temperature, stability of operating voltage, cosmic
rays, etc. but also on variations in the production process. Therefore,
hardware may suffer from infrequent glitches, which are seldom enough
to be impossible to trace back to a particular piece of equipment. It
would be nice to recover gracefully from that.
Kernel bugs or just plain administrator mistakes are another thing.
The case of power loss during writing that you have mentioned could
also profit from that 'repair': with heterogeneous hardware, blocks may
be written in unpredictable order, so that graceful recovery would be
possible in more cases with 'repair' than with just recalculating
parity.
> Does the value justify the cost of extra code complexity?
In the case of protecting data integrity, I'd say 'yes'.
> Everything costs extra. Code uses bytes of memory, requires
> maintenance, and possibly introduces new bugs.
Of course, you are right. However, in my other email, I tried to sketch
a piece of code which is very lean as it makes use of functions which I
assume to exist. (Sorry, I didn't look at the md code yet, so please
correct me if I'm wrong.) Therefore I assume the costs in memory,
maintenance and bugs to be rather low.
Kind regards,
Thiemo
* Re: raid6 check/repair
From: Thiemo Nagel @ 2007-11-21 13:45 UTC
To: neilb, linux-raid
Dear Neil,
>> I have been looking a bit at the check/repair functionality in the
>> raid6 personality.
>>
>> It seems that if an inconsistent stripe is found during repair, md
>> does not try to determine which block is corrupt (using e.g. the
>> method in section 4 of HPA's raid6 paper), but just recomputes the
>> parity blocks - i.e. the same way as inconsistent raid5 stripes are
>> handled.
>>
>> Correct?
>
> Correct!
>
> The most likely cause of parity being incorrect is if a write to
> data + P + Q was interrupted when one or two of those had been
> written, but the other had not.
>
> No matter which was or was not written, correcting P and Q will produce
> a 'correct' result, and it is simple. I really don't see any
> justification for being more clever.
My opinion about that is quite different. Speaking just for myself:
a) When I put my data on a RAID running on Linux, I'd expect the
software to do everything which is possible to protect and when
necessary to restore data integrity. (This expectation was one of the
reasons why I chose software RAID with Linux.)
b) As a consequence of a): When I'm using a RAID level that has extra
redundancy, I'd expect Linux to make use of that extra redundancy during
a 'repair'. (Otherwise I'd consider repair a misnomer and rather call
it 'recalc parity'.)
c) Why should 'repair' be implemented in a way that only works in most
cases when there exists a solution that works in all cases? (After all,
possibilities for corruption are many, e.g. bad RAM, bad cables, chipset
bugs, driver bugs, last but not least human mistake. From all these
errors I'd like to be able to recover gracefully without putting the
array at risk by removing and readding a component device.)
Bottom line: So far I was talking about *my* expectations; is it
reasonable to assume that they are shared by others? Are there any
arguments I'm not aware of that speak against an improved
implementation of 'repair'?
BTW: I just checked, it's the same for RAID 1: When I intentionally
corrupt a sector in the first device of a set of 16, 'repair' copies the
corrupted data to the 15 remaining devices instead of restoring the
correct sector from one of the other fifteen devices to the first.
Thank you for your time.
Kind regards,
Thiemo Nagel
P.S.: I've re-sent this mail as the first one didn't get through
majordomo. (Yes, it had a vcard attached. Yes, I have been told. Yes,
I am sorry.)
* Re: raid6 check/repair
From: Thiemo Nagel @ 2007-12-14 15:25 UTC
To: neilb; +Cc: linux-raid
Dear Neil,
this thread has died out, but I'd prefer not to let it end without any
kind of result being reached. Therefore, I'm kindly asking you to draw
a conclusion from the arguments that have been exchanged:
Concerning the implementation of a 'repair' that can actually recover
data in some cases instead of just recalculating parity:
Do you
a) oppose the case (patches not accepted)
b) don't care (but potentially accept patches)
c) support it
Thank you very much and kind regards,
Thiemo Nagel
* raid6 check/repair
From: Leif Nixon @ 2007-11-15 15:28 UTC
To: linux-raid
Hi,
I have been looking a bit at the check/repair functionality in the
raid6 personality.
It seems that if an inconsistent stripe is found during repair, md
does not try to determine which block is corrupt (using e.g. the
method in section 4 of HPA's raid6 paper), but just recomputes the
parity blocks - i.e. the same way as inconsistent raid5 stripes are
handled.
Correct?
--
Leif Nixon - Systems expert
------------------------------------------------------------
National Supercomputer Centre - Linkoping University
------------------------------------------------------------
* Re: raid6 check/repair
From: Neil Brown @ 2007-11-16 4:26 UTC
To: Leif Nixon; +Cc: linux-raid
On Thursday November 15, nixon@nsc.liu.se wrote:
> Hi,
>
> I have been looking a bit at the check/repair functionality in the
> raid6 personality.
>
> It seems that if an inconsistent stripe is found during repair, md
> does not try to determine which block is corrupt (using e.g. the
> method in section 4 of HPA's raid6 paper), but just recomputes the
> parity blocks - i.e. the same way as inconsistent raid5 stripes are
> handled.
>
> Correct?
Correct!
The most likely cause of parity being incorrect is if a write to
data + P + Q was interrupted when one or two of those had been
written, but the other had not.
No matter which was or was not written, correcting P and Q will produce
a 'correct' result, and it is simple. I really don't see any
justification for being more clever.
NeilBrown