From: David Brown <david.brown@hesbynett.no>
To: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
Cc: Andrea Mazzoleni <amadvance@gmail.com>,
	linux-raid@vger.kernel.org, linux-btrfs@vger.kernel.org,
	hpa@zytor.com, creamyfish@gmail.com
Subject: Re: Triple parity and beyond
Date: Thu, 21 Nov 2013 21:31:46 +0100
Message-ID: <528E6DB2.6050005@hesbynett.no>
In-Reply-To: <20131121200545.GB1916@lazy.lzy>

On 21/11/13 21:05, Piergiorgio Sartor wrote:
> On Thu, Nov 21, 2013 at 11:13:29AM +0100, David Brown wrote:
> [...]
>> Ah, you are trying to find which disk has incorrect data so that you can
>> change just that one disk?  There are dangers with that...
> 
> Hi David,
> 
>> <http://neil.brown.name/blog/20100211050355>
> 
> I think we already did the exercise, here :-)
> 
>> If you disagree with this blog post (and I urge you to read it in full
> 
> We discussed the topic (with Neil) and, if I
> recall correctly, he is against having an
> _automatic_ error detection and correction _in_
> kernel.
> I fully agree with that: user space is better
> and it should not be automatic, but it should
> do things under user control.
> 

OK.

> The current "check" operation is pretty poor.
> It just reports how many mismatches, it does
> not even report where in the array.
> The first step, independent from how many
> parities one has, would be to tell the user
> where the mismatches occurred, so it would
> be possible to check the FS at that position.

Certainly it would be good to give the user more information.  If you
can tell the user where the errors are, and what the likely failed block
is, then that would be very useful.  If you can tell where it is in the
filesystem (such as which file, if any, owns the blocks in question)
then that would be even better.
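
As a rough illustration of "telling the user where", here is a minimal
sketch in Python.  It works on a made-up in-memory model of an array
(a list of stripes, each with its data blocks and stored XOR parity),
not on md's real on-disk layout.  The names find_mismatches, stripes
and chunk_size are purely hypothetical, and the per-member byte offset
assumes the data starts at offset 0 on each disk.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single, RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def find_mismatches(stripes, chunk_size):
    """Return (stripe_index, byte_offset) for each inconsistent stripe.

    `stripes` is a hypothetical in-memory model: a list of
    (data_blocks, stored_parity) tuples, one per stripe.  The offset is
    the position of the stripe on each member disk, assuming the data
    area starts at offset 0 (an assumption of this sketch).
    """
    mismatches = []
    for index, (data_blocks, stored_parity) in enumerate(stripes):
        if xor_blocks(data_blocks) != stored_parity:
            mismatches.append((index, index * chunk_size))
    return mismatches
```

A list like this, rather than a bare count, is what would let a
user-space tool go on to ask the filesystem which file (if any) owns
the affected blocks.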

> Having a multi-parity RAID even allows checking
> which disk.
> This would provide the user with a more
> comprehensive (I forgot the spelling)
> information.
> 
> Of course, since we are there, we can
> also give the option to fix it.
> This would be much like a "fsck".

If this can all be done to give the user an informed choice, then it
sounds good.

One issue here is whether the check should be done with the filesystem
mounted and in use, or only off-line.  If it is off-line then it will
mean a long down-time while the array is checked - but if it is online,
then there is the risk of confusing the filesystem and caches by
changing the data.

> 
>> first), then this is how I would do a "smart" stripe recovery:
>>
>> First calculate the parities from the data blocks, and compare these
>> with the existing parity blocks.
>>
>> If they all match, the stripe is consistent.
>>
>> Normal (detectable) disk errors and unrecoverable read errors get
>> flagged by the disk and the IO system, and you /know/ there is a problem
>> with that block.  Whether it is a data block or a parity block, you
>> re-generate the correct data and store it - that's what your raid is for.
> 
> That's not always the case, otherwise
> having the mismatch count would be useless.
> The issue is that errors appear, whatever
> the reason, without being reported by the
> underlying hardware.
>  

(I know you know how this works, so I am not trying to be patronising
with this explanation - I just think we have slightly misunderstood what
the other is saying, so spelling it out will hopefully make it clearer.)

Most disk errors /are/ detectable, and are reported by the underlying
hardware - small surface errors are corrected by the disk's own error
checking and correcting mechanisms, and larger errors are usually
detected.  It is (or should be!) very rare that a read error goes
undetected without there being a major problem with the disk controller.
And if the error is detected, then the normal raid processing kicks in
as there is no doubt about which block has problems.

>> If you have no detected read errors, and there is one parity
>> inconsistency, then /probably/ that block has had an undetected read
>> error, or it simply has not been written completely before a crash.
>> Either way, just re-write the correct parity.
> 
> Why re-write the parity if I can get
> the correct data there?
> If I can be sure that one data block is
> incorrect and I can re-create it properly,
> that's the thing to do.

If you can be /sure/ about which data block is incorrect, then I agree -
but you can't be /entirely/ sure.  But I agree that you can make a good
enough guess to recommend a fix to the user - as long as it is not
automatic.
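
To show what such a "good enough guess" could look like with two
parities, here is a sketch of the usual RAID-6 P/Q arithmetic over
GF(2^8) (polynomial 0x11d, generator 2, which as far as I know is what
md uses).  If exactly one data block is wrong, the ratio of the Q and P
differences gives the index of that block.  The function names and the
verdict strings are invented for this example, and the result is only a
suggestion to show the user, not proof of which block is bad.

```python
def _build_tables():
    """Build exp/log tables for GF(2^8), polynomial 0x11d, generator 2."""
    exp, log = [0] * 510, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 510):
        exp[i] = exp[i - 255]  # duplicate so log sums need no modulo
    return exp, log


EXP, LOG = _build_tables()


def gf_mul(a, b):
    """Multiply two field elements using the log/exp tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(column):
    """P and Q for one byte position; column[i] is the byte from data disk i."""
    p = q = 0
    for i, d in enumerate(column):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def parity_blocks(blocks):
    """Compute the full P and Q blocks for a list of equal-length data blocks."""
    pairs = [syndromes(column) for column in zip(*blocks)]
    return bytes(p for p, _ in pairs), bytes(q for _, q in pairs)


def diagnose_byte(column, stored_p, stored_q):
    """Guess the culprit at one byte position, assuming at most one bad block."""
    p, q = syndromes(column)
    dp, dq = p ^ stored_p, q ^ stored_q
    if dp == 0 and dq == 0:
        return "consistent"
    if dq == 0:
        return "P suspect"
    if dp == 0:
        return "Q suspect"
    z = (LOG[dq] - LOG[dp]) % 255  # dq = g^z * dp if only data disk z is wrong
    if z < len(column):
        return f"data disk {z} suspect"
    return "more than one block wrong"


def diagnose_stripe(blocks, stored_p, stored_q):
    """Combine per-byte verdicts; anything but a unanimous answer is ambiguous."""
    verdicts = set()
    for pos, column in enumerate(zip(*blocks)):
        verdict = diagnose_byte(column, stored_p[pos], stored_q[pos])
        if verdict != "consistent":
            verdicts.add(verdict)
    if not verdicts:
        return "consistent"
    return verdicts.pop() if len(verdicts) == 1 else "ambiguous"
```

Note that the verdict rests on the assumption that only one block in
the stripe is wrong.  Two bad blocks can produce differences that
happen to point at a third, innocent disk, which is one reason the
final decision should stay with the user.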

>  
>> Remember, this is not a general error detection and correction scheme -
> 
> It is not, but it could be. For free.
> 

For most ECC schemes, you know that all your blocks are set
synchronously - so any block that does not fit in, is an error.  With
raid, it could also be that a stripe is only partly written - you can
have two different valid sets of data mixed to give an inconsistent
stripe, without any good way of telling what consistent data is the best
choice.
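
A toy example of the partly written case, again only a sketch with
invented names (mixed_states, old, new): take a stripe with single XOR
parity, an old version and a new version, and look at every
combination of blocks that might have reached the disks before a
crash.

```python
from functools import reduce
from itertools import product


def xor_parity(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def mixed_states(old, new):
    """Map each 'which blocks were written' mask to parity consistency.

    `old` and `new` are lists of data blocks for the same stripe.  The
    mask has one flag per data block plus a final flag for the parity
    block; True means the new version of that block reached the disk.
    """
    old_full = old + [xor_parity(old)]
    new_full = new + [xor_parity(new)]
    results = {}
    for mask in product([False, True], repeat=len(old_full)):
        stripe = [n if written else o
                  for o, n, written in zip(old_full, new_full, mask)]
        results[mask] = xor_parity(stripe[:-1]) == stripe[-1]
    return results
```

With old = [b"\x01", b"\x02"] and new = [b"\x03", b"\x04"], only the
all-old and all-new states come out consistent.  Every mixed state
fails the parity check even though each individual block holds
perfectly valid data from one version or the other.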

Perhaps a checking tool can take advantage of a write-intent bitmap (if
there is one) so that it knows if an inconsistent stripe is partly
updated or the result of a disk error.
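
Something along these lines, as a sketch only: the bitmap is modelled
here as a plain set of dirty region numbers, and classify_mismatch,
dirty_regions and stripes_per_region are hypothetical names, not md's
actual interface.

```python
def classify_mismatch(stripe_index, dirty_regions, stripes_per_region):
    """Give a hint about why a stripe is inconsistent.

    `dirty_regions` is a hypothetical set of region numbers whose
    write-intent bit is still set; each region covers
    `stripes_per_region` consecutive stripes.
    """
    region = stripe_index // stripes_per_region
    if region in dirty_regions:
        # A write may have been in flight here, so old and new data
        # could be mixed in the stripe.
        return "possibly interrupted write"
    # No write was pending according to the bitmap, so an undetected
    # disk error is the more likely explanation.
    return "possible silent corruption"
```

A set bit only says that a write may have been in flight somewhere in
that region, so the answer is still a hint for the user rather than a
verdict.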

mvh.,

David


