Re: md road-map: 2011

linux-raid.vger.kernel.org archive mirror

From: Phil Turmel <philip@turmel.org>
To: NeilBrown <neilb@suse.de>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
	linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Wed, 16 Feb 2011 20:14:50 -0500
Message-ID: <4D5C768A.1010502@turmel.org>
In-Reply-To: <20110217115257.28a8d174@notabene.brown>

On 02/16/2011 07:52 PM, NeilBrown wrote:
> On Wed, 16 Feb 2011 19:24:15 -0500 Phil Turmel <philip@turmel.org> wrote:
> 
>> On 02/16/2011 04:48 PM, NeilBrown wrote:
>>> On Wed, 16 Feb 2011 21:29:39 +0100 Piergiorgio Sartor
>>>>
>>>>> Better reporting of inconsistencies.
>>>>> ------------------------------------
>>>>>
>>>>> When a 'check' finds a data inconsistency it would be useful if it
>>>>> was reported.   That would allow a sysadmin to try to understand the
>>>>> cause and possibly fix it.
>>>>
>>>> Could you, please, consider adding, for RAID-6, the
>>>> capability to also report which device potentially
>>>> has the problem? Thanks!
>>>
>>> I would rather leave that to user-space.  If I report where the problem is, a
>>> tool could directly read all the blocks in that stripe and perform any fancy
>>> calculations you like.  I may even write that tool (but no promises).
>>
>> Hmmm.  The existing "check" code, if it encounters a read error, will use
>> available redundancy to recover that data and rewrite it on the spot.
>>
>> Without a read error, or with multiple redundancy, the calculations to
>> check consistency are performed and reported.  With all the data "hot", and half
>> the calculation to pinpoint an inconsistency done, it seems a shame to have
>> userspace redo it.
>>
>> Are you adamantly opposed to the kernel doing this?  (For Raid6)  Code talks,
>> of course, but I'd rather not start if I'm only going to be shot down.
>>
> 
> I like to think I remain open-minded to any compelling arguments.
> 
> However putting code into the kernel which *only* tells user-space something
> that it could figure out for itself doesn't sound sensible - though it
> depends a bit on how much code.
> 
> Also - as I understand it - the RAID6 code works on a byte-by-byte basis.
> Thus the P and Q bytes are computed from the N data bytes, and collections of
> these bytes form blocks.
> 
> The "which block is bad" calculation takes the data bytes and the P and Q
> bytes and produces a new byte.  If that byte is some i < N, it means that just
> changing data byte i can make P and Q consistent.  (If it is N, then the P
> byte is bad; if it is N+1, then the Q byte is bad.)  If it is >N+1, then
> ... possibly multiple bytes are bad .. my knowledge gets hazy here.
> 
> So when you do the computation on all of the bytes in all of the blocks you
> get a block full of answers.
> If the answers are all the same - that tells you something fairly strong.
> If they are all different, then that is also a fairly strong statement.
> But what if most are the same, but a few are different?  How do you interpret
> that?

Actually, I was thinking about that.  (You suckered me into reading that PDF
some weeks ago.)  I would be inclined to let the kernel make corrections only
where the answers are "all the same" across an entire sector, using the sector
size reported by the underlying device.

Also, the comparison would have to ignore "neutral bytes", where P & Q
happened to be correct for that byte position.
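To make the quoted per-byte calculation concrete, here is a minimal Python
sketch. It assumes the GF(2^8) field used by the Linux RAID-6 code
(polynomial 0x11d, generator {02}); the names `syndromes`, `locate_bad_byte`
and `NEUTRAL` are hypothetical illustrations, not md interfaces. It returns
the answer codes described above (i < N for data byte i, N for P, N+1 for Q),
uses N+2 for "no single byte explains it", and returns `NEUTRAL` for the
neutral bytes where P and Q already agree.

```python
# Minimal sketch: RAID-6 P/Q syndromes and a per-byte "which block is bad" check.
# Assumption: GF(2^8) with polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), g = {02}.
from typing import List, Tuple

GF_EXP = [0] * 510  # GF_EXP[i] = g^i, doubled so summed logs need no modulo
GF_LOG = [0] * 256  # GF_LOG[x] = i with g^i == x (meaningless for x == 0)

_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    GF_EXP[_i] = GF_EXP[_i - 255]

NEUTRAL = -1  # hypothetical marker: P and Q already agree with the data


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def syndromes(data: List[int]) -> Tuple[int, int]:
    """P = XOR of all D_i; Q = XOR of g^i * D_i, for one byte position."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q


def locate_bad_byte(data: List[int], p: int, q: int) -> int:
    """Return i < N (data byte i), N (P byte), N+1 (Q byte), N+2 (no single
    byte explains the mismatch), or NEUTRAL (nothing is inconsistent)."""
    n = len(data)
    p_calc, q_calc = syndromes(data)
    p_star = p ^ p_calc  # stored P vs recomputed P
    q_star = q ^ q_calc  # stored Q vs recomputed Q
    if p_star == 0 and q_star == 0:
        return NEUTRAL
    if q_star == 0:
        return n  # only P disagrees: the P byte is bad
    if p_star == 0:
        return n + 1  # only Q disagrees: the Q byte is bad
    # A single bad data byte z gives q_star = g^z * p_star,
    # so z = log(q_star) - log(p_star) (mod 255).
    z = (GF_LOG[q_star] - GF_LOG[p_star]) % 255
    return z if z < n else n + 2
```

Note that two or more bad bytes in the same position can alias to a
plausible-looking z < N, which is one reason to demand agreement across a
whole sector before acting on the answer.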

> The point I'm trying to get to is that the result of this RAID6 calculation
> isn't a simple "that device is bad".  It is a block of data that needs to be
> interpreted.
> 
> I'd rather have user-space do that interpretation, so it may as well do the
> calculation too.
> 
> If you wanted to do it in the kernel, you would need to be very clear about
> what information you provide, what it means exactly, and why it is sufficient.

Given that the hardware does its error detection and correction at sector
granularity, and that the kernel would in fact rewrite a sector from this same
redundancy if the hardware made a "fairly strong" statement that it can't be
trusted (a read error), I'd argue that rewriting the sector is equally
appropriate when this calculation makes a fairly strong statement about it.

Any corrective action that isn't consistent at the sector level should be punted.
I'm very curious what percentage that would be in production environments.
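The per-sector policy above can be sketched as follows, assuming per-byte
answers already encoded as in the quoted text (data index, N for P, N+1 for
Q, N+2 for "no single byte", and a hypothetical `NEUTRAL` marker of -1). The
function name `classify_sectors` is illustrative only, and the default
512-byte sector is an assumption; the real value would presumably come from
the device, for example the `physical_block_size` attribute under
`/sys/block/<dev>/queue/`.

```python
# Minimal sketch of a per-sector "rewrite or punt" decision (hypothetical names).
from typing import List, Optional, Tuple

NEUTRAL = -1  # hypothetical marker: P and Q already agree for this byte


def classify_sectors(
    answers: List[int], n_data: int, sector_size: int = 512
) -> List[Tuple[str, Optional[int]]]:
    """Group per-byte answers into sectors and decide, per sector, whether a
    single device can be blamed.  Device n_data is P, n_data + 1 is Q."""
    decisions: List[Tuple[str, Optional[int]]] = []
    for start in range(0, len(answers), sector_size):
        sector = answers[start:start + sector_size]
        # Neutral bytes say nothing about which device is wrong, so skip them.
        blamed = {a for a in sector if a != NEUTRAL}
        if not blamed:
            decisions.append(("consistent", None))
        elif len(blamed) == 1 and min(blamed) <= n_data + 1:
            # Every informative byte in the sector blames the same device.
            decisions.append(("rewrite", min(blamed)))
        else:
            # Disagreement, or "no single byte explains it" (n_data + 2).
            decisions.append(("punt", None))
    return decisions
```

Here "rewrite" would mean regenerating that device's sector from the others,
and "punt" is read as leaving the sector untouched and reporting it.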

Phil

Thread overview: 52+ messages
2011-02-16 10:27 md road-map: 2011 NeilBrown
2011-02-16 11:28 ` Giovanni Tessore
2011-02-16 13:40   ` Roberto Spadim
2011-02-16 14:00     ` Robin Hill
2011-02-16 14:09       ` Roberto Spadim
2011-02-16 14:21         ` Roberto Spadim
2011-02-16 21:55           ` NeilBrown
2011-02-17  1:30             ` Roberto Spadim
2011-02-16 14:13 ` Joe Landman
2011-02-16 21:24   ` NeilBrown
2011-02-16 21:44     ` Roman Mamedov
2011-02-16 21:59       ` NeilBrown
2011-02-17  0:48         ` Phil Turmel
2011-02-16 22:12       ` Joe Landman
2011-02-16 15:42 ` David Brown
2011-02-16 21:35   ` NeilBrown
2011-02-16 22:34     ` David Brown
2011-02-16 23:01       ` NeilBrown
2011-02-17  0:30         ` David Brown
2011-02-17  0:55           ` NeilBrown
2011-02-17  1:04           ` Keld Jørn Simonsen
2011-02-17 10:45             ` David Brown
2011-02-17 10:58               ` Keld Jørn Simonsen
2011-02-17 11:45                 ` Giovanni Tessore
2011-02-17 15:44                   ` Keld Jørn Simonsen
2011-02-17 16:22                     ` Roberto Spadim
2011-02-18  0:13                     ` Giovanni Tessore
2011-02-18  2:56                       ` Keld Jørn Simonsen
2011-02-18  4:27                         ` Roberto Spadim
2011-02-18  9:47                         ` Giovanni Tessore
2011-02-18 18:43                           ` Keld Jørn Simonsen
2011-02-18 19:00                             ` Roberto Spadim
2011-02-18 19:18                               ` Keld Jørn Simonsen
2011-02-18 19:22                                 ` Roberto Spadim
2011-02-16 17:20 ` Joe Landman
2011-02-16 21:36   ` NeilBrown
2011-02-16 19:37 ` Phil Turmel
2011-02-16 21:44   ` NeilBrown
2011-02-17  0:11     ` Phil Turmel
2011-02-16 20:29 ` Piergiorgio Sartor
2011-02-16 21:48   ` NeilBrown
2011-02-16 22:53     ` Piergiorgio Sartor
2011-02-17  0:24     ` Phil Turmel
2011-02-17  0:52       ` NeilBrown
2011-02-17  1:14         ` Phil Turmel [this message]
2011-02-17  3:10           ` NeilBrown
2011-02-17 18:46             ` Phil Turmel
2011-02-17 21:04             ` Mr. James W. Laferriere
2011-02-18  1:48               ` NeilBrown
2011-02-17 19:56           ` Piergiorgio Sartor
2011-02-16 22:50 ` Keld Jørn Simonsen
2011-02-23  5:06 ` Daniel Reurich
