From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: Neil Brown <neilb@suse.de>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
	Steven Haigh <netwiz@crc.id.au>, Bill Davidsen <davidsen@tmr.com>,
	Bryan Mesich <bryan.mesich@ndsu.edu>,
	Jon@eHardcastle.com, linux-raid@vger.kernel.org
Subject: Re: Why does one get mismatches?
Date: Fri, 19 Feb 2010 23:37:36 +0100
Message-ID: <20100219223736.GA2381@lazy.lzy>
In-Reply-To: <20100220090208.06c1130f@notabene.brown>

Hi,

> > Or it is like that because we trust the filesystem?
> 
> It is because we trust the filesystem.

Well, I hope the trust is not misplaced... :-)

> md is not in a position to lock the page - there is simply no way it can stop
> the filesystem from changing it.

How can this be?

> The only thing it could do would be to make a copy, then write the copy out.

Would even making a copy be safe? The data could
still change during the copy, couldn't it?

> This would incur a performance cost.

It's a matter of deciding what is more important.

> > It seems to me, maybe I'm wrong, not a so safe design.
> 
> I think you are wrong.

Could be; I have never heard of situations like this.
 
> > I assume, it should not be possible to cause this
> > situation, unless there is a crash or a bug in the
> > md layer.
> 
> I'm not sure what situation you are referring to...

It should not be possible for the different mirrors
of a RAID-1 to end up with different data.

Otherwise, there is no point in having the mirroring.
 
> > What if a new filesystem will write a block, changing
> > on the fly, i.e. during RAID-1 writes, and then, later,
> > reading this block again?
> > 
> > It will get, maybe, not the correct data.
> 
> This is correct.  However it would be equally correct if you were talking
> about a normal disk drive rather than a RAID1 pair.

No, no, there is a huge difference.
In the single-drive case, the FS is responsible for writing
rubbish to a single block. The result would be that the
block has "strange" data, but *always* the same data.

Here the situation is that the data might be "strange",
but different accesses, to the same block of the RAID-1,
could potentially return different data.

As a byproduct of this effect, the "check" functionality
is no longer so useful.
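
To make the difference concrete, here is a minimal sketch in Python.
It is only a toy model under stated assumptions: the Raid1 class, its
check() method and the between_copies hook are hypothetical names made
up for illustration, and nothing here is how md is actually implemented.
Each mirror is a dict, and the hook stands in for the filesystem
touching the page while the write is still in flight.

```python
# Toy model (hypothetical, not md code): each mirror maps block number -> bytes.
class Raid1:
    def __init__(self, n_mirrors=2):
        self.mirrors = [dict() for _ in range(n_mirrors)]

    def write(self, block, buf, between_copies=None):
        """Write buf to each mirror in turn without taking a private copy.

        between_copies, if given, is called after each mirror is written.
        It stands in for the filesystem changing the page mid-write.
        """
        for mirror in self.mirrors:
            # bytes(buf) models what the device sees at this instant.
            mirror[block] = bytes(buf)
            if between_copies is not None:
                between_copies(buf)

    def read(self, block, mirror_index):
        return self.mirrors[mirror_index][block]

    def check(self):
        """Count blocks whose contents differ between mirrors."""
        blocks = set().union(*(m.keys() for m in self.mirrors))
        return sum(1 for b in blocks
                   if len({m.get(b) for m in self.mirrors}) > 1)


def fs_modifies(buf):
    # The "filesystem" rewrites the page while the write is pending.
    buf[0:4] = b"BBBB"


# RAID-1 case: the two mirrors capture the page at different instants.
page = bytearray(b"AAAA")
array = Raid1(n_mirrors=2)
array.write(0, page, between_copies=fs_modifies)
first_mirror = array.read(0, 0)    # b"AAAA"
second_mirror = array.read(0, 1)   # b"BBBB"
mismatches = array.check()         # 1

# Single-drive case: the block may be "strange", but every read agrees.
single_page = bytearray(b"AAAA")
single = Raid1(n_mirrors=1)
single.write(0, single_page, between_copies=fs_modifies)
single_reads = {single.read(0, 0) for _ in range(3)}
```

In this model the single drive always returns the same bytes, while
the two mirrors disagree and the toy check() reports one mismatch
even though no hardware has failed.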

> If the filesystem changes the page (or allows it to change) while a write is
> pending, then it cannot know what actual data was written.  So it must write
> the block out again before it ever reads it in.
> RAID1 is no different to any other device in this respect.

It is different, as mentioned above.

The FS could, intentionally, change the data during a write,
but later it could expect to always read back the same data.

In other words, the FS does not guarantee the "spatial"
consistency of the data (the bytes in a block), but the
"temporal" consistency (successive reads always return
the same data) could be expected. This holds in the
case of a normal HDD. It does not hold for RAID-1.

> Possibly, but at what cost?

As I wrote: it is a matter of deciding what is more
important and useful.

> There are two ways that I can imagine to 'solve' this issue.
> 
> 1/ always copy the page before writing.  This would incur a significant
>   overhead, both in the complexity of pre-allocating memory and in the
>   delay taken to perform the copy.  And it would very rarely be actually
>   needed.

Does a copy really solve the issue? Is the copy done
atomically?
The pre-allocation does not seem to me to be a problem,
since it would be done once and for all (at device creation),
and not dynamically.
The copy *might* be an overhead; nevertheless, I wonder if it
is really so much of a problem, especially considering that,
after the copy, the MD layer can optimize the transaction
to the HDDs as much as it likes.
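
On the atomicity question, a minimal sketch in Python may help to
separate the two concerns. It is a toy model, not md code: torn_copy,
write_with_copy and the mutate_midway hook are hypothetical names,
and the copy is deliberately made non-atomic by letting the page
change halfway through it.

```python
def torn_copy(page, mutate_midway=None):
    """Copy page in two halves, allowing a change between the halves.

    This models a copy that is NOT atomic with respect to the writer.
    """
    half = len(page) // 2
    first = bytes(page[:half])
    if mutate_midway is not None:
        mutate_midway(page)
    second = bytes(page[half:])
    return first + second


def write_with_copy(mirrors, block, page, mutate_midway=None):
    """Take one private snapshot, then write that snapshot to every mirror."""
    snapshot = torn_copy(page, mutate_midway)
    for mirror in mirrors:
        mirror[block] = snapshot
        # Later changes to page no longer matter: only snapshot is written.
        page[:] = b"Z" * len(page)
    return snapshot


def fs_modifies(buf):
    # The "filesystem" rewrites the page while the copy is in progress.
    buf[0:4] = b"BBBB"


mirrors = [dict(), dict()]
page = bytearray(b"AAAA")
snapshot = write_with_copy(mirrors, 0, page, mutate_midway=fs_modifies)
mirrors_agree = mirrors[0][0] == mirrors[1][0]   # True
```

In this model the snapshot comes out as b"AABB", so the block is
"strange" (no spatial consistency), yet both mirrors hold exactly the
same bytes. Under these assumptions the copy does not need to be
atomic for the mirrors to stay identical; it only needs to be taken
once and then left alone.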

> 2/ Have the filesystem protect the page from changes while it is being
>    written.  This is quite possible for the filesystem to do (while it
>    is impossible for md to do).  There could be some performance

I'm really curious to understand what kind of thinking
is behind a design allowing such a situation...
I mean *system* design, not md design.

>    cost with memory-mapped pages as they would need to be unmapped,
>    but there would be no significant cost for reads, writes, and filesystem
>    metadata operations.
>    Further, any filesystem that wants to make use of the integrity checks
>    that newer drives provide (where the filesystem provides a 'checksum' for
>    the block which gets passed all the way down and written to storage, and
>    returned on a read) will need to do this anyway.  So it is likely that in
>    the near future all significant filesystems will provide all the
>    guarantees md needs in order to simply do nothing different.

That's good to know.
 
> So my feeling is that md is doing the best thing already.

I do not think this is an md issue per se; from the description,
it seems to me that this is an overall design issue.

Normally, also for performance reasons, one approach is
to allocate queue(s) of buffers between two modules (like
FS and MD), where each module always has *exclusive*
access to its own buffer(s), i.e. the buffer(s) it holds
in a certain time frame.
Once a module releases the buffer(s), it can no longer
touch them (read or write).
Once the buffer(s) arrive at the other module, that module
can do whatever it wants with them, and it is sure
it has exclusive access to them.

Normally real-time systems use techniques like this to
guarantee consistency *and* performance.
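
The hand-over scheme described above can be sketched in Python as
follows. This is only an illustration under assumptions: OwnedBuffer,
hand_over and the owner names "fs" and "md" are hypothetical, and the
ownership check here is a plain runtime test, whereas a real system
would have to enforce it by other means (for example, by unmapping or
write-protecting the page).

```python
import queue


class OwnedBuffer:
    """A buffer that only its current owner may read or write."""

    def __init__(self, size, owner):
        self._data = bytearray(size)
        self._owner = owner

    def _require(self, who):
        if who != self._owner:
            raise PermissionError(f"{who} does not own this buffer")

    def write(self, who, payload):
        self._require(who)
        self._data[:len(payload)] = payload

    def read(self, who):
        self._require(who)
        return bytes(self._data)

    def hand_over(self, who, new_owner, q):
        """Release the buffer to new_owner by placing it on queue q."""
        self._require(who)
        self._owner = new_owner
        q.put(self)


# The "fs" module fills a buffer and releases it to the "md" module.
to_md = queue.Queue()
buf = OwnedBuffer(4, owner="fs")
buf.write("fs", b"AAAA")
buf.hand_over("fs", "md", to_md)

# "md" now has exclusive access and sees stable data.
received = to_md.get()
seen_by_md = received.read("md")   # b"AAAA"

# Any later access by "fs" is rejected instead of silently racing.
try:
    buf.write("fs", b"BBBB")
    late_write_rejected = False
except PermissionError:
    late_write_rejected = True
```

With this discipline the receiving module could write the same bytes
to every mirror without taking a copy, since nobody else is allowed
to change the buffer while it holds it.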

Anyway, thanks for the clarifications,

bye,

-- 

piergiorgio
