From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: Neil Brown <neilb@suse.de>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
Steven Haigh <netwiz@crc.id.au>, Bill Davidsen <davidsen@tmr.com>,
Bryan Mesich <bryan.mesich@ndsu.edu>,
Jon@eHardcastle.com, linux-raid@vger.kernel.org
Subject: Re: Why does one get mismatches?
Date: Fri, 19 Feb 2010 23:37:36 +0100
Message-ID: <20100219223736.GA2381@lazy.lzy>
In-Reply-To: <20100220090208.06c1130f@notabene.brown>
Hi,
> > Or it is like that because we trust the filesystem?
>
> It is because we trust the filesystem.
well, I hope the trust is not misplaced... :-)
> md is not in a position to lock the page - there is simply no way it can stop
> the filesystem from changing it.
How can this be?
> The only thing it could do would be to make a copy, then write the copy out.
Even making a copy would not be safe, since the data
could still change during the copy, couldn't it?
> This would incur a performance cost.
It's a matter of deciding what is more important.
> > It seems to me, maybe I'm wrong, not a so safe design.
>
> I think you are wrong.
Could be, I never heard of situations like this.
> > I assume, it should not be possible to cause this
> > situation, unless there is a crash or a bug in the
> > md layer.
>
> I'm not sure what situation you are referring to...
It should not be possible for different mirrors
of a RAID-1 to end up with different data.
Otherwise, there is no point in having the mirroring.
> > What if a new filesystem will write a block, changing
> > on the fly, i.e. during RAID-1 writes, and then, later,
> > reading this block again?
> >
> > It will get, maybe, not the correct data.
>
> This is correct. However it would be equally correct if you were talking
> about a normal disk drive rather than a RAID1 pair.
No, no, there is a huge difference.
In the single-drive case, the FS is responsible for writing
rubbish to a single block. The result would be that a
block has "strange" data, but *always* the same data.
Here the situation is that the data might be "strange",
but different accesses to the same block of the RAID-1
could potentially return different data.
As a byproduct of this effect, the "check" functionality
is no longer so useful.
> If the filesystem changes the page (or allows it to change) while a write is
> pending, then it cannot know what actual data was written. So it must write
> the block out again before it ever reads it in.
> RAID1 is no different to any other device in this respect.
It is different, as mentioned above.
The FS could, intentionally, change the data during a write,
but later it could expect to always get the same data.
In other words, the FS does not guarantee the "spatial"
consistency of the data (the bytes in a block), but the
"temporal" consistency (successive reads always return
the same data) could be expected. And this happens in
the case of a normal HDD. It does not happen in RAID-1.
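To make the distinction concrete, here is a minimal sketch of the effect, not md code. Plain dicts stand in for disks, the names `racing_write` and `touch_page` are hypothetical, and the model assumes that reads of a RAID-1 block may be served from either mirror.

```python
# Toy model only: plain dicts stand in for disks, and the names
# racing_write / touch_page are hypothetical, not md code.

def touch_page(page):
    """The filesystem modifies the page while a write is still pending."""
    page[0] = (page[0] + 1) % 256


def racing_write(disks, lba, page):
    """Submit `page` to each disk in turn; the page changes in between."""
    for disk in disks:
        disk[lba] = bytes(page)   # what this disk actually stores
        touch_page(page)          # page changes before the next submission


# Single drive: the block may hold "strange" data, but only one copy exists.
single = {}
racing_write([single], 0, bytearray(b"AAAA"))
single_reads = [single[0] for _ in range(4)]

# RAID-1: each mirror stored the page at a different moment.
mirror_a, mirror_b = {}, {}
racing_write([mirror_a, mirror_b], 0, bytearray(b"AAAA"))
# Assumption of this model: reads alternate between the two mirrors.
raid1_reads = [(mirror_a, mirror_b)[i % 2][0] for i in range(4)]

print(len(set(single_reads)))  # distinct answers from the single drive
print(len(set(raid1_reads)))   # distinct answers from the RAID-1
```

In this model the single drive always gives one answer, while the RAID-1 gives two, depending on which mirror happens to serve the read.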
> Possibly, but at what cost?
As I wrote: it is a matter of deciding what is more
important and useful.
> There are two ways that I can imagine to 'solve' this issue.
>
> 1/ always copy the page before writing. This would incur a significant
> overhead, both in the complexity of pre-allocation memory and in the
> delay taken to perform the copy. And it would very rarely be actually
> needed.
Does a copy really solve the issue? Is the copy done
atomically?
The pre-allocation does not seem to me to be a problem,
since it will be done once and for all (at device creation),
and not dynamically.
The copy *might* be an overhead, nevertheless I wonder if it
is really so much of a problem, especially considering that,
after the copy, the MD layer can optimize the transaction
to the HDDs as much as it likes.
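A minimal sketch of the copy-before-write idea, under the same toy assumptions (dicts as mirrors, hypothetical names `copy_then_write` and `touch_page`, nothing taken from md itself). It suggests that, at least in this model, the copy does not need to be atomic for the mirrors to agree: the snapshot may capture a half-updated page, but every mirror receives the same bytes.

```python
# Toy model only: dicts stand in for mirrors; copy_then_write and
# touch_page are hypothetical names, not md functions.

def touch_page(page):
    """The filesystem keeps modifying its page after submitting it."""
    page[0] = (page[0] + 1) % 256


def copy_then_write(mirrors, lba, page):
    """Take one private copy of the page, then write that copy everywhere."""
    snapshot = bytes(page)        # may capture a half-updated page
    for disk in mirrors:
        disk[lba] = snapshot      # every mirror receives the same bytes
        touch_page(page)          # later changes never reach the disks


mirror_a, mirror_b = {}, {}
page = bytearray(b"AAAA")
copy_then_write([mirror_a, mirror_b], 0, page)

print(mirror_a[0] == mirror_b[0])   # do the mirrors agree with each other?
print(bytes(page) == mirror_a[0])   # does the live page still match them?
```

The mirrors end up identical, while the live page has moved on, so the filesystem would still have to write the block again before relying on what is on disk.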
> 2/ Have the filesystem protect the page from changes while it is being
> written. This is quite possible for the filesystem to do (while it
> is impossible for md to do). There could be some performance
I'm really curious to understand what kind of thinking
is behind a design allowing such a situation...
I mean *system* design, not md design.
> cost with memory-mapped pages as they would need to be unmapped,
> but there would be no significant cost for reads, writes, and filesystem
> metadata operations.
> Further, any filesystem that wants to make use of the integrity checks
> that newer drives provide (where the filesystem provides a 'checksum' for
> the block which gets passed all the way down and written to storage, and
> returned on a read) will need to do this anyway. So it is likely that in
> the near future all significant filesystems will provide all the
> guarantees md needs in order to simply do nothing different.
That's good to know.
> So my feeling is that md is doing the best thing already.
I do not think this is an md issue per se; it seems to me,
from the description, that this is an overall design issue.
Normally, also for performance reasons, one approach is
to allocate queue(s) of buffers between two modules (like
FS and MD) and each of the modules has always *exclusive*
access to its own buffer(s), i.e. the buffer(s) it holds
in a certain time frame.
Once a module releases the buffer(s), this/these can no
longer be touched (read or written) by the module itself.
Once the buffer(s) arrive(s) at the other module, it
can do whatever it wants with it/them, and it is sure
it has exclusive access to it/them.
Normally real-time systems use techniques like this to
guarantee consistency *and* performance.
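The buffer hand-over scheme described above could be sketched as follows. This is only an illustration: `OwnedBuffer` and the owner labels "fs" and "md" are hypothetical, and since Python cannot enforce exclusive access, ownership is checked at run time instead.

```python
import queue

# Hypothetical sketch: OwnedBuffer, "fs" and "md" are illustrative names.
# Python cannot enforce exclusivity, so ownership is checked at run time.

class OwnedBuffer:
    def __init__(self, size, owner):
        self._data = bytearray(size)
        self._owner = owner

    def data(self, who):
        """Return the bytes, but only to the current owner."""
        if who != self._owner:
            raise PermissionError(f"{who} does not hold this buffer")
        return self._data

    def release_to(self, who, new_owner):
        """Give up the buffer; `who` loses all access from here on."""
        self.data(who)            # only the owner may release
        self._owner = new_owner


# Buffers are allocated once, up front, and recycled through two queues.
free_q, full_q = queue.Queue(), queue.Queue()
for _ in range(2):
    free_q.put(OwnedBuffer(4, owner="fs"))

# "fs" side: fill a buffer, then hand it over.
buf = free_q.get()
buf.data("fs")[:] = b"AAAA"
buf.release_to("fs", "md")
full_q.put(buf)

# A late change by "fs" is now refused instead of racing with the write.
try:
    buf.data("fs")[0] = ord("B")
    late_write_refused = False
except PermissionError:
    late_write_refused = True

# "md" side: exclusive access while both mirrors are written.
mirror_a, mirror_b = {}, {}
out = full_q.get()
for disk in (mirror_a, mirror_b):
    disk[0] = bytes(out.data("md"))
out.release_to("md", "fs")        # recycle the buffer
free_q.put(out)

print(late_write_refused, mirror_a[0] == mirror_b[0])
```

The pool is filled once and reused, which matches the idea of pre-allocating at device creation rather than dynamically.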
Anyway, thanks for the clarifications,
bye,
--
piergiorgio