From: Jamie Lokier <jamie@shareable.org>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: tytso@mit.edu, kvm@vger.kernel.org,
"Michael S. Tsirkin" <mst@redhat.com>, Neil Brown <neilb@suse.de>,
qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
Jens Axboe <qemu@kernel.dk>,
hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Thu, 6 May 2010 15:57:45 +0100
Message-ID: <20100506145745.GA28512@shareable.org>
In-Reply-To: <201005061535.20619.rusty@rustcorp.com.au>
Rusty Russell wrote:
> > Seems over-zealous.
> > If the recovery_header held a strong checksum of the recovery_data you would
> > not need the first fsync, and as long as you have two places to write recovery
> > data, you don't need the 3rd and 4th syncs.
> > Just:
> > write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> > fsync / msync
> > overwrite_with_new_data()
> >
> > To recovery you choose the most recent log_space and replay the content.
> > That may be a redundant operation, but that is no loss.
>
> I think you missed a checksum for the new data? Otherwise we can't tell if
> the new data is completely written.
The data checksum can go in the recovery-data block. If there's
enough slack in the log, by the time that recovery-data block is
overwritten, you can be sure that an fsync has been done for that
data (by a later commit).
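
A rough sketch of what I mean, assuming the log lives in the same file as the data so that a later commit's fsync covers earlier data writes. The record layout and the names `pack_record`, `unpack_record`, `commit` and `data_is_complete` are made up for illustration, and CRC32 only stands in for whatever strong checksum you would really use:

```python
import os
import struct
import zlib

# Hypothetical header: sequence number, data offset, length,
# CRC32 of the old (recovery) data, CRC32 of the new data.
HEADER = struct.Struct("<QQIII")


def pack_record(seq, offset, old, new):
    """Build one self-checking record: header + old data + CRC of both."""
    assert len(old) == len(new)
    body = HEADER.pack(seq, offset, len(old),
                       zlib.crc32(old), zlib.crc32(new)) + old
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_record(buf):
    """Return (seq, offset, old, new_crc), or None if the record is torn."""
    if len(buf) < HEADER.size + 4:
        return None
    seq, offset, length, old_crc, new_crc = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end + 4:
        return None
    body = buf[:end]
    (crc,) = struct.unpack_from("<I", buf, end)
    old = body[HEADER.size:]
    if zlib.crc32(body) != crc or zlib.crc32(old) != old_crc:
        return None
    return seq, offset, old, new_crc


def commit(fd, log_pos, seq, offset, new):
    """Log old data plus both checksums, fsync once, then overwrite in place."""
    old = os.pread(fd, len(new), offset)
    os.pwrite(fd, pack_record(seq, offset, old, new), log_pos)
    os.fsync(fd)                 # the only sync in this commit
    os.pwrite(fd, new, offset)   # covered by the next commit's fsync


def data_is_complete(fd, record):
    """On recovery: does the in-place data match the logged new-data CRC?"""
    _seq, offset, old, new_crc = record
    return zlib.crc32(os.pread(fd, len(old), offset)) == new_crc
```

On recovery you would read the most recent valid record and, if `data_is_complete()` says no, put the old data back. The sketch does not manage log slack; it assumes the caller never reuses a log slot before a later commit has synced the corresponding data.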
> But yes, I will steal this scheme for TDB2, thanks!
Take a look at the filesystems. I think ext4 did some optimisations
in this area, and that checksums had to be added anyway due to a
subtle replay-corruption problem that happens when the log is
partially corrupted, and followed by non-corrupt blocks.
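
To illustrate the failure mode (this is not ext4's journal format, just a toy log with hypothetical `encode` and `replay` helpers): if replay skips a damaged record and carries on with the intact ones after it, you apply later updates without the earlier one they depend on. With a per-record checksum, replay can stop at the first record that fails:

```python
import struct
import zlib

PREFIX = struct.Struct("<QI")  # hypothetical: sequence number, payload length


def encode(seq, payload):
    """One log record: prefix + payload + CRC32 over both."""
    body = PREFIX.pack(seq, len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def replay(log):
    """Return the valid records in order, stopping at the first bad one."""
    out, pos, expected = [], 0, None
    while pos + PREFIX.size + 4 <= len(log):
        seq, length = PREFIX.unpack_from(log, pos)
        end = pos + PREFIX.size + length
        if end + 4 > len(log):
            break                      # truncated record
        (crc,) = struct.unpack_from("<I", log, end)
        if zlib.crc32(log[pos:end]) != crc:
            break                      # corrupt: do not skip ahead
        if expected is not None and seq != expected:
            break                      # stale record from an older pass
        out.append((seq, bytes(log[pos + PREFIX.size:end])))
        expected = seq + 1
        pos = end + 4
    return out
```

With three records where the second is damaged, `replay()` returns only the first, even though the third still checks out on its own.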
Also, you can remove even more fsyncs by adding a bit of slack to the
data space and writing into unused/fresh areas some of the time -
i.e. a bit like btrfs/zfs or anything log-structured, but you don't
have to go all the way with that.
> In practice, it's the first sync which is glacial, the rest are pretty cheap.
The 3rd and 4th fsyncs imply a disk seek each, just because the
preceding writes are to different areas of the disk. Seeks are quite
slow - but not as slow as ext3 fsyncs :-) What do you mean by cheap?
That it's only a couple of seeks, or that you don't see even that?
>
> > Also cannot see the point of msync if you have already performed an fsync,
> > and if there is a point, I would expect you to call msync before
> > fsync... Maybe there is some subtlety there that I am not aware of.
>
> I assume it's this from the msync man page:
>
> msync() flushes changes made to the in-core copy of a file that was
> mapped into memory using mmap(2) back to disk. Without use of this
> call there is no guarantee that changes are written back before
> munmap(2) is called.
Historically, that means msync() ensures dirty mapping data is written
to the file as if with write(), and that mapping pages are removed or
refreshed to get the effect of read() (possibly a lazy one). It's
more obvious in the early mmap implementations where mappings don't
share pages with the filesystem cache, so msync() has explicit
behaviour.
Like with write(), after calling msync() you would then call fsync()
to ensure the data is flushed to disk.
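
In other words, the order is msync first, then fsync. A small sketch, assuming a Unix system where Python's `mmap.flush()` is implemented with msync(); the helper name `update_mapped` is invented for the example:

```python
import mmap
import os


def update_mapped(path, offset, data):
    """Modify a file through a shared mapping, then msync, then fsync."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0) as m:        # map the whole (non-empty) file
            m[offset:offset + len(data)] = data
            m.flush()                      # msync: mapping -> file, like write()
            os.fsync(fd)                   # fsync: file -> disk
    finally:
        os.close(fd)
```

Whether the fsync then reaches the platter is a separate question, as noted below about hardware-level cache commits.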
If you've been calling fsync then msync, I guess that's another fine
example of how these functions are so hard to test, that they aren't.
Historically on Linux, msync has been iffy on some architectures, and
I'm still not sure it has the same semantics as other unixes. fsync
as we know has also been iffy, and even now that fsync is tidier it
does not always issue a hardware-level cache commit.
But then historically writable mmap has been iffy on a boatload of
unixes.
> > > It's an implementation detail; barrier has less flexibility because it has
> > > less information about what is required. I'm saying I want to give you as
> > > much information as I can, even if you don't use it yet.
> >
> > Only we know that approach doesn't work.
> > People will learn that they don't need to give the extra information to still
> > achieve the same result - just like they did with ext3 and fsync.
> > Then when we improve the implementation to only provide the guarantees that
> > you asked for, people will complain that they are getting empty files that
> > they didn't expect.
>
> I think that's an oversimplification: IIUC that occurred to people *not*
> using fsync(). They weren't using it because it was too slow. Providing
> a primitive which is as fast or faster and more specific doesn't have the
> same magnitude of social issues.
I agree with Rusty. Let's make it perform well so there is no reason
to deliberately avoid using it, and let's make it say what apps actually
want to request without being way too strong.
And please, if anyone has ideas on how we could make correct use of
these functions *testable* by app authors, I'm all ears. Right now it
is quite difficult - pulling power on hard disks mid-transaction is
not a convenient method :)
> > The abstraction I would like to see is a simple 'barrier' that contains no
> > data and has a filesystem-wide effect.
>
> I think you lack ambition ;)
>
> Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
> suboptimal for md? Since you have to hand your barrier to every device
> whereas a file-wide primitive may theoretically only go to some.
Yes.
Note that database-like programs still need fsync-like behaviour
*sometimes*: The "D" in ACID depends on it, and the "C" in ACID also
depends on it where multiple files are involved which must contain
consistent data with each other after crash/recovery (Perhaps Samba
depends on this?)
Single-file sync is valuable just like single-file barrier, and so is
the combination.
Since you mentioned ambition, think about multi-file updates. They're
analogous in userspace to MD's barrier/sync requirements in
kernelspace.
One API that supports multi-file update barriers is "long aio-fsync":
Something which returns when the data in earlier writes (to one file)
is committed, but does not force the commit to happen more quickly
than normal. Both single-file barriers (like you want for TDB) and
multi-file barriers can be implemented on top of that, but it's much
more difficult to use than an fbarrier() syscall, which is only
suitable for single-file. But I wonder if there would be many users
of fbarrier() who aren't perfectly capable of using something else if
needed.
-- Jamie