From: Neil Brown <neilb@suse.de>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: tytso@mit.edu, kvm@vger.kernel.org,
"Michael S. Tsirkin" <mst@redhat.com>,
qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
Jens Axboe <qemu@kernel.dk>,
hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Wed, 5 May 2010 16:03:43 +1000 [thread overview]
Message-ID: <20100505160343.264fd015@notabene.brown> (raw)
In-Reply-To: <201005051428.41735.rusty@rustcorp.com.au>
On Wed, 5 May 2010 14:28:41 +0930
Rusty Russell <rusty@rustcorp.com.au> wrote:
> On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > Jens Axboe wrote:
> > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > usual I/O suspects...
> > >
> > > It would be nice to have a more fuller API for this, but the reality is
> > > that only the flush approach is really workable. Even just strict
> > > ordering of requests could only be supported on SCSI, and even there the
> > > kernel still lacks proper guarantees on error handling to prevent
> > > reordering there.
> >
> > There's a few I/O scheduling differences that might be useful:
> >
> > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > before a BARRIER. That might be useful for time-critical WRITEs,
> > and those issued by high I/O priority.
>
> This is only because noone actually wants flushes or barriers, though
> I/O people seem to only offer that. We really want "<these writes> must
> occur before <this write>". That offers maximum choice to the I/O subsystem
> and potentially to smart (virtual?) disks.
>
> > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > only for data belonging to a particular file (e.g. fdatasync with
> > no file size change, even on btrfs if O_DIRECT was used for the
> > writes being committed). That would entail tagging FLUSHes and
> > WRITEs with a fs-specific identifier (such as inode number), opaque
> > to the scheduler which only checks equality.
>
> This is closer. In userspace I'd be happy with a "all prior writes to this
> struct file before all future writes". Even if the original guarantees were
> stronger (ie. inode basis). We currently implement transactions using 4 fsync
> /msync pairs.
>
> write_recovery_data(fd);
> fsync(fd);
> msync(mmap);
> write_recovery_header(fd);
> fsync(fd);
> msync(mmap);
> overwrite_with_new_data(fd);
> fsync(fd);
> msync(mmap);
> remove_recovery_header(fd);
> fsync(fd);
> msync(mmap);
Seems over-zealous.
If the recovery_header held a strong checksum of the recovery_data you would
not need the first fsync, and as long as you have two places to write recovery
data, you don't need the 3rd and 4th syncs.
Just:
write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
fsync / msync
overwrite_with_new_data()
To recovery you choose the most recent log_space and replay the content.
That may be a redundant operation, but that is no loss.
Also cannot see the point of msync if you have already performed an fsync,
and if there is a point, I would expect you to call msync before
fsync... Maybe there is some subtlety there that I am not aware of.
>
> Yet we really only need ordering, not guarantees about it actually hitting
> disk before returning.
>
> > In other words, FLUSH can be more relaxed than BARRIER inside the
> > kernel. It's ironic that we think of fsync as stronger than
> > fbarrier outside the kernel :-)
>
> It's an implementation detail; barrier has less flexibility because it has
> less information about what is required. I'm saying I want to give you as
> much information as I can, even if you don't use it yet.
Only we know that approach doesn't work.
People will learn that they don't need to give the extra information to still
achieve the same result - just like they did with ext3 and fsync.
Then when we improve the implementation to only provide the guarantees that
you asked for, people will complain that they are getting empty files that
they didn't expect.
The abstraction I would like to see is a simple 'barrier' that contains no
data and has a filesystem-wide effect.
If a filesystem wanted a 'full' barrier such as the current BIO_RW_BARRER,
it would send an empty barrier, then the data, then another empty barrier.
(However I suspect most filesystems don't really need barriers on both sides.)
A low level driver might merge these together if the underlying hardware
supported that combined operation (which I believe some do).
I think this merging would be less complex that the current need to split a
BIO_RW_BARRIER in to the three separate operations when only a flush is
possible (I know it would make md code a lot nicer :-).
I would probably expose this to user-space as extra flags to sync_file_range:
SYNC_FILE_RANGE_BARRIER_BEFORE
SYNC_FILE_RANGE_BARRIER_AFTER
This would make it clear that a barrier does *not* imply a sync, it only
applies to data for which a sync has already been requested. So data that has
already been 'synced' is stored strictly before data which has not yet been
submitted with write() (or by changing a mmapped area).
The barrier would still be filesystem wide in that if you
SYNC_FILE_WRITE_WRITE one file, then SYNC_FILE_RANGE_BARRIER_BEFORE another
file on the same filesystem, the pages scheduled in the first file would be
affect by the barrier request on the second file.
Implementing this would probably require a new address_space_operation so
that the filesystem would have a chance to ensure all necessary writes were
queued before issuing the barrier.
NeilBrown
next prev parent reply other threads:[~2010-05-05 6:04 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-02-18 22:22 [Qemu-devel] [PATCH] virtio-spec: document block CMD and FLUSH Michael S. Tsirkin
2010-04-19 21:26 ` [Qemu-devel] " Michael S. Tsirkin
2010-04-28 15:52 ` Michael S. Tsirkin
2010-04-20 1:46 ` [Qemu-devel] " Jamie Lokier
2010-04-20 13:22 ` Paul Brook
2010-04-21 10:39 ` Michael S. Tsirkin
2010-05-04 18:56 ` Christoph Hellwig
2010-05-04 19:01 ` Michael S. Tsirkin
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
2010-05-04 6:56 ` Stefan Hajnoczi
2010-05-04 8:34 ` Avi Kivity
2010-05-04 8:41 ` Jens Axboe
2010-05-04 20:17 ` Jamie Lokier
2010-05-05 4:58 ` Rusty Russell
2010-05-05 6:03 ` Neil Brown [this message]
2010-05-06 6:05 ` Rusty Russell
2010-05-06 14:57 ` Jamie Lokier
2010-05-06 15:25 ` Jamie Lokier
2010-05-04 10:05 ` Christoph Hellwig
2010-05-04 20:32 ` Jamie Lokier
2010-05-04 18:54 ` Christoph Hellwig
2010-05-04 18:56 ` Michael S. Tsirkin
2010-05-04 18:58 ` Michael S. Tsirkin
2010-05-05 5:00 ` Rusty Russell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100505160343.264fd015@notabene.brown \
--to=neilb@suse.de \
--cc=hch@lst.de \
--cc=kvm@vger.kernel.org \
--cc=mst@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu@kernel.dk \
--cc=rusty@rustcorp.com.au \
--cc=tytso@mit.edu \
--cc=virtualization@lists.linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).