From: Jamie Lokier <jamie@shareable.org>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: tytso@mit.edu, kvm@vger.kernel.org,
"Michael S. Tsirkin" <mst@redhat.com>, Neil Brown <neilb@suse.de>,
qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
Jens Axboe <qemu@kernel.dk>,
hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Thu, 6 May 2010 16:25:02 +0100
Message-ID: <20100506152502.GB28512@shareable.org>
In-Reply-To: <201005051428.41735.rusty@rustcorp.com.au>
Rusty Russell wrote:
> On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > Jens Axboe wrote:
> > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > usual I/O suspects...
> > >
> > > It would be nice to have a more fuller API for this, but the reality is
> > > that only the flush approach is really workable. Even just strict
> > > ordering of requests could only be supported on SCSI, and even there the
> > > kernel still lacks proper guarantees on error handling to prevent
> > > reordering there.
> >
> > There's a few I/O scheduling differences that might be useful:
> >
> > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > before a BARRIER. That might be useful for time-critical WRITEs,
> > and those issued by high I/O priority.
>
> This is only because noone actually wants flushes or barriers, though
> I/O people seem to only offer that. We really want "<these writes> must
> occur before <this write>". That offers maximum choice to the I/O subsystem
> and potentially to smart (virtual?) disks.
We do want flushes for the "D" in ACID. For example, when we receive a
mail or a blog update, store it in a database file (could be TDB), and
confirm that to the sender, we want high confidence that the update
won't disappear on system crash or power failure.

Less obviously, it's also needed for the "C" in ACID when more than
one file is involved. "C" is about differently updated things staying
consistent with each other.

For example, imagine you have a TDB file mapping Samba usernames to
passwords, and another mapping Samba usernames to local usernames. (I
don't know if you do this; it's just an illustration.)

Renaming a Samba user involves updating both. Let's ignore transient
transactional issues :-) and just think about what happens with
per-file barriers and no sync, when a crash happens long after the
updates, but before the system has written out all data and issued
low-level cache flushes.

After restarting, due to the lack of sync, the Samba username could be
present in one file and not the other.
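
A minimal sketch of the two-file case, assuming plain append-only files
stand in for the two TDB databases. The names append_durably() and
record_rename() are hypothetical and are not part of Samba or TDB. The
point is only that success is reported after fsync() has returned for
both files. It does not address the transient window between the two
updates, nor the durability of a newly created directory entry.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append len bytes to path and fsync() before returning.
 * Returns 0 on success, -1 on any error. */
int append_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry the write */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Without this, the data may sit in the page cache indefinitely. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical: record the same rename in both mapping files and
 * report success only once both are synced. */
int record_rename(const char *passwd_db, const char *map_db,
                  const char *record)
{
    size_t len = strlen(record);

    if (append_durably(passwd_db, record, len) != 0)
        return -1;
    if (append_durably(map_db, record, len) != 0)
        return -1;
    return 0;
}
```

A caller would confirm the rename to the user only when
record_rename() returns 0.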
> > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > only for data belonging to a particular file (e.g. fdatasync with
> > no file size change, even on btrfs if O_DIRECT was used for the
> > writes being committed). That would entail tagging FLUSHes and
> > WRITEs with a fs-specific identifier (such as inode number), opaque
> > to the scheduler which only checks equality.
>
> This is closer. In userspace I'd be happy with a "all prior writes to this
> struct file before all future writes". Even if the original guarantees were
> stronger (ie. inode basis). We currently implement transactions using 4 fsync
> /msync pairs.
>
> write_recovery_data(fd);
> fsync(fd);
> msync(mmap);
> write_recovery_header(fd);
> fsync(fd);
> msync(mmap);
> overwrite_with_new_data(fd);
> fsync(fd);
> msync(mmap);
> remove_recovery_header(fd);
> fsync(fd);
> msync(mmap);
>
> Yet we really only need ordering, not guarantees about it actually hitting
> disk before returning.
>
> > In other words, FLUSH can be more relaxed than BARRIER inside the
> > kernel. It's ironic that we think of fsync as stronger than
> > fbarrier outside the kernel :-)
>
> It's an implementation detail; barrier has less flexibility because it has
> less information about what is required. I'm saying I want to give you as
> much information as I can, even if you don't use it yet.
I agree, and I've started a few threads about it over the last couple
of years.

An fsync_range() system call would be very easy to use and
(most importantly) easy to understand.

With optional flags to weaken it (into fdatasync, barrier without sync,
sync without barrier, one-sided barrier, no low-level cache flush,
don't rush, etc.), it would be very versatile, and still easy to
understand.

With an AIO version, and another flag meaning don't rush, just return
when satisfied, I suspect it would be useful for even the most
demanding I/O apps.
-- Jamie