qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Rusty Russell <rusty@rustcorp.com.au>
To: Jamie Lokier <jamie@shareable.org>
Cc: tytso@mit.edu, kvm@vger.kernel.org,
	"Michael S. Tsirkin" <mst@redhat.com>, Neil Brown <neilb@suse.de>,
	qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
	Jens Axboe <qemu@kernel.dk>,
	hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Wed, 5 May 2010 14:28:41 +0930	[thread overview]
Message-ID: <201005051428.41735.rusty@rustcorp.com.au> (raw)
In-Reply-To: <20100504201705.GA4360@shareable.org>

On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > usual I/O suspects...
> > 
> > It would be nice to have a more fuller API for this, but the reality is
> > that only the flush approach is really workable. Even just strict
> > ordering of requests could only be supported on SCSI, and even there the
> > kernel still lacks proper guarantees on error handling to prevent
> > reordering there.
> 
> There's a few I/O scheduling differences that might be useful:
> 
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>    before a BARRIER.  That might be useful for time-critical WRITEs,
>    and those issued by high I/O priority.

This is only because noone actually wants flushes or barriers, though
I/O people seem to only offer that.  We really want "<these writes> must
occur before <this write>".  That offers maximum choice to the I/O subsystem
and potentially to smart (virtual?) disks.

> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>    only for data belonging to a particular file (e.g. fdatasync with
>    no file size change, even on btrfs if O_DIRECT was used for the
>    writes being committed).  That would entail tagging FLUSHes and
>    WRITEs with a fs-specific identifier (such as inode number), opaque
>    to the scheduler which only checks equality.

This is closer.  In userspace I'd be happy with a "all prior writes to this
struct file before all future writes".  Even if the original guarantees were
stronger (ie. inode basis).  We currently implement transactions using 4 fsync
/msync pairs.

	write_recovery_data(fd);
	fsync(fd);
	msync(mmap);
	write_recovery_header(fd);
	fsync(fd);
	msync(mmap);
	overwrite_with_new_data(fd);
	fsync(fd);
	msync(mmap);
	remove_recovery_header(fd);
	fsync(fd);
	msync(mmap);

Yet we really only need ordering, not guarantees about it actually hitting
disk before returning.

> In other words, FLUSH can be more relaxed than BARRIER inside the
> kernel.  It's ironic that we think of fsync as stronger than
> fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it has
less information about what is required. I'm saying I want to give you as
much information as I can, even if you don't use it yet.

Thanks,
Rusty.

  reply	other threads:[~2010-05-05  4:58 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-02-18 22:22 [Qemu-devel] [PATCH] virtio-spec: document block CMD and FLUSH Michael S. Tsirkin
2010-04-19 21:26 ` [Qemu-devel] " Michael S. Tsirkin
2010-04-28 15:52   ` Michael S. Tsirkin
2010-04-20  1:46 ` [Qemu-devel] " Jamie Lokier
2010-04-20 13:22   ` Paul Brook
2010-04-21 10:39     ` Michael S. Tsirkin
2010-05-04 18:56   ` Christoph Hellwig
2010-05-04 19:01     ` Michael S. Tsirkin
2010-05-04  4:38 ` [Qemu-devel] " Rusty Russell
2010-05-04  6:56   ` Stefan Hajnoczi
2010-05-04  8:34   ` Avi Kivity
2010-05-04  8:41   ` Jens Axboe
2010-05-04 20:17     ` Jamie Lokier
2010-05-05  4:58       ` Rusty Russell [this message]
2010-05-05  6:03         ` Neil Brown
2010-05-06  6:05           ` Rusty Russell
2010-05-06 14:57             ` Jamie Lokier
2010-05-06 15:25         ` Jamie Lokier
2010-05-04 10:05   ` Christoph Hellwig
2010-05-04 20:32   ` Jamie Lokier
2010-05-04 18:54 ` Christoph Hellwig
2010-05-04 18:56   ` Michael S. Tsirkin
2010-05-04 18:58     ` Michael S. Tsirkin
2010-05-05  5:00   ` Rusty Russell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201005051428.41735.rusty@rustcorp.com.au \
    --to=rusty@rustcorp.com.au \
    --cc=hch@lst.de \
    --cc=jamie@shareable.org \
    --cc=kvm@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=neilb@suse.de \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu@kernel.dk \
    --cc=tytso@mit.edu \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).