From: Rusty Russell <rusty@rustcorp.com.au>
To: Jamie Lokier <jamie@shareable.org>
Cc: Jens Axboe <qemu@kernel.dk>,
	tytso@mit.edu, kvm@vger.kernel.org,
	"Michael S. Tsirkin" <mst@redhat.com>, Neil Brown <neilb@suse.de>,
	qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
	hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Wed, 5 May 2010 14:28:41 +0930
Message-ID: <201005051428.41735.rusty@rustcorp.com.au>
In-Reply-To: <20100504201705.GA4360@shareable.org>

On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > usual I/O suspects...
> > 
> > It would be nice to have a more fuller API for this, but the reality is
> > that only the flush approach is really workable. Even just strict
> > ordering of requests could only be supported on SCSI, and even there the
> > kernel still lacks proper guarantees on error handling to prevent
> > reordering there.
> 
> There's a few I/O scheduling differences that might be useful:
> 
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>    before a BARRIER.  That might be useful for time-critical WRITEs,
>    and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that.  We really want "<these writes> must
occur before <this write>".  That offers maximum choice to the I/O subsystem
and potentially to smart (virtual?) disks.

> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>    only for data belonging to a particular file (e.g. fdatasync with
>    no file size change, even on btrfs if O_DIRECT was used for the
>    writes being committed).  That would entail tagging FLUSHes and
>    WRITEs with a fs-specific identifier (such as inode number), opaque
>    to the scheduler which only checks equality.

This is closer.  In userspace I'd be happy with an "all prior writes to this
struct file before all future writes" guarantee, even if the original
guarantees were stronger (i.e. on an inode basis).  We currently implement
transactions using four fsync/msync pairs:

	write_recovery_data(fd);
	fsync(fd);
	msync(mmap);
	write_recovery_header(fd);
	fsync(fd);
	msync(mmap);
	overwrite_with_new_data(fd);
	fsync(fd);
	msync(mmap);
	remove_recovery_header(fd);
	fsync(fd);
	msync(mmap);

Yet we really only need ordering, not guarantees about it actually hitting
disk before returning.
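
For concreteness, here is a minimal sketch of that four-step sequence for
a single fixed-size record.  Everything in it is an assumption made for
illustration: the file layout, the txn_* names and TXN_MAGIC are
hypothetical, and it uses plain pwrite() with fsync() only, leaving out
the msync() half of each pair.  Each fsync() below is there purely to
order one step before the next, which is exactly the cost an
ordering-only primitive would avoid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical on-disk layout, invented for this sketch. */
enum {
	REC_SIZE = 64,			/* size of the one record we protect */
	HDR_OFF  = 0,			/* 8-byte recovery header */
	REC_OFF  = 8,			/* saved copy of the old record */
	DATA_OFF = 8 + REC_SIZE		/* live record */
};
static const uint64_t TXN_MAGIC = 0x74786e2d6c697665ULL;

/* Write the whole buffer at the given offset, then force it to disk. */
static int write_then_sync(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);
		if (n < 0)
			return -1;
		p += n;
		len -= (size_t)n;
		off += n;
	}
	return fsync(fd);
}

/* Replace the live record; each step is synced before the next begins. */
int txn_update(int fd, const char new_data[REC_SIZE])
{
	char old[REC_SIZE];
	uint64_t hdr;
	ssize_t n = pread(fd, old, REC_SIZE, DATA_OFF);

	if (n < 0)
		return -1;
	assert(n <= REC_SIZE);
	memset(old + n, 0, (size_t)(REC_SIZE - n));	/* short read: new file */

	/* 1. write_recovery_data */
	if (write_then_sync(fd, old, REC_SIZE, REC_OFF) < 0)
		return -1;
	/* 2. write_recovery_header */
	hdr = TXN_MAGIC;
	if (write_then_sync(fd, &hdr, sizeof(hdr), HDR_OFF) < 0)
		return -1;
	/* 3. overwrite_with_new_data */
	if (write_then_sync(fd, new_data, REC_SIZE, DATA_OFF) < 0)
		return -1;
	/* 4. remove_recovery_header */
	hdr = 0;
	return write_then_sync(fd, &hdr, sizeof(hdr), HDR_OFF);
}

/* If a transaction was in flight, roll the live record back. */
int txn_recover(int fd)
{
	uint64_t hdr = 0;
	char old[REC_SIZE];
	ssize_t n = pread(fd, &hdr, sizeof(hdr), HDR_OFF);

	if (n < 0)
		return -1;
	if (n != (ssize_t)sizeof(hdr) || hdr != TXN_MAGIC)
		return 0;	/* nothing to recover */
	if (pread(fd, old, REC_SIZE, REC_OFF) != REC_SIZE)
		return -1;
	if (write_then_sync(fd, old, REC_SIZE, DATA_OFF) < 0)
		return -1;
	hdr = 0;
	return write_then_sync(fd, &hdr, sizeof(hdr), HDR_OFF);
}

/* Open (or create) the file and run recovery before handing it out. */
int txn_open(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd >= 0 && txn_recover(fd) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would do fd = txn_open(path) once, then txn_update(fd, buf) per
change.  In this sketch only the sync in step 4 is about durability; the
other three exist solely so that a crash never finds a recovery header
without valid recovery data behind it, or a half-written record without
a header.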

> In other words, FLUSH can be more relaxed than BARRIER inside the
> kernel.  It's ironic that we think of fsync as stronger than
> fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it has
less information about what is required. I'm saying I want to give you as
much information as I can, even if you don't use it yet.

Thanks,
Rusty.

