From: Rusty Russell <rusty@rustcorp.com.au>
To: Jamie Lokier
Cc: tytso@mit.edu, kvm@vger.kernel.org, "Michael S. Tsirkin",
	Neil Brown, qemu-devel@nongnu.org,
	virtualization@lists.linux-foundation.org, Jens Axboe, hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Wed, 5 May 2010 14:28:41 +0930
Message-Id: <201005051428.41735.rusty@rustcorp.com.au>
In-Reply-To: <20100504201705.GA4360@shareable.org>
References: <20100218222220.GA14847@redhat.com>
	<20100504084133.GH27497@kernel.dk>
	<20100504201705.GA4360@shareable.org>

On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > usual I/O suspects...
> >
> > It would be nice to have a more fuller API for this, but the reality is
> > that only the flush approach is really workable.  Even just strict
> > ordering of requests could only be supported on SCSI, and even there the
> > kernel still lacks proper guarantees on error handling to prevent
> > reordering there.
>
> There's a few I/O scheduling differences that might be useful:
>
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
>    before a BARRIER.  That might be useful for time-critical WRITEs,
>    and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that.  We really want "<A> must occur
before <B>".  That offers maximum choice to the I/O subsystem and
potentially to smart (virtual?) disks.

> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
>    only for data belonging to a particular file (e.g. fdatasync with
>    no file size change, even on btrfs if O_DIRECT was used for the
>    writes being committed).  That would entail tagging FLUSHes and
>    WRITEs with a fs-specific identifier (such as inode number), opaque
>    to the scheduler which only checks equality.

This is closer.  In userspace I'd be happy with "all prior writes to
this struct file before all future writes", even if the original
guarantees were stronger (i.e. on an inode basis).  We currently
implement transactions using 4 fsync/msync pairs:

	write_recovery_data(fd);
	fsync(fd);
	msync(mmap);

	write_recovery_header(fd);
	fsync(fd);
	msync(mmap);

	overwrite_with_new_data(fd);
	fsync(fd);
	msync(mmap);

	remove_recovery_header(fd);
	fsync(fd);
	msync(mmap);

Yet we really only need ordering, not guarantees about it actually
hitting disk before returning.
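For concreteness, here's a minimal, self-contained sketch of that
four-phase commit.  The commit()/sync_both() helpers and the
RECOVERY_OFF/DATA_OFF offsets are made up for illustration (they stand
in for write_recovery_data() and friends), and it assumes the file is
updated both through the fd and through a MAP_SHARED mapping.  Note
that every phase pays for a full flush even though all we need is
"phase N is ordered before phase N+1":

	/*
	 * Illustrative only: a 4-phase transactional overwrite using
	 * fsync()+msync() as the ordering primitive.  Offsets and
	 * helper names are hypothetical.
	 */
	#include <stdint.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define RECOVERY_OFF	4096	/* where the undo copy lives */
	#define DATA_OFF	8192	/* where the live data lives */

	/* Force out both the write()-path and the mmap-path data. */
	static int sync_both(int fd, void *map, size_t maplen)
	{
		if (fsync(fd) != 0)
			return -1;
		return msync(map, maplen, MS_SYNC);
	}

	int commit(int fd, void *map, size_t maplen,
		   const void *newdata, size_t datalen)
	{
		uint64_t valid = 1, invalid = 0;

		/* 1. write_recovery_data: save the old contents. */
		if (pwrite(fd, (char *)map + DATA_OFF, datalen,
			   RECOVERY_OFF) != (ssize_t)datalen)
			return -1;
		if (sync_both(fd, map, maplen) != 0)
			return -1;

		/* 2. write_recovery_header: mark the undo copy valid. */
		if (pwrite(fd, &valid, sizeof(valid),
			   RECOVERY_OFF - sizeof(valid)) != (ssize_t)sizeof(valid))
			return -1;
		if (sync_both(fd, map, maplen) != 0)
			return -1;

		/* 3. overwrite_with_new_data: update the live data in place. */
		memcpy((char *)map + DATA_OFF, newdata, datalen);
		if (sync_both(fd, map, maplen) != 0)
			return -1;

		/* 4. remove_recovery_header: the transaction is committed. */
		if (pwrite(fd, &invalid, sizeof(invalid),
			   RECOVERY_OFF - sizeof(invalid)) != (ssize_t)sizeof(invalid))
			return -1;
		return sync_both(fd, map, maplen);
	}

With a hypothetical ordering-only call (an fbarrier()-style primitive),
steps 1-3 could use that instead, and only the final step, if even
that, would need a real flush.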
> In other words, FLUSH can be more relaxed than BARRIER inside the
> kernel.  It's ironic that we think of fsync as stronger than
> fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it
has less information about what is required.  I'm saying I want to give
you as much information as I can, even if you don't use it yet.

Thanks,
Rusty.