From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1KaaWA-0001sR-Fy for qemu-devel@nongnu.org;
	Tue, 02 Sep 2008 14:22:18 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1KaaW9-0001s7-Tt for qemu-devel@nongnu.org;
	Tue, 02 Sep 2008 14:22:18 -0400
Received: from [199.232.76.173] (port=47769 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1KaaW9-0001rz-IN
	for qemu-devel@nongnu.org; Tue, 02 Sep 2008 14:22:17 -0400
Received: from mail2.shareable.org ([80.68.89.115]:35643)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from ) id 1KaaW9-0006A7-A8
	for qemu-devel@nongnu.org; Tue, 02 Sep 2008 14:22:17 -0400
Received: from jamie by mail2.shareable.org with local (Exim 4.63)
	(envelope-from ) id 1KaaW7-00046H-Ek
	for qemu-devel@nongnu.org; Tue, 02 Sep 2008 19:22:15 +0100
Date: Tue, 2 Sep 2008 19:22:15 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] bdrv_aio_flush
Message-ID: <20080902182215.GA15737@shareable.org>
References: <20080829133736.GH24884@duo.random>
	<18619.53638.814196.657141@mariner.uk.xensource.com>
	<20080901132502.GA15198@shareable.org>
	<18621.6542.901899.81559@mariner.uk.xensource.com>
	<20080902142807.GI20055@kernel.dk>
	<18621.28502.567886.718843@mariner.uk.xensource.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <18621.28502.567886.718843@mariner.uk.xensource.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: qemu-devel@nongnu.org

Ian Jackson wrote:
> Jens Axboe writes ("Re: [Qemu-devel] [PATCH] bdrv_aio_flush"):
> > On Tue, Sep 02 2008, Ian Jackson wrote:
> > > This is still not perfect because we unnecessarily flush some data,
> > > thus delaying reporting completion of the WRITE FUA.
> > > But there is at least no need to wait for _other_ writes to
> > > complete.
> >
> > I don't see how the above works.  There's no dependency between FUA
> > and non-FUA writes; in fact FUA writes tend to jump the device queue,
> > because certain other operating systems use them for conditions where
> > that is appropriate.  So unless you do all writes using FUA, there's
> > no way around a flush for committing dirty data.  Unfortunately we
> > don't have a FLUSH_RANGE command; flush is just a big sledgehammer.
>
> Yes, certainly you do aio_sync _some_ data that doesn't need it.
> Without an O_FSYNC flag on aio_write that's almost inevitable.

Btw, in principle for FUA writes you can set O_SYNC or O_DSYNC on the
file descriptor just for this operation: either using fcntl() (but I'm
not sure I believe that would be portable and really work), or using
two file descriptors.

> But if bdrv_aio_fsync also does a flush first, then you're going to
> sync _even more_ unnecessarily: the difference between `bdrv_aio_fsync
> does flush first' and `bdrv_aio_fsync does not flush' only affects
> writes that are queued but not completed when bdrv_aio_fsync is
> called.
>
> That is, non-FUA writes which were submitted after the FUA write.
> There is no need to fsync these, and that's what I think qemu should
> do.

I agree; that's a clever reason to make bdrv_aio_fsync() guarantee less
rather than more.  (Who knows, that might be the reason SuS doesn't
offer a stronger guarantee either, although I doubt it - if that were a
serious concern they might have defined a more selective sync instead.)

It would be interesting to see whether aio_fsync(O_DSYNC) would be
slower or faster than fdatasync() on a range of hosts - just in case
the former syncs previously submitted AIOs and the latter doesn't.

Btw, on Linux aio_fsync(O_DSYNC) does the equivalent of fsync(), not
fdatasync().  This is because Glibc defines O_DSYNC to be the same as
O_SYNC.  To get fdatasync() behaviour, you have to use the Linux-AIO
API and IOCB_CMD_FDSYNC.
> Andrea was making some comments about scsi and virtio.  It's possible
> that these have different intended semantics, and perhaps those device
> models (in hw/*) need to call flush explicitly before sync.

Or perhaps they would benefit from an async equivalent, so they don't
have to pause and can queue more requests?

-- 
Jamie