From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1KaaWA-0001sR-Fy for qemu-devel@nongnu.org;
	Tue, 02 Sep 2008 14:22:18 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1KaaW9-0001s7-Tt for qemu-devel@nongnu.org;
	Tue, 02 Sep 2008 14:22:18 -0400
Received: from [199.232.76.173] (port=47769 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1KaaW9-0001rz-IN
	for qemu-devel@nongnu.org; Tue, 02 Sep 2008 14:22:17 -0400
Received: from mail2.shareable.org ([80.68.89.115]:35643)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from ) id 1KaaW9-0006A7-A8
	for qemu-devel@nongnu.org; Tue, 02 Sep 2008 14:22:17 -0400
Received: from jamie by mail2.shareable.org with local (Exim 4.63)
	(envelope-from ) id 1KaaW7-00046H-Ek
	for qemu-devel@nongnu.org; Tue, 02 Sep 2008 19:22:15 +0100
Date: Tue, 2 Sep 2008 19:22:15 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] bdrv_aio_flush
Message-ID: <20080902182215.GA15737@shareable.org>
References: <20080829133736.GH24884@duo.random>
	<18619.53638.814196.657141@mariner.uk.xensource.com>
	<20080901132502.GA15198@shareable.org>
	<18621.6542.901899.81559@mariner.uk.xensource.com>
	<20080902142807.GI20055@kernel.dk>
	<18621.28502.567886.718843@mariner.uk.xensource.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <18621.28502.567886.718843@mariner.uk.xensource.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: qemu-devel@nongnu.org

Ian Jackson wrote:
> Jens Axboe writes ("Re: [Qemu-devel] [PATCH] bdrv_aio_flush"):
> > On Tue, Sep 02 2008, Ian Jackson wrote:
> > > This is still not perfect because we unnecessarily flush some data,
> > > thus delaying reporting completion of the WRITE FUA.
> > > But there is at least no need to wait for _other_ writes to
> > > complete.
> >
> > I don't see how the above works.  There's no dependency between FUA
> > and non-FUA writes; in fact FUA writes tend to jump the device queue,
> > because certain other operating systems use them for conditions where
> > that is appropriate.  So unless you do all writes using FUA, there's
> > no way around a flush for committing dirty data.  Unfortunately we
> > don't have a FLUSH_RANGE command; flush is just a big sledgehammer.
>
> Yes, certainly you do aio_sync _some_ data that doesn't need it.
> Without an O_FSYNC flag on aio_write that's almost inevitable.

Btw, in principle for FUA writes you can set O_SYNC or O_DSYNC on the
file descriptor just for this operation: either using fcntl() (but I'm
not sure I believe that would be portable and really work), or using
two file descriptors.

> But if bdrv_aio_fsync also does a flush first, then you're going to
> sync _even more_ unnecessarily: the difference between `bdrv_aio_fsync
> does flush first' and `bdrv_aio_fsync does not flush' only affects
> writes that are queued but not completed when bdrv_aio_fsync is
> called.
>
> That is, non-FUA writes which were submitted after the FUA write.
> There is no need to fsync these, and that's what I think qemu should
> do.

I agree; that's a clever reason to make bdrv_aio_fsync() guarantee less
rather than more.  (Who knows, that might be the reason SuS doesn't
offer a stronger guarantee either, although I doubt it - if that were a
serious concern they might have defined a more selective sync instead.)

It would be interesting to see whether aio_fsync(O_DSYNC) would be
slower or faster than fdatasync() on a range of hosts - just in case
the former syncs previously submitted AIOs and the latter doesn't.

Btw, on Linux aio_fsync(O_DSYNC) does the equivalent of fsync(), not
fdatasync().  This is because Glibc defines O_DSYNC to be the same as
O_SYNC.  To get fdatasync() behaviour, you have to use the Linux-AIO
API and IOCB_CMD_FDSYNC.
> Andrea was making some comments about scsi and virtio.  It's possible
> that these have different intended semantics, and perhaps those device
> models (in hw/*) need to call flush explicitly before sync.

Or perhaps they would benefit from an async equivalent, so they don't
have to pause and can queue more requests?

-- 
Jamie