Date: Mon, 1 Sep 2008 14:25:21 +0200
From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [PATCH] bdrv_aio_flush
Message-ID: <20080901122521.GG25764@duo.random>
References: <20080829133736.GH24884@duo.random> <18619.53638.814196.657141@mariner.uk.xensource.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <18619.53638.814196.657141@mariner.uk.xensource.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org

On Mon, Sep 01, 2008 at 12:27:02PM +0100, Ian Jackson wrote:
> I think this is fine.  We discussed this some time ago.  bdrv_flush

You mean fsync is just fine, or you mean replacing fsync with
aio_fsync is fine and needed? ;)

> guarantees that _already completed_ IO operations are flushed.  It
> does not guarantee that in flight AIO operations are completed and
> then flushed to disk.

In case you meant fsync is just fine: Linux will use
WIN_FLUSH_CACHE/WIN_FLUSH_CACHE_EXT, see idedisk_prepare_flush:

	if (barrier) {
		ordered = QUEUE_ORDERED_DRAIN_FLUSH;
		prep_fn = idedisk_prepare_flush;
	}

so if we don't want guest journaling to break with scsi/virtio, we
have to make sure the AIO is committed to disk before the flush
returns. To be clear: this is only a problem if there's a power outage
in the host.

> [..] I can't see any reason to think that the
> `write cache' which is referred to by the spec is regarded as
> containing data which has not yet been DMAd from the host to the disk
> because the command which does that transfer is not yet complete.

I'm not sure I follow. IDE is safe because it submits one command at a
time and we don't simulate a dirty write cache. So by the time
bdrv_flush is called, the previous aio_write has already completed,
and in turn the dirty data is already visible to the kernel, which
will write it to disk with fsync.

But anything a bit more clever than IDE that allows the guest to
submit a barrier in a TCQ way, like scsi or virtio, will break the
guest journaling if fsync is used. By the time the flush operation
returns, all previous data must be written to disk. Or at least the
flush operation should return in order, so anything after the barrier
operation should be written after the previous stuff.

And fsync can't guarantee that, because it'll return immediately even
if the aio queue is huge, and after the aio queue is flushed to the
kernel writeback cache, the kernel is free to write the writeback
cache in whatever order it wants (in Linux it'll try to write it in
dirty-inode order first, and then in logical order according to the
offset of the dirty data into the inode, looking up the inode radix
tree).
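
To make the ordering argument above concrete, here is a minimal sketch
in C using plain POSIX AIO. It is not the code from the patch: the
helper names write_then_barrier and wait_for_aio are hypothetical, and
it assumes a libc whose aio_fsync covers the requests already queued
on the descriptor when it is called, which is what POSIX describes. On
older glibc it may need -lrt to link.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block until one AIO request has finished.
 * Returns 0 on success, otherwise the request's error status. */
static int wait_for_aio(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };
    int err;

    while ((err = aio_error(cb)) == EINPROGRESS)
        aio_suspend(list, 1, NULL);   /* an EINTR wakeup just retries */
    return err;
}

/* Hypothetical helper: queue a write, then queue a flush behind it.
 * A plain fsync(fd) at this point could return before the write has
 * even been handed to the kernel; aio_fsync is queued after it.
 * Returns 0 only if the write and the flush both succeeded. */
int write_then_barrier(int fd, const void *buf, size_t len, off_t off)
{
    struct aiocb wr, fl;
    int wr_err, fl_err;
    ssize_t written;

    memset(&wr, 0, sizeof wr);
    wr.aio_fildes = fd;
    wr.aio_buf = (void *)buf;
    wr.aio_nbytes = len;
    wr.aio_offset = off;
    if (aio_write(&wr) < 0)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &fl) < 0) {
        /* Flush could not be queued: still reap the pending write. */
        wait_for_aio(&wr);
        aio_return(&wr);
        return -1;
    }

    fl_err = wait_for_aio(&fl);   /* the "barrier" completes here */
    wr_err = wait_for_aio(&wr);   /* expected to be finished already */
    written = aio_return(&wr);
    aio_return(&fl);

    if (fl_err != 0 || wr_err != 0 || written != (ssize_t)len)
        return -1;
    return 0;
}
```

A caller would open a file, pass a buffer to write_then_barrier, and
treat a return of 0 as "the queued write reached stable storage before
the flush was reported complete". Waiting on the write again after the
flush is only a defensive check in this sketch, not something the
thread above prescribes.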