From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1Koigk-0001lw-MA
	for qemu-devel@nongnu.org; Sat, 11 Oct 2008 13:55:38 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1Koigi-0001lL-5m
	for qemu-devel@nongnu.org; Sat, 11 Oct 2008 13:55:37 -0400
Received: from [199.232.76.173] (port=59938 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Koigh-0001lH-Sb
	for qemu-devel@nongnu.org; Sat, 11 Oct 2008 13:55:35 -0400
Received: from mx1.redhat.com ([66.187.233.31]:51732)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from )
	id 1Koigh-0006Fe-3G
	for qemu-devel@nongnu.org; Sat, 11 Oct 2008 13:55:35 -0400
Message-ID: <48F0E83E.2000907@redhat.com>
Date: Sat, 11 Oct 2008 13:54:06 -0400
From: Mark Wagner
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
References: <48EE38B9.2050106@codemonkey.ws> <48EF1D55.7060307@redhat.com>
In-Reply-To: <48EF1D55.7060307@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: qemu-devel@nongnu.org
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
>>
>> Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important.  On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest page
> cache is of limited value.  For this use model the current default
> (write back cache) is fine.
>
> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system.  In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture.  I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests.  This is because reads
> are unlikely to be cached and writes will not benefit much from the host
> pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page cache;
> if you have spare memory, give it to your guests.  O_DIRECT is
> practically mandated here; the host page cache does nothing except to
> impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I favor
> not adding O_DSYNC as it will add only marginal value.
>
> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT.  If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback.  Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.
>

I think one of the main things to be considered is the integrity of the
actual system call.  The Linux manpage for open() states the following
about the use of the O_DIRECT flag:

  O_DIRECT (Since Linux 2.6.10)
      Try to minimize cache effects of the I/O to and from this file.
      In general this will degrade performance, but it is useful in
      special situations, such as when applications do their own
      caching.  File I/O is done directly to/from user space buffers.
      The I/O is synchronous, that is, at the completion of a read(2)
      or write(2), data is guaranteed to have been transferred.
      Under Linux 2.4 transfer sizes, and the alignment of user buffer
      and file offset must all be multiples of the logical block size
      of the file system.  Under Linux 2.6 alignment to 512-byte
      boundaries suffices.

If I focus on the sentence "The I/O is synchronous, that is, at the
completion of a read(2) or write(2), data is guaranteed to have been
transferred.", I think there is a bug here.  If I open a file with the
O_DIRECT flag and the host reports back to me that the transfer has
completed when in fact it's still in the host cache, it's a bug, as it
violates the semantics of the open()/write() calls and there is no
guarantee that the data will actually be written.

So I guess the real issue isn't what the default should be (although the
performance team at Red Hat would vote for cache=off); the real issue is
that we need to honor the system call from the guest.  If the file is
opened with O_DIRECT on the guest, then the host needs to honor that and
do the same.

-mark
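
P.S. For anyone who wants to poke at the semantics quoted from the
manpage, here is a minimal sketch in C.  It is an illustration only, not
QEMU code: the enum and function names are made up for this mail, and
mapping "write-through" to O_DSYNC and "cache off" to O_DIRECT is my
reading of the three options Anthony listed ([O_DSYNC, O_DIRECT, and 0]).
The 512-byte alignment is the Linux 2.6 figure from the manpage; other
kernels or filesystems may want more, and some filesystems reject
O_DIRECT outright.

```c
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* assumption: fall back to buffered I/O */
#endif

/* Hypothetical names for the three cache modes discussed in the thread. */
enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_OFF };

/* Extra open(2) flag implied by each mode (assumed mapping, see above). */
int cache_mode_open_flags(enum cache_mode mode)
{
    switch (mode) {
    case CACHE_WRITETHROUGH: return O_DSYNC;
    case CACHE_OFF:          return O_DIRECT;
    default:                 return 0;      /* plain write-back */
    }
}

/* 1 if buffer address, length and file offset are multiples of align. */
int direct_io_aligned(const void *buf, size_t len, off_t offset, size_t align)
{
    return align != 0 &&
           (uintptr_t)buf % align == 0 &&
           len % align == 0 &&
           (uint64_t)offset % align == 0;
}

/* Write one zeroed 512-byte sector at offset with the cache bypassed.
 * Returns 0 on success, -1 on any failure. */
int write_sector_direct(const char *path, off_t offset)
{
    enum { SECTOR = 512 };
    void *buf = NULL;
    int rc = -1;

    /* O_DIRECT needs an aligned user buffer; malloc() does not promise one. */
    if (posix_memalign(&buf, SECTOR, SECTOR) != 0)
        return -1;
    memset(buf, 0, SECTOR);

    int fd = open(path, O_WRONLY | O_CREAT | cache_mode_open_flags(CACHE_OFF),
                  0644);
    if (fd >= 0) {
        if (direct_io_aligned(buf, SECTOR, offset, SECTOR) &&
            pwrite(fd, buf, SECTOR, offset) == SECTOR)
            rc = 0;
        close(fd);
    }
    free(buf);
    return rc;
}
```

Calling write_sector_direct("disk.img", 0) on a filesystem that supports
O_DIRECT should return 0; an unaligned offset such as 100 makes it return
-1 before any I/O is attempted.  The point for this thread is the flag
mapping: if the guest asked for O_DIRECT, the host side should end up
opening the image with the same flag.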