Message-ID: <48F10DFD.40505@codemonkey.ws>
Date: Sat, 11 Oct 2008 15:35:09 -0500
From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
References: <48EE38B9.2050106@codemonkey.ws> <48EF1D55.7060307@redhat.com> <48F0E83E.2000907@redhat.com>
In-Reply-To: <48F0E83E.2000907@redhat.com>
To: qemu-devel@nongnu.org
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Mark Wagner wrote:
> Avi Kivity wrote:
>
> I think one of the main things to be considered is the integrity of the
> actual system call. The Linux manpage for open() states the following
> about the use of the O_DIRECT flag:
>
> O_DIRECT (Since Linux 2.6.10)
> Try to minimize cache effects of the I/O to and from this file. In
> general this will degrade performance, but it is useful in special
> situations, such as when applications do their own caching. File
> I/O is done directly to/from user space buffers.
> The I/O is synchronous, that is, at the completion of a read(2) or
> write(2), data is guaranteed to have been transferred. Under Linux 2.4
> transfer sizes, and the alignment of user buffer and file offset
> must all be multiples of the logical block size of the file system.
> Under Linux 2.6 alignment to 512-byte boundaries suffices.
>
>
> If I focus on the sentence "The I/O is synchronous, that is, at
> the completion of a read(2) or write(2), data is guaranteed to have
> been transferred. ",

It's extremely important to understand what the guarantee is. The
guarantee is that upon completion of write(), the data will have been
reported as written by the underlying storage subsystem. This does
*not* mean that the data is on disk.

If you have a normal laptop, your disk has a cache. That cache does not
have a battery backup. Under normal operations, the cache is acting in
write-back mode and when you do a write, the disk will report the write
as completed even though it is not actually on disk. If you really care
about the data being on disk, you have to either use a disk with a
battery backed cache (much more expensive) or enable write-through
caching (will significantly reduce performance).

In the case of KVM, even using write-back caching with the host page
cache, we are still honoring the guarantee of O_DIRECT. We just have
another level of caching that happens to be write-back.

> I think there a bug here. If I open a
> file with the O_DIRECT flag and the host reports back to me that
> the transfer has completed when in fact its still in the host cache,
> its a bug as it violates the open()/write() call and there is no
> guarantee that the data will actually be written.

This is very important: O_DIRECT does *not* guarantee that data
actually resides on disk. There are many possible places where it can
be cached (in the storage controller, in the disks themselves, in a
RAID controller).
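
To make the distinction concrete, here is a minimal Python sketch. It is not QEMU code, and the helper name write_and_flush is hypothetical. It separates the point where write() returns from the explicit fsync() call that asks the kernel to push the data down to the storage device. Whether fsync() also flushes a drive's volatile write cache depends on the kernel, the filesystem and the hardware, so treat this as an illustration of the layering rather than a durability guarantee.

```python
import os
import tempfile


def write_and_flush(path: str, data: bytes) -> int:
    """Write data to path, then explicitly request a flush to storage.

    Returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # write() may accept fewer bytes than requested, so loop until done.
        while written < len(data):
            written += os.write(fd, data[written:])
        # At this point write() has returned, but the data may still sit
        # in a cache. fsync() asks the kernel to flush it to the device.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        count = write_and_flush(os.path.join(tmp, "disk.img"), b"x" * 4096)
        print(count)
```

The same layering applies to a guest: a completed write only means that the next layer down has reported completion.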
> So I guess the real issue isn't what the default should be (although
> the performance team at Red Hat would vote for cache=off),

The consensus so far has been that we still want to use the host page
cache, but in write-through mode. This would mean that the guest would
only see a write complete when the host's storage subsystem reports the
write as having completed. This is not the same as cache=off, but I
think it gives the real effect that is desired.

Do you have another argument for using cache=off?

Regards,

Anthony Liguori

> the real
> issue is that we need to honor the system call from the guest. If
> the file is opened with O_DIRECT on the guest, then the host needs
> to honor that and do the same.
>
> -mark
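
As a rough illustration of the write-through idea discussed in the reply above: on Linux, one way for a user-space process to keep using the page cache while making each write wait for the storage stack is the O_DSYNC open flag. The sketch below assumes that approach purely for illustration; it is not a description of QEMU's implementation, and the helper names open_write_through and write_block are hypothetical.

```python
import os
import tempfile


def open_write_through(path: str) -> int:
    """Open path read/write with O_DSYNC so each write is synchronous.

    The page cache is still used, so later reads can be served from memory.
    """
    flags = os.O_RDWR | os.O_CREAT | os.O_DSYNC
    return os.open(path, flags, 0o644)


def write_block(fd: int, offset: int, data: bytes) -> int:
    """Write data at offset and return the number of bytes written.

    With O_DSYNC the call returns only after the storage stack has
    reported the data as written.
    """
    return os.pwrite(fd, data, offset)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = open_write_through(os.path.join(tmp, "disk.img"))
        try:
            write_block(fd, 0, b"a" * 512)
            # The read can be satisfied from the page cache.
            print(len(os.pread(fd, 512, 0)))
        finally:
            os.close(fd)
```

As with O_DIRECT, "reported as written" still depends on what the layers below the host, such as a disk's own cache, choose to report.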