From: Anthony Liguori <anthony@codemonkey.ws>
To: qemu-devel@nongnu.org
Cc: Chris Wright <chrisw@redhat.com>,
Mark McLoughlin <markmc@redhat.com>,
Ryan Harper <ryanh@us.ibm.com>,
Laurent Vivier <Laurent.Vivier@bull.net>,
kvm-devel <kvm-devel@lists.sourceforge.net>
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Date: Sat, 11 Oct 2008 15:35:09 -0500
Message-ID: <48F10DFD.40505@codemonkey.ws>
In-Reply-To: <48F0E83E.2000907@redhat.com>
Mark Wagner wrote:
> Avi Kivity wrote:
>
> I think one of the main things to be considered is the integrity of the
> actual system call. The Linux manpage for open() states the following
> about the use of the O_DIRECT flag:
>
> O_DIRECT (Since Linux 2.6.10)
> Try to minimize cache effects of the I/O to and from this file. In
> general this will degrade performance, but it is useful in special
> situations, such as when applications do their own caching. File
> I/O is done directly to/from user space buffers. The I/O is
> synchronous, that is, at the completion of a read(2) or write(2),
> data is guaranteed to have been transferred. Under Linux 2.4
> transfer sizes, and the alignment of user buffer and file offset
> must all be multiples of the logical block size of the file system.
> Under Linux 2.6 alignment to 512-byte boundaries suffices.
>
>
> If I focus on the sentence "The I/O is synchronous, that is, at
> the completion of a read(2) or write(2), data is guaranteed to have
> been transferred. ",
It's extremely important to understand what the guarantee is. The
guarantee is that upon completion of write(), the data will have been
reported as written by the underlying storage subsystem. This does
*not* mean that the data is on disk.
If you have a normal laptop, your disk has a cache. That cache does not
have a battery backup. In normal operation, the cache acts in
write-back mode: when you do a write, the disk reports the write
as completed even though the data is not actually on disk. If you really care
about the data being on disk, you have to either use a disk with a
battery-backed cache (much more expensive) or enable write-through
caching (which will significantly reduce performance).
In the case of KVM, even using write-back caching with the host page
cache, we are still honoring the guarantee of O_DIRECT. We just have
another level of caching that happens to be write-back.
> I think there is a bug here. If I open a
> file with the O_DIRECT flag and the host reports back to me that
> the transfer has completed when in fact it's still in the host cache,
> it's a bug as it violates the open()/write() call and there is no
> guarantee that the data will actually be written.
This is very important: O_DIRECT does *not* guarantee that data actually
resides on disk. There are many possible places where it can be cached
(in the storage controller, in the disks themselves, in a RAID controller).
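As a rough illustration of the difference between "write() returned" and
"the data was flushed", here is a minimal Python sketch. The helper name
write_and_flush and the file name guest.img are made up for this example.
It assumes a POSIX host, and it assumes that fsync() is honored all the way
down, which depends on the filesystem, the kernel and the drive. It does not
use O_DIRECT at all; the point is only that completion of write() and an
explicit flush request are two separate steps.

```python
import os
import tempfile


def write_and_flush(path: str, data: bytes) -> int:
    """Write data to path, then explicitly request a flush (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # A successful os.write() only means the layer below accepted the
            # bytes. They may still sit in a cache somewhere.
            written += os.write(fd, data[written:])
        # fsync() asks the kernel to push the file's data and metadata to the
        # storage device. Whether the device then empties its own write-back
        # cache is outside this program's control.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "guest.img")
        print(write_and_flush(target, b"x" * 4096))
```

An application that cares about durability has to issue the flush itself (or
open the file with a synchronous flag); the completion of write() alone does
not tell it where the data currently lives.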
> So I guess the real issue isn't what the default should be (although
> the performance team at Red Hat would vote for cache=off),
The consensus so far has been that we still want to use the host page
cache, but in write-through mode. This would mean that the guest
would only see a write complete when the host's storage subsystem reports
the write as having completed. This is not the same as cache=off, but I
think it gives the effect that is actually desired.
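To make the three options concrete, here is a small Python sketch of how a
host process could map a cache mode onto open(2) flags. The function name
flags_for_cache_mode and the mode strings are assumptions made for
illustration, not a description of QEMU's actual code. It assumes a Linux
host, where O_DSYNC makes each write() return only after the storage
subsystem has reported the data as written, while reads can still be served
from the host page cache.

```python
import os
import tempfile


def flags_for_cache_mode(mode: str) -> int:
    """Return open(2) flags for a cache mode name (illustrative mapping only)."""
    base = os.O_RDWR | os.O_CREAT
    if mode == "writeback":
        # Host page cache, writes are acknowledged before they are flushed.
        return base
    if mode == "writethrough":
        # Host page cache is still used, but each write is synchronous.
        return base | os.O_DSYNC
    if mode == "off":
        # Bypass the host page cache. O_DIRECT is not defined on every
        # platform, so fall back to 0 where it is missing.
        return base | getattr(os, "O_DIRECT", 0)
    raise ValueError(f"unknown cache mode: {mode!r}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "guest.img")
        fd = os.open(image, flags_for_cache_mode("writethrough"), 0o644)
        try:
            # With O_DSYNC this call returns only after the host reports
            # the write as completed.
            os.write(fd, b"\0" * 512)
        finally:
            os.close(fd)
```

Note that the "off" case needs aligned buffers and offsets for every
transfer, which is why the sketch only performs I/O in write-through mode.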
Do you have another argument for using cache=off?
Regards,
Anthony Liguori
> the real
> issue is that we need to honor the system call from the guest. If
> the file is opened with O_DIRECT on the guest, then the host needs
> to honor that and do the same.
>
> -mark