From: Avi Kivity <avi@redhat.com>
To: qemu-devel@nongnu.org
Cc: Chris Wright <chrisw@redhat.com>,
Mark McLoughlin <markmc@redhat.com>,
Ryan Harper <ryanh@us.ibm.com>,
kvm-devel <kvm-devel@lists.sourceforge.net>,
Laurent Vivier <Laurent.Vivier@bull.net>
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Date: Sun, 12 Oct 2008 12:12:06 +0200
Message-ID: <48F1CD76.2000203@redhat.com>
In-Reply-To: <48F10DFD.40505@codemonkey.ws>
Anthony Liguori wrote:
>>
>> If I focus on the sentence "The I/O is synchronous, that is, at
>> the completion of a read(2) or write(2), data is guaranteed to have
>> been transferred. ",
>
> It's extremely important to understand what the guarantee is. The
> guarantee is that upon completion of write(), the data will have been
> reported as written by the underlying storage subsystem. This does
> *not* mean that the data is on disk.
>
It means that as far as the block-io layer of the kernel is concerned,
the guarantee is met. If the writes go to a ramdisk, or to an IDE
drive with write-back cache enabled, or to a disk with write-back cache
disabled but without redundancy, or to a high-end storage array with
double-parity protection but without a continuous data protection
offsite solution, things may still go wrong.

It is up to qemu to provide a strong link in the data reliability chain,
not to ensure that the entire chain is perfect. That's up to the
administrator or builder of the system.

> If you have a normal laptop, your disk has a cache. That cache does
> not have a battery backup. Under normal operations, the cache is
> acting in write-back mode and when you do a write, the disk will
> report the write as completed even though it is not actually on disk.
> If you really care about the data being on disk, you have to either
> use a disk with a battery backed cache (much more expensive) or enable
> write-through caching (will significantly reduce performance).
>
I think that with SATA NCQ, this is no longer true. The drive will
report the write complete when it is on disk, and utilize multiple
outstanding requests to get coalescing and reordering. Not sure about
this, though -- some drives may still be lying.

> In the case of KVM, even using write-back caching with the host page
> cache, we are still honoring the guarantee of O_DIRECT. We just have
> another level of caching that happens to be write-back.

No, we are lying. That's fine if the user tells us to lie, but not
otherwise.

>> I think there is a bug here. If I open a
>> file with the O_DIRECT flag and the host reports back to me that
>> the transfer has completed when in fact it's still in the host cache,
>> it's a bug as it violates the open()/write() call and there is no
>> guarantee that the data will actually be written.
>
> This is very important: O_DIRECT does *not* guarantee that data
> actually resides on disk. There are many possible places that it can
> be cached (in the storage controller, in the disks themselves, in a
> RAID controller).

O_DIRECT guarantees that the kernel is not the weak link in the
reliability chain.
>
>> So I guess the real issue isn't what the default should be (although
>> the performance team at Red Hat would vote for cache=off),
>
> The consensus so far has been that we want to still use the host page
> cache but use it in write-through mode. This would mean that the
> guest would only see data completion when the host's storage subsystem
> reports the write as having completed. This is not the same as
> cache=off but I think gives the real effect that is desired.

I am fine with write-through as default, but cache=off should be a
supported option.
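
To make the three policies under discussion concrete, here is a minimal sketch of how they could map onto open(2) flags on a Linux host. This is an illustration only, not qemu's actual code: the function name open_flags_for_cache_mode and the mode strings "writeback", "writethrough" and "off" are hypothetical, and the choice of O_DSYNC for write-through is an assumption about one reasonable way to get that behaviour.

```python
import os


def open_flags_for_cache_mode(mode):
    """Return open(2) flags for a hypothetical disk-image cache mode.

    Illustrative mapping only; the mode names are assumptions made
    for this sketch.
    """
    flags = os.O_RDWR

    if mode == "writeback":
        # write() completes once the data is in the host page cache;
        # the guest is told "done" before the storage has the data.
        return flags

    if mode == "writethrough":
        # The host page cache is still used, but write() returns only
        # after the storage subsystem reports the data as written.
        return flags | os.O_DSYNC

    if mode == "off":
        # Bypass the host page cache entirely. O_DIRECT is not
        # available on every platform, so fail loudly if it is missing.
        direct = getattr(os, "O_DIRECT", None)
        if direct is None:
            raise OSError("O_DIRECT is not available on this platform")
        return flags | direct

    raise ValueError("unknown cache mode: %r" % (mode,))
```

The result would be passed to os.open(path, flags). Note that with O_DIRECT the kernel typically requires suitably aligned buffers, offsets and lengths, and that in all three cases the disk's own write cache remains a separate link in the chain.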
>
> Do you have another argument for using cache=off?

Performance.

--
error compiling committee.c: too many arguments to function