qemu-devel.nongnu.org archive mirror
From: Anthony Liguori <anthony@codemonkey.ws>
To: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: Chris Wright <chrisw@redhat.com>,
	Mark McLoughlin <markmc@redhat.com>,
	kvm-devel <kvm-devel@lists.sourceforge.net>,
	Laurent Vivier <Laurent.Vivier@bull.net>,
	Ryan Harper <ryanh@us.ibm.com>
Subject: [Qemu-devel] [RFC] Disk integrity in QEMU
Date: Thu, 09 Oct 2008 12:00:41 -0500
Message-ID: <48EE38B9.2050106@codemonkey.ws>

Hi,

There's been a lot of discussion recently, mostly in other places, about 
disk integrity and performance in QEMU.  I must admit, my own thinking 
has changed pretty recently in this space.  I wanted to try and focus 
the conversation on qemu-devel so that we could get everyone involved 
and come up with a plan for the future.

Right now, QEMU can open a file in two ways.  It can open it without any 
special caching flags (the default) or it can open it with O_DIRECT.  
O_DIRECT means that the IO does not go through the host page cache.  
This is controlled with cache=on and cache=off respectively.
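
As a rough illustration of the two modes, the difference comes down to 
one flag passed to open(2).  This is only a sketch, not QEMU's actual 
code: the helper names image_open_flags() and image_open() are made up 
here, and it assumes Linux with glibc, where O_DIRECT is only visible 
when _GNU_SOURCE is defined.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc only exposes O_DIRECT with this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: translate the cache=on/off setting into
 * open(2) flags. */
static int image_open_flags(bool cache_on)
{
    int flags = O_RDWR;

    if (!cache_on) {
        flags |= O_DIRECT;     /* cache=off: bypass the host page cache */
    }
    return flags;
}

/* Hypothetical helper: open an existing disk image in the chosen mode.
 * Returns a file descriptor, or -1 with errno set on failure. */
static int image_open(const char *path, bool cache_on)
{
    assert(path != NULL);
    return open(path, image_open_flags(cache_on));
}
```

Keep in mind that O_DIRECT may also impose alignment restrictions on 
buffers, offsets and lengths, which the sketch does not deal with.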

When cache=on, read requests may not actually go to the disk.  If a 
previous read request (by some application on the system) has read the 
same data, then it becomes a simple memcpy().  Also, the host IO 
scheduler may do read-ahead, which means that the data may already be 
available from that.  In general, the host knows the most about the 
underlying disk system and the total IO load on the system, so it is far 
better suited to optimize these sorts of things than the guest.

Write requests end up being simple memcpy()s too as the data is just 
copied into the page cache and the page is scheduled to be eventually 
written to disk.  Since we don't know when the data is actually written 
to disk, we tell the guest the data is written before it actually is.
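
To make the gap concrete: with the default flags, write(2) returns once 
the data has been copied into the page cache, and it takes an explicit 
fdatasync(2) to wait for it to reach the storage system.  A minimal 
sketch, assuming a POSIX host; write_cached() and write_durable() are 
made-up names, not QEMU functions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fdatasync() under strict modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer.  Returns 0 on success, -1 on error.
 * On success the data is only known to be in the host page cache. */
static int write_cached(int fd, const void *buf, size_t len)
{
    const char *p = (const char *)buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;         /* interrupted, retry */
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Same as above, but do not report success until the kernel has
 * pushed the data out to the storage system. */
static int write_durable(int fd, const void *buf, size_t len)
{
    if (write_cached(fd, buf, len) < 0) {
        return -1;
    }
    return fdatasync(fd);
}
```

Completing a guest write after write_cached() alone is the "data is 
written before it actually is" case described above.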

If you assume that the host is stable, then there isn't an integrity 
issue.  This assumes that you have backup power and that the host OS has 
no bugs.  It's not a totally unreasonable assumption but for a large 
number of users, it's not a good assumption.

A side effect of cache=off is that data integrity only depends on the 
integrity of your storage system (which isn't always safe, btw), which is 
probably closer to what most users expect.  There are many other side 
effects, though.

An alternative to cache=off that addresses the data integrity problem 
directly is to open all disk images with O_DSYNC.  This will still use 
the host page cache (and therefore get all the benefits of it) but will 
only signal write completion when the data is actually written to disk.  
The effect of this is to make the integrity of the VM equal the 
integrity of the storage system (no longer relying on the host).  By 
still going through the page cache, you still get the benefits of the 
host's IO scheduler and read-ahead.  The only operations whose 
performance is affected are writes (reads are equivalent).  If you run a 
write benchmark in a guest today, you'll see a number that is higher 
than native, because writes are reported complete as soon as they reach 
the page cache.  The implication here is that data integrity is not 
being maintained if you don't trust the host.  O_DSYNC takes care of this.
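
For reference, a minimal sketch of what opening an image with O_DSYNC 
looks like.  This is not the patch itself; image_open_dsync() is a 
made-up name, and a POSIX host is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for O_DSYNC under strict modes */
#endif
#include <fcntl.h>

/* Hypothetical helper: open an existing disk image so that IO still
 * goes through the host page cache, but each write(2) only returns
 * once the data has been handed to the storage system.
 * Returns a file descriptor, or -1 with errno set on failure. */
static int image_open_dsync(const char *path)
{
    return open(path, O_RDWR | O_DSYNC);
}
```

Reads on such a descriptor are served from the page cache exactly as 
with the default flags; only write completion changes.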

Read performance should be unaffected by using O_DSYNC.  O_DIRECT will 
significantly reduce read performance.  I think we should use O_DSYNC by 
default and I have sent out a patch that contains that.  We will follow 
up with benchmarks to demonstrate this.

There are certain benefits to using O_DIRECT.  One argument for it is 
that, without O_DIRECT, every IO has to allocate memory in the host 
page cache.  If you are not sharing data between guests, and the guest 
has a relatively large amount of memory compared to the host, and you 
have a simple disk in the host, going through the host page cache wastes 
some memory that could be used to cache other IO operations on the 
system.  I don't really think this is the typical case so I don't think 
this is an argument for having it on by default.  However, it can be 
enabled if you know this is going to be the case.

The biggest benefit to using O_DIRECT is that you can potentially avoid 
ever bringing data into the CPU's cache.  Once data is cached, copying it 
is relatively cheap.  If you're never going to touch the data (think, 
disk DMA => nic DMA via sendfile()), then avoiding the CPU cache can be 
a big win.  Again, I don't think this is the common case but the option 
is there in case it's suitable.

An important point is that today, we always copy data internally in QEMU, 
which means that, practically speaking, you'll never see this benefit.

So to summarize, I think we should enable O_DSYNC by default to ensure 
that guest data integrity is not dependent on the host OS.  Practically 
speaking, cache=off is only useful in very specialized circumstances.  
Part of the patch I'll follow up with includes changes to the man page 
to document all of this for users.

Thoughts?

Regards,

Anthony Liguori
