From: Avi Kivity <avi@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Chris Wright <chrisw@redhat.com>,
	Mark McLoughlin <markmc@redhat.com>,
	kvm-devel <kvm-devel@lists.sourceforge.net>,
	Laurent Vivier <Laurent.Vivier@bull.net>,
	qemu-devel@nongnu.org, Ryan Harper <ryanh@us.ibm.com>
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Date: Sun, 12 Oct 2008 22:43:29 +0200
Message-ID: <48F26171.70109@redhat.com>
In-Reply-To: <48F25720.9010306@codemonkey.ws>

Anthony Liguori wrote:
>>
>> Let me phrase this another way: is there an argument against O_DIRECT?   
>
> It slows down any user who frequently restarts virtual machines.  

This is an important use case (for us developers), but not one that
covers the majority of deployments.

> It slows down total system throughput when there are multiple virtual
> machines sharing a single disk.  This latter point is my primary
> concern because in the future, I expect disk sharing to be common in
> some form (either via common QCOW base images or via CAS).

Sharing via qcow base images is also an important use case, but mainly
for desktop workloads.  Server workloads will be able to share much
less, and in any case will not keep reloading their text pages as
desktops do.

Regarding CAS (content-addressed storage), the Linux page cache indexes
pages by inode number and offset, so it cannot share page cache contents
between files without significant rework.  Perhaps ksm could be adapted
to do this, but it can't right now.  And again, server consolidation
scenarios, which are mostly unrelated workloads jammed onto a single
host, won't benefit much from this.

>
> I'd like to see a benchmark demonstrating that O_DIRECT improves
> overall system throughput in any scenario today.  I just don't buy
> that the cost of the extra copy today is going to be significant since
> the CPU cache is already polluted.  I think the burden of proof is on
> O_DIRECT because it's quite simple to demonstrate where it hurts
> performance (just the time it takes to do two boots of the same image).
>
>> In a significant fraction of deployments it will be both simpler and
>> faster.
>>   
>
> I think this is speculative.  Is there any performance data to back
> this up?

Given that we don't have a zero-copy implementation yet, it is
impossible to generate real performance data.  However, it is backed up
by experience: all major databases use direct I/O and do their own
caching, and since the access patterns of filesystems are similar to
those of databases (perhaps less random), there's a case for not
caching them.
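For reference, this is what direct I/O demands of the application (a
minimal sketch; the filename and sizes are made up, and error handling
is omitted).  O_DIRECT requires the buffer, offset and length to be
aligned, which is exactly why databases manage their own aligned buffer
pools:

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("guest.img", O_RDWR | O_DIRECT); /* bypass host cache */

        /* O_DIRECT requires aligned buffer, offset and length;
         * page alignment is always safe */
        posix_memalign(&buf, 4096, 4096);
        pread(fd, buf, 4096, 0);  /* DMA straight into buf, no extra copy */
        close(fd);
        return 0;
    }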

I'll repeat my arguments:

- cache size

In many deployments we will maximize the number of guests, so free host
memory will be low.  If your L3 cache (the host page cache) is smaller
than your L2 cache (the guest's own page cache), your cache hit rate
will be low.

Guests will write out data they are not expecting to need soon (the
tails of their LRU lists, or their journals), so caching it on the host
is pointless.  Conversely, data they have just read they _will_ cache
themselves, so a second host-side copy buys nothing.
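(Even with buffered I/O one could hint this to the host; a sketch,
assuming a Linux host, purely to illustrate the point rather than to
propose it:)

    /* after completing a guest writeback with buffered I/O, tell the
     * host kernel we don't expect to read the data back, so the page
     * cache copy can be dropped */
    pwrite(fd, buf, len, off);
    fdatasync(fd);                                    /* make it stable */
    posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);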

- cpu cache utilization

When a guest writes out its page cache, this is likely to be some time
after the cpu moved the data there, so the data is no longer in the cpu
cache.  Now we're bringing it back into the cpu cache twice (once
reading guest memory, a second time writing to the host page cache).

Similarly, when reading from the host page cache into the guest, we have
no idea whether the guest will actually touch the memory in question.
It may be doing readahead, or reading a metadata page of which it will
only access a small part.  So again we're wasting two pages' worth of
cpu cache per page we're reading.

Note also that we have no idea which vcpu will use the page, so even if
the guest does touch the data, there is a high likelihood (for large
guests) that it will land in the wrong physical cpu's cache.

- conflicting readahead heuristics

The host may attempt to perform readahead on the disk.  However, the
guest is also doing readahead, so the host extends the readahead further
than is likely to be a good idea.  Moreover, the guest does logical
(file-based) readahead while the host does physical (disk-order based)
readahead, or qcow-level readahead, which amounts to reading random
blocks.
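(Again for illustration only: even without O_DIRECT, host-side readahead
could be suppressed per file:)

    /* a length of 0 applies the advice to the whole file; on Linux,
     * POSIX_FADV_RANDOM disables readahead for it */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);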

Now, I don't have data that demonstrates how bad these effects are, but
I think there are sufficient arguments here to justify adding O_DIRECT.
I intend to recommend O_DIRECT unless I see performance data that
favours O_DSYNC in real-world scenarios that take into account
bandwidth, cpu utilization, and memory utilization (i.e. a 1G guest on
a 32G host running fio but not top doesn't count).
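For concreteness, the kind of fio job such a comparison would run inside
the guest (a sketch; the device name and parameters are placeholders),
flipping direct= between runs while also watching host cpu and memory
utilization:

    [global]
    ioengine=libaio
    direct=1          ; set to 0 for the buffered comparison run
    bs=4k
    iodepth=32
    runtime=60

    [randrw]
    rw=randrw
    filename=/dev/vdb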

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
