From: Avi Kivity
Date: Sun, 12 Oct 2008 22:43:29 +0200
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
To: Anthony Liguori
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
    qemu-devel@nongnu.org, Ryan Harper
Message-ID: <48F26171.70109@redhat.com>
In-Reply-To: <48F25720.9010306@codemonkey.ws>

Anthony Liguori wrote:
>>
>> Let me phrase this another way: is there an argument against O_DIRECT?
>
> It slows down any user who frequently restarts virtual machines.

This is an important use case (us developers), but not the majority of
deployments.

> It slows down total system throughput when there are multiple virtual
> machines sharing a single disk.  This latter point is my primary
> concern because in the future, I expect disk sharing to be common in
> some form (either via common QCOW base images or via CAS).

Sharing via qcow base images is also an important use case, but mainly for
desktop workloads.  Server workloads will be able to share a lot less, and
in any case will not keep reloading their text pages the way desktops do.

Regarding CAS, the Linux page cache indexes pages by inode number and
offset, so it cannot share page cache contents without significant rework.
Perhaps ksm could be adapted to do this, but it can't right now.  And
again, server consolidation scenarios, which are mostly unrelated workloads
jammed onto a single host, won't benefit much from this.

>
> I'd like to see a benchmark demonstrating that O_DIRECT improves
> overall system throughput in any scenario today.  I just don't buy that
> the cost of the extra copy is going to be significant today, since the
> CPU cache is already polluted.  I think the burden of proof is on
> O_DIRECT because it's quite simple to demonstrate where it hurts
> performance (just the time it takes to do two boots of the same image).
>
>> In a significant fraction of deployments it will be both simpler and
>> faster.
>>
>
> I think this is speculative.  Is there any performance data to back
> this up?

Given that we don't have a zero-copy implementation yet, it is impossible
to generate real performance data.
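(For concreteness, here is a minimal sketch of the two open(2) policies
being debated.  The image path and the 4 KiB alignment are illustrative
assumptions only; the real alignment requirement for O_DIRECT depends on
the device and filesystem.)

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>

int open_image(const char *path, int use_direct)
{
    /* O_DSYNC: data still passes through the host page cache, but each
     * write reaches stable storage before the syscall returns.
     * O_DIRECT: bypasses the host page cache entirely; buffers, file
     * offsets and lengths must be suitably aligned. */
    return open(path, O_RDWR | (use_direct ? O_DIRECT : O_DSYNC));
}

/* O_DIRECT also constrains the caller: I/O buffers must be aligned.
 * 4096 is an assumption; the device may require a different alignment. */
void *alloc_io_buffer(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;
    return buf;
}

The zero-copy question above is, presumably, whether guest memory can be
handed to the O_DIRECT descriptor directly, or must first be staged
through an aligned bounce buffer such as the one alloc_io_buffer()
returns.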
However, the claim is backed up by experience: all major databases use
direct I/O and do their own caching, and since the data access patterns of
filesystems are similar to those of databases (perhaps less random), there
is a case for not caching them either.

I'll repeat my arguments:

- cache size

  In many deployments we will maximize the number of guests, so free host
  memory will be low.  If your L3 cache (here, the host page cache) is
  smaller than your L2 cache (the guest page cache), your cache hit rate
  will be low.  Guests will write out data they do not expect to need soon
  (the tails of their LRU, or their journals), so caching it is pointless.
  Conversely, they _will_ cache data they have just read.

- cpu cache utilization

  When a guest writes out its page cache, this is likely to be some time
  after the cpu moved the data there, so the data is long gone from the
  cpu cache.  Now we bring it back into the cpu cache twice: once reading
  guest memory, and a second time writing into the host page cache.

  Similarly, when reading from the host page cache into the guest, we have
  no idea whether the guest will actually touch the memory in question.
  It may be doing readahead, or reading a metadata page of which it will
  only access a small part.  So again we waste two pages' worth of cpu
  cache per page we read.  Note also that we have no idea which vcpu will
  use the page, so even if the guest does touch the data, there is a high
  likelihood (for large guests) that it will land in the wrong cpu's
  cache.  (A rough sketch of the two read paths is appended at the end of
  this mail.)

- conflicting readahead heuristics

  The host may attempt to perform readahead on the disk.  However, the
  guest is also doing readahead, so the host extends the readahead further
  than is likely to be a good idea.  Moreover, the guest does logical
  (file-based) readahead while the host does physical (disk-order-based)
  readahead, or qcow-level readahead, which is basically reading random
  blocks.

Now, I don't have data that demonstrates how bad these effects are, but I
think there are sufficient arguments here to justify adding O_DIRECT.  I
intend to recommend O_DIRECT unless I see performance data that favours
O_DSYNC in real-world scenarios that take into account bandwidth, cpu
utilization, and memory utilization (i.e. a 1G guest on a 32G host running
fio but not top doesn't count).

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
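Appended: a rough sketch of the two read paths mentioned under "cpu cache
utilization".  guest_ram, BLK and the function names are illustrative
assumptions, not actual QEMU code.

#include <sys/types.h>
#include <unistd.h>

#define BLK 4096

/* Buffered path (O_DSYNC or default): on a miss, the disk DMAs into the
 * host page cache, then the cpu copies the data from the page cache into
 * guest RAM.  That copy drags the data through the cpu cache and leaves a
 * second copy of it resident in host memory. */
static ssize_t read_buffered(int fd_buffered, void *guest_ram, off_t ofs)
{
    return pread(fd_buffered, guest_ram, BLK, ofs);
}

/* Direct path (O_DIRECT): with guest_ram suitably aligned, the disk DMAs
 * straight into guest RAM.  The cpu never touches the payload and nothing
 * is left behind in the host page cache. */
static ssize_t read_direct(int fd_direct, void *guest_ram_aligned, off_t ofs)
{
    return pread(fd_direct, guest_ram_aligned, BLK, ofs);
}

The syscall is the same in both cases; the difference in behaviour comes
entirely from whether the descriptor was opened with O_DIRECT.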