From: Avi Kivity
Date: Sun, 12 Oct 2008 20:34:08 +0200
Message-ID: <48F24320.9010201@redhat.com>
In-Reply-To: <48F23AF1.2000104@codemonkey.ws>
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
To: Anthony Liguori
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
 qemu-devel@nongnu.org, Ryan Harper

Anthony Liguori wrote:
>>
>> Getting good performance because we have a huge amount of free memory
>> in the host is not a good benchmark. Under most circumstances, the
>> free memory will be used either for more guests, or will be given to
>> the existing guests, which can utilize it more efficiently than the
>> host.
>
> There are two arguments for O_DIRECT. The first is that you can avoid
> bringing data into the CPU cache. This requires zero-copy in QEMU,
> but ignoring that, the use of the page cache doesn't necessarily
> prevent us from achieving this.
>
> In the future, most systems will have a DMA offload engine. This is a
> pretty obvious thing to attempt to accelerate with such an engine,
> which would prevent cache pollution.

But it would increase latency, memory bus utilization, and CPU
overhead. In the cases where the page cache buys us something (host
page cache significantly larger than guest memory), that's
understandable. But for the other cases, why bother? Especially when
many systems don't have this today.

Let me phrase this another way: is there an argument against O_DIRECT?
In a significant fraction of deployments it will be both simpler and
faster.
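For reference, here is a minimal sketch of what an O_DIRECT read looks
like from userspace (illustration only, not QEMU's block layer; the
4096-byte block size is an assumption). The main constraint O_DIRECT
adds is that the buffer, offset, and length must be suitably aligned:

    /* Illustration only -- not QEMU code. Reads one block of a file
     * or device with O_DIRECT, i.e. without going through the host
     * page cache. Assumes 4096-byte alignment is sufficient. */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGNMENT 4096

    int main(int argc, char **argv)
    {
        void *buf;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* O_DIRECT requires an aligned buffer. */
        if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT)) {
            return 1;
        }
        /* Read one aligned block, bypassing the host page cache. */
        if (pread(fd, buf, ALIGNMENT, 0) < 0) {
            perror("pread");
            return 1;
        }
        free(buf);
        close(fd);
        return 0;
    }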
> Another possibility is to directly map the host's page cache into the
> guest's memory space.

Doesn't work with large pages.

> The latter is a bit tricky but is so much more interesting, especially
> if you have a strong storage backend that is capable of deduplication
> (you get memory compaction for free).

It's not free at all. Replacing a guest memory page involves IPIs and
TLB flushes. It only works on small pages, and only if the host page
cache and guest page cache are aligned with each other. And with
current Linux memory management, I don't see a way to do it that
doesn't involve creating a vma for every page, which is prohibitively
expensive.

> I also have my doubts that the amount of memory saved by using
> O_DIRECT will have a noticeable impact on performance, considering
> that guest memory and page cache memory are entirely reclaimable.

O_DIRECT is not about saving memory; it is about saving CPU
utilization, cache utilization, and memory bandwidth.

> An LRU should make the best decisions about whether memory is more
> valuable for the guests or for the host page cache.

LRU typically makes fairly bad decisions, since it throws away most of
the information it has. I recommend looking up LRU-K and similar
algorithms, just to get a feel for this; LRU is basically the simplest
possible algorithm short of random selection. Note that Linux doesn't
even have a true LRU; it has to approximate, since it can't sample all
of the pages all of the time. With a hypervisor that uses Intel's EPT
it's even worse, since we don't have an accessed bit.

On silly benchmarks that just exercise the disk and touch no memory,
and if you tune the host very aggressively, LRU will win on long
running guests, since it will eventually page out all unused guest
memory (with Linux guests, it will never even page guest memory in).
On real-life applications I don't think there is much chance.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
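To make the LRU vs. LRU-K point above concrete, here is a toy sketch
(illustration only; nothing like the actual Linux reclaim code, and the
page names and reference times are made up). Plain LRU looks only at
the most recent reference, so a page touched once by a streaming scan
looks "fresh" and survives at the expense of a page that is actually
reused; LRU-2 ranks pages by their second-most-recent reference and
evicts the one-shot page instead:

    /* Toy comparison of LRU and LRU-2 victim selection -- illustration
     * only, unrelated to the Linux implementation. Each page tracks
     * its last two reference times; prev_ref == 0 means the page has
     * been referenced only once. */
    #include <stdio.h>

    #define NPAGES 4

    struct page {
        const char *name;
        unsigned long last_ref;   /* most recent reference time */
        unsigned long prev_ref;   /* second most recent, 0 if none */
    };

    /* LRU: evict the page with the oldest most-recent reference. */
    static int lru_victim(const struct page *p, int n)
    {
        int i, victim = 0;
        for (i = 1; i < n; i++)
            if (p[i].last_ref < p[victim].last_ref)
                victim = i;
        return victim;
    }

    /* LRU-2: evict the page with the oldest second-most-recent
     * reference, so one-shot pages go first. */
    static int lru2_victim(const struct page *p, int n)
    {
        int i, victim = 0;
        for (i = 1; i < n; i++)
            if (p[i].prev_ref < p[victim].prev_ref)
                victim = i;
        return victim;
    }

    int main(void)
    {
        struct page pages[NPAGES] = {
            { "guest-hot",     95, 90 },
            { "guest-warm",    80, 70 },
            { "host-cache",    85, 60 },
            { "one-shot-scan", 99,  0 },  /* streaming read, touched once */
        };

        printf("LRU evicts:   %s\n", pages[lru_victim(pages, NPAGES)].name);
        printf("LRU-2 evicts: %s\n", pages[lru2_victim(pages, NPAGES)].name);
        return 0;
    }

Here LRU keeps the one-shot scan page and evicts a page that is being
reused; LRU-2, by keeping one extra reference per page, does the
opposite.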