Date: Fri, 10 Oct 2008 14:56:17 +0200
From: Avi Kivity
To: qemu-devel@nongnu.org
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU

Anthony Liguori wrote:
>>
>> For server partitioning, data integrity and performance are
>> critical. The host page cache is significantly smaller than the
>> guest page cache; if you have spare memory, give it to your guests.
>
> I don't think this wisdom is bullet-proof. In the case of server
> partitioning, if you're designing for the future then you can assume
> some form of host data deduplication, whether through qcow
> deduplication, a proper content-addressable storage mechanism, or
> file-system-level deduplication. It's becoming more common to see
> large amounts of homogeneous consolidation, whether because of cloud
> computing, virtual appliances, or simply because most x86
> virtualization involves Windows consolidation and there aren't that
> many versions of Windows.
>
> In this case, there is an awful lot of opportunity to increase
> overall system throughput by caching data that is accessed in common
> across virtual machines.

That's true. But is the OS image a significant source of I/O in a
running system? My guess is that it is not.

In any case, deduplication is far enough in the future that we should
not try to solve for it now. Caching may well end up being part of the
deduplication mechanism itself; for example, it may choose to cache
shared data (which is read-only anyway) even with O_DIRECT.

>
>> O_DIRECT is practically mandated here; the host page cache does
>> nothing except impose an additional copy.
>>
>> Given the rather small difference between O_DSYNC and O_DIRECT, I
>> favor not adding O_DSYNC, as it will add only marginal value.
>
> The difference isn't small. Our fio runs are defeating the host page
> cache on writes, so we're adjusting the working set size. But the
> difference in read performance between dsync and direct is a large
> factor when the data can be cached.
>

That's because you're leaving host memory idle, which is not a
realistic scenario. What happens if you assign that free host memory
to the guest?
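For reference, the three behaviors we keep comparing come down to the
flags passed to open(2). Here is a minimal sketch, with made-up names
(open_disk, cache_mode) rather than anything from qemu's block layer;
it also shows the buffer alignment that O_DIRECT demands:

/* Sketch only -- illustrative names, not qemu code. */
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum cache_mode { CACHE_WRITEBACK, CACHE_DSYNC, CACHE_OFF };

static int open_disk(const char *path, enum cache_mode mode)
{
    int flags = O_RDWR;

    switch (mode) {
    case CACHE_WRITEBACK:       /* host page cache, lazy writeout */
        break;
    case CACHE_DSYNC:           /* cached reads, but a write completes
                                   only after the data reaches the disk */
        flags |= O_DSYNC;
        break;
    case CACHE_OFF:             /* bypass the host page cache entirely */
        flags |= O_DIRECT;
        break;
    }
    return open(path, flags);
}

int main(int argc, char **argv)
{
    void *buf;
    int fd = open_disk(argc > 1 ? argv[1] : "disk.img", CACHE_OFF);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* O_DIRECT I/O must be aligned to the logical sector size. */
    if (posix_memalign(&buf, 512, 4096)) {
        close(fd);
        return 1;
    }
    memset(buf, 0, 4096);
    if (pwrite(fd, buf, 4096, 0) < 0)
        perror("pwrite");
    free(buf);
    close(fd);
    return 0;
}

The alignment requirement is also why O_DIRECT is more than a flag
flip for the block layer: unaligned guest requests need bounce
buffers.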
>> Regarding choosing the default value, I think we should change the
>> default to be safe, that is, O_DIRECT. If that is regarded as too
>> radical, the default should be O_DSYNC, with options to change it to
>> O_DIRECT or writeback. Note that some disk formats, such as qcow2,
>> will need updating if they are not to have abysmal performance.
>
> I think qcow2 will be okay, because the only issue is image expansion
> and that is a relatively uncommon case whose cost is amortized over
> the lifetime of the VM. So far, while there has been objection to
> using O_DIRECT by default, I haven't seen any objection to O_DSYNC by
> default, so as long as no one objects in the next few days, I think
> that's what we'll end up doing.

I don't mind that, as long as there is a way to request O_DIRECT
(which I think is cache=off under your proposal).

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.