Date: Wed, 21 May 2008 19:47:54 +0200
From: Andrea Arcangeli
Subject: Re: [Qemu-devel] Re: [PATCH][v2] Align file accesses with cache=off (O_DIRECT)
Message-ID: <20080521174754.GG22488@duo.random>
In-Reply-To: <48345949.4050903@qumranet.com>
To: Avi Kivity
Cc: Blue Swirl, Laurent Vivier, qemu-devel@nongnu.org, Paul Brook

On Wed, May 21, 2008 at 08:18:01PM +0300, Avi Kivity wrote:
> Yes, that's the reason. Here zerocopy is not the motivation; instead, we
> have host-cached pages that are used directly in the guest. So we get both
> reduced memory footprint, and host caching. O_DIRECT reduces the memory
> footprint but kills host caching.

Sure. So MAP_SHARED+remap_file_pages should work just fine to achieve
zerocopy I/O.

> The scenario is desktop/laptop use. For server use O_DIRECT is clearly
> preferred due to much reduced overhead.

Well, in some ways there's more overhead with O_DIRECT, because O_DIRECT
has to call get_user_pages and walk pagetables in software before every
I/O operation. MAP_SHARED walks them in hardware and can take advantage
of the CPU TLB too.

The primary problem with MAP_SHARED isn't the overhead of the operation
itself, which in fact will be lower with MAP_SHARED once the cache is
allocated, but the write throttling and garbage collection of the host
caches. If you have a 250G guest image, MAP_SHARED will allocate as much
as 250G of cache, and a cp /dev/zero /dev/hdb in the guest will mark 100%
of guest RAM dirty. The mkclean methods and write throttling for
MAP_SHARED introduced in reasonably recent kernels can avoid filling 100%
of host RAM with dirty pages, but it still requires write throttling,
it'll pollute the host caches, and it can result in large RAM allocations
in the host having to block before the RAM is available, the same way as
buffered writes in the host (the current KVM default). I think O_DIRECT is
the best solution, and MAP_SHARED could become a secondary option just for
certain guest workloads with light I/O where fairness isn't even a
variable worth considering.
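To make the alignment requirement the patch is about concrete, here's a
minimal userland sketch (not the actual qemu code; the image name and
sizes are made up): with O_DIRECT the buffer, the file offset and the
length all have to be aligned to the block size, and the read bypasses
the host cache entirely.

    /* O_DIRECT read sketch: buffer, offset and length must all be aligned
       (512 bytes is the common logical block size, 4096 is safer). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096, len = 64 * 1024;
        void *buf;
        int fd = open("disk.img", O_RDONLY | O_DIRECT); /* hypothetical image */

        if (fd < 0 || posix_memalign(&buf, align, len)) {
            perror("setup");
            return 1;
        }
        ssize_t n = pread(fd, buf, len, 0);             /* offset 0 is aligned */
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes bypassing the host cache\n", n);
        free(buf);
        close(fd);
        return 0;
    }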
The cost of garbage collecting the mapped caches on the host isn't
trivial either, and I don't mean just because the nonlinear rmap logic
has to scan all pagetables; that's a minor cost compared to shrinking the
host caches before try_to_unmap is ever invoked, etc.

Leaving the host caches purely for host usage is surely fairer: one guest
doing heavy I/O will never thrash the host caches and leave all other
guests and host tasks hanging. If they hang for a few msec with O_DIRECT,
it'll be because they're waiting for I/O and the elevator put them in the
queue to wait for the disk to become ready. But it won't be because of
write throttling during writes, or because the alloc_pages shrink paths
are calling ->writepage on the dirty pages.

The other significant advantage of O_DIRECT is that you won't have to
call msync to provide journaling.

I think O_DIRECT will work best for all usages, and it looks like a
higher priority to have than MAP_SHARED. MAP_SHARED will surely result in
better benchmarks for certain workloads though: imagine
'dd if=/dev/hda of=/dev/zero iflag=direct bs=1M count=100' run in the
guest; starting from the second run it'll read from the host cache and do
zero I/O with MAP_SHARED ;). If it were me, I'd prefer O_DIRECT by
default.

For full disclosure, you may also want to read this, though I strongly
disagree with those statements: http://kerneltrap.org/node/7563
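For comparison, the MAP_SHARED approach I'm arguing against would look
roughly like this (again only a sketch, with a made-up image name and
size): the store only dirties host pagecache, and an explicit msync is
what you'd need to get the ordering/journaling guarantees that the
O_DIRECT path gives you without extra syscalls.

    /* MAP_SHARED write sketch: the memset dirties host pagecache with no
       I/O, and msync() is needed before the data is known to be on disk.
       Assumes the (hypothetical) image is at least 64k long. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 64 * 1024;
        int fd = open("disk.img", O_RDWR);              /* hypothetical image */
        if (fd < 0)
            return 1;
        void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            return 1;
        memset(map, 0, len);          /* dirties host cache, no I/O yet       */
        msync(map, len, MS_SYNC);     /* force writeback, e.g. when the guest
                                         issues a barrier/flush               */
        munmap(map, len);
        close(fd);
        return 0;
    }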