From: Avi Kivity
Date: Wed, 21 May 2008 21:29:01 +0300
Subject: Re: [Qemu-devel] Re: [PATCH][v2] Align file accesses with cache=off (O_DIRECT)
Message-ID: <483469ED.1050408@qumranet.com>
In-Reply-To: <20080521174754.GG22488@duo.random>
To: Andrea Arcangeli
Cc: Blue Swirl, Laurent Vivier, qemu-devel@nongnu.org, Paul Brook

Andrea Arcangeli wrote:
> On Wed, May 21, 2008 at 08:18:01PM +0300, Avi Kivity wrote:
>
>> Yes, that's the reason. Here zerocopy is not the motivation; instead, we
>> have host-cached pages that are used directly in the guest. So we get
>> both reduced memory footprint and host caching. O_DIRECT reduces the
>> memory footprint but kills host caching.
>
> Sure. So MAP_SHARED+remap_file_pages should work just fine to achieve
> zerocopy I/O.

No: with MAP_SHARED, when the guest writes to that memory it will affect
the disk, which doesn't happen with normal memory writes. MAP_PRIVATE is
needed.

>> The scenario is desktop/laptop use. For server use O_DIRECT is clearly
>> preferred due to much reduced overhead.
>
> Well, in some ways there's more overhead with O_DIRECT, because O_DIRECT
> has to call get_user_pages and walk pagetables in software before every
> I/O operation. MAP_SHARED walks them in hardware and can take advantage
> of the CPU TLB too.
>
> The primary problem with MAP_SHARED isn't the overhead of the operation
> itself, which in fact will be lower with MAP_SHARED once the cache is
> allocated, but the write throttling and garbage collection of the host
> caches. If you have a 250G guest image, MAP_SHARED will allocate as much
> as 250G of cache, and a cp /dev/zero /dev/hdb in the guest will mark
> 100% of guest RAM dirty. The mkclean methods and write throttling for
> MAP_SHARED introduced in reasonably recent kernels can avoid filling
> 100% of host RAM with dirty pages, but it still requires write
> throttling, it pollutes the host caches, and it can result in large RAM
> allocations in the host having to block before the RAM is available,
> the same way as buffered writes in the host (the current KVM default).

I'd do writes via the normal write path, not mmap().
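
(An aside for anyone following the cache=off discussion without the patch
in front of them: below is a minimal sketch of what O_DIRECT imposes on
the I/O path. It is not the patch itself, just an illustration; the
512-byte alignment is an assumption, the real constraint depends on the
filesystem and device.)

#define _GNU_SOURCE              /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t align = 512;    /* assumed; often the logical block size */
    const size_t len   = 4096;   /* length must be a multiple of align */
    void *buf;
    int fd;

    if (argc < 2)
        return 1;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* the buffer must be suitably aligned; plain malloc() is not enough */
    if (posix_memalign(&buf, align, len)) {
        close(fd);
        return 1;
    }

    /* the file offset and length must also be aligned, hence offset 0 here */
    if (pread(fd, buf, len, 0) < 0)
        perror("pread");

    free(buf);
    close(fd);
    return 0;
}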
> I think O_DIRECT is the best solution, and MAP_SHARED could become a
> secondary option just for certain guest workloads with light I/O where
> fairness isn't even a variable worth considering.
>
> The cost of garbage collection of the mapped caches on the host isn't
> trivial, and I don't mean because the nonlinear rmap logic has to scan
> all pagetables; that's a minor cost compared to shrinking the host
> caches before try_to_unmap is ever invoked, etc. Leaving the host caches
> purely for host usage is surely more fair: it won't ever lead to one
> guest doing heavy I/O thrashing the host caches and leaving all other
> guests and host tasks hanging. If they hang for a few msec with O_DIRECT
> it'll be because they're waiting for I/O and the elevator put them on
> the queue to wait for the disk to become ready, not because of write
> throttling during writes, or because the alloc_pages shrink methods are
> calling ->writepage on the dirty pages.
>
> The other significant advantage of O_DIRECT is that you won't have to
> call msync to provide journaling.
>
> I think O_DIRECT will work best for all usages, and it looks higher
> priority to me than MAP_SHARED. MAP_SHARED will surely result in better
> benchmarks for certain workloads, though: imagine 'dd if=/dev/hda
> of=/dev/zero iflag=direct bs=1M count=100' run in the guest; it'll read
> from cache and do zero I/O starting from the second run with
> MAP_SHARED ;).
>
> If it were me, I'd prefer O_DIRECT by default.

Certainly O_DIRECT is the normal path. We're considering mmap() as a way
to get both host caching and avoid double-caching.

> For full disclosure, you may also want to read this, but I strongly
> disagree with those statements. http://kerneltrap.org/node/7563

I disagree with them strongly too. For general purpose applications you
want to avoid O_DIRECT, but for special purpose applications that do
their own caching (databases, virtualization, streaming servers),
O_DIRECT is critical. The kernel's cache management algorithms simply
cannot compete with a specially tuned application, not to mention the
additional overhead that comes from crossing a protection boundary.

[I've worked on a userspace filesystem that took every possible measure
to get the OS out of the way: user level threads, O_DIRECT, aio, large
pages.]

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.