From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4C978071.2010209@codemonkey.ws>
Date: Mon, 20 Sep 2010 10:40:33 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
References: <1284991010-10951-1-git-send-email-kwolf@redhat.com>
	<4C977028.3050602@codemonkey.ws> <4C9778EC.9060704@redhat.com>
In-Reply-To: <4C9778EC.9060704@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org

On 09/20/2010 10:08 AM, Kevin Wolf wrote:
>> If you're comfortable with a writeback cache for metadata, then you
>> should also be comfortable with a writeback cache for data, in which
>> case cache=writeback is the answer.
>>      
> Well, there is a difference: We don't pollute the host page cache with
> guest data and we don't get a virtual "disk cache" as big as the host
> RAM, but only a very limited queue of metadata.
>
> Basically, in qemu we have three different types of caching:
>
> 1. O_DSYNC, everything is always synced without any explicit request.
>     This is cache=writethrough.
>    

I actually think O_DSYNC is the wrong implementation of 
cache=writethrough.  cache=writethrough should behave just like 
cache=none except that data goes through the page cache.

> 2. Nothing is ever synced. This is cache=unsafe.
>
> 3. We present a writeback disk cache to the guest and the guest needs
>     to explicitly flush to get its data safe on disk. This is
>     cache=writeback and cache=none.
>    

We shouldn't tie the virtual disk cache to whichever cache= option is used
on the host.  cache=none means that all requests go directly to the
disk.  cache=writeback means the host acts as a writeback cache.

If your disk is in writethrough mode, exposing cache=none as a writeback 
disk cache is not correct.

> We're still lacking modes for O_DSYNC | O_DIRECT and unsafe | O_DIRECT,
> but they are entirely possible, because it's two different dimensions.
> (And I think Christoph was planning to actually make it two independent
> options)
>    

I don't really think O_DSYNC | O_DIRECT makes much sense.
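
For reference, the two dimensions Kevin describes can be written down as a
small table.  This is only a sketch in Python, not QEMU code: the CacheMode
class and its field names are made up for illustration, and the assignments
simply follow the descriptions in this thread (writethrough as O_DSYNC, none
and writeback as "sync on guest flush", unsafe as never syncing).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CacheMode:
    """Hypothetical model of a cache= option, not a QEMU structure."""
    sync: str        # "dsync": every write synced; "flush": synced on guest flush; "never"
    o_direct: bool   # True if the host page cache is bypassed


# Assignments as described in this thread.
MODES = {
    "writethrough": CacheMode(sync="dsync", o_direct=False),
    "writeback":    CacheMode(sync="flush", o_direct=False),
    "none":         CacheMode(sync="flush", o_direct=True),
    "unsafe":       CacheMode(sync="never", o_direct=False),
}


def missing_combinations():
    """Return the (sync, o_direct) pairs that no existing mode covers."""
    covered = {(m.sync, m.o_direct) for m in MODES.values()}
    every = product(("dsync", "flush", "never"), (False, True))
    return sorted(pair for pair in every if pair not in covered)


if __name__ == "__main__":
    for sync, o_direct in missing_combinations():
        print(f"no mode for sync={sync}, O_DIRECT={o_direct}")
```

The two pairs it reports are the O_DSYNC | O_DIRECT and unsafe | O_DIRECT
combinations mentioned in the quoted text.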

>> If it's a matter of batching, batching can't occur if you have a barrier
>> between steps 3 and 5.  The only way you can get batching is by doing a
>> writeback cache for the metadata such that you can complete your request
>> before the metadata is written.
>>
>> Am I misunderstanding the idea?
>>      
> No, I think you understand it right, but maybe you were not completely
> aware that cache=none doesn't mean writethrough.
>    

No, cache=none means don't cache on the host.

In my mind, cache=none|cache=writethrough is specifically about
eliminating the host from the cache hierarchy.  This is not a
correctness issue with respect to integrity; it is about data loss.
If you have strong storage with battery-backed caches, then you can
relax flushes.  However, if you've got a cache in the host and the host
isn't battery backed, that's no longer safe to do.

So even with cache=none, if we added a writeback cache for metadata, it 
would really need to be an optional feature.  Something like 
cache=none|writethrough|metadata|writeback.
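
To make that concrete, here is a rough sketch of what an optional metadata
writeback queue could look like.  It is written in Python rather than as QEMU
code, and every name in it (MetadataQueue, write_metadata, flush, the
`enabled` switch) is invented for illustration.  It assumes only the
behaviour discussed above: with the feature off, each metadata write reaches
the backend before the request completes; with it on, writes are queued and
go out in one batch on flush.  It deliberately ignores ordering dependencies
between metadata writes, which a real implementation would have to preserve.

```python
class MetadataQueue:
    """Hypothetical sketch of an optional writeback cache for metadata."""

    def __init__(self, backend_write, enabled):
        self.backend_write = backend_write  # callable(offset, data); assumed to reach the disk
        self.enabled = enabled              # False behaves like write-through
        self.pending = {}                   # offset -> most recently queued data

    def write_metadata(self, offset, data):
        if not self.enabled:
            # Write-through: the backend sees the write before we return.
            self.backend_write(offset, data)
            return
        # Writeback: queue it; a later write to the same offset replaces it.
        self.pending[offset] = data

    def flush(self):
        """Write out everything queued, in offset order, and return the count."""
        batch = sorted(self.pending.items())
        self.pending.clear()
        for offset, data in batch:
            self.backend_write(offset, data)
        return len(batch)


if __name__ == "__main__":
    log = []
    q = MetadataQueue(lambda off, data: log.append((off, data)), enabled=True)
    q.write_metadata(8, b"l2")
    q.write_metadata(0, b"refcount")
    q.write_metadata(8, b"l2-new")
    print(q.flush(), log)
```

With enabled=True, anything still in `pending` is lost if the host goes down
before flush() runs, which is the reason for making it opt-in.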

Regards,

Anthony Liguori

> Kevin
>