From: Anthony Liguori
Date: Mon, 20 Sep 2010 19:18:14 -0500
Message-ID: <4C97F9C6.60501@codemonkey.ws>
In-Reply-To: <20100920231742.GB18512@lst.de>
Subject: [Qemu-devel] Re: Caching modes
To: Christoph Hellwig
Cc: Kevin Wolf, qemu-devel

On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>
>>>> All read and write requests SHOULD avoid any type of caching in the
>>>> host. Any write request MUST complete after the next level of storage
>>>> reports that the write request has completed. A flush from the guest
>>>> MUST complete after all pending I/O requests for the guest have been
>>>> completed.
>>>>
>>>> As an implementation detail, with the raw format, these guarantees are
>>>> only in place for preallocated images. Sparse images do not provide as
>>>> strong a guarantee.
>>>>
>>> That's not how cache=none ever worked nor works currently.
>>>
>> How does it work today compared to what I wrote above?
>>
> From the guest's point of view it works exactly as you describe
> cache=writeback. There are no ordering or cache flushing guarantees. By
> using O_DIRECT we do bypass the host file cache, but we don't even try
> on the others (the disk cache, committing the metadata transactions that
> are required to actually see the committed data for sparse, preallocated
> or growing images).

O_DIRECT alone to a preallocated file on a normal file system should
result in the data being visible without any additional metadata
transactions. The only time that isn't true is when dealing with CoW or
other special filesystem features.

> What you describe above is the equivalent of O_DSYNC|O_DIRECT, which
> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
> guarantees the semantics for sparse images. Sparse images really aren't
> special in any way - preallocation using posix_fallocate or CoW
> filesystems like btrfs, nilfs2 or zfs have exactly the same issues.
>
>>>                        | WC enable | WC disable
>>> ----------------------+-----------+-----------
>>> direct                |           |
>>> buffer                |           |
>>> buffer + ignore flush |           |
>>>
>>> currently we only have:
>>>
>>> cache=none          direct + WC enable
>>> cache=writeback     buffer + WC enable
>>> cache=writethrough  buffer + WC disable
>>> cache=unsafe        buffer + ignore flush + WC enable
>>>
>> Where does O_DSYNC fit into this chart?
>>
> O_DSYNC is used for all WC disable modes.
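So, to restate the mapping in code: the sketch below is how I read that
table (the struct and mode names are made up for illustration; this is
not qemu's actual code):

/* Sketch only: how the current cache= modes map onto open(2) flags and
 * the write cache state advertised to the guest. */
#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdbool.h>

struct cache_mode {
    int  open_flags;    /* flags OR'd into open(2) */
    bool wc_enable;     /* advertise a volatile write cache to the guest */
    bool ignore_flush;  /* complete guest flushes without doing anything */
};

static const struct cache_mode cache_none         = { O_DIRECT, true,  false };
static const struct cache_mode cache_writeback    = { 0,        true,  false };
static const struct cache_mode cache_writethrough = { O_DSYNC,  false, false };
static const struct cache_mode cache_unsafe       = { 0,        true,  true  };

/* the direct + WC disable combination missing from qemu today */
static const struct cache_mode direct_sync        = { O_DIRECT | O_DSYNC,
                                                      false, false };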
>> Do all modern filesystems implement O_DSYNC without generating
>> additional barriers per request?
>>
>> Having a barrier per-write request is ultimately not the right semantic
>> for any of the modes. However, without the use of O_DSYNC (or
>> sync_file_range(), which I know you dislike), I don't see how we can
>> have reasonable semantics without always implementing writeback caching
>> in the host.
>>
> Barriers are a Linux-specific implementation detail that is in the
> process of going away, probably in Linux 2.6.37. But if you want
> O_DSYNC semantics with a volatile disk write cache there is no way
> around using a cache flush or the FUA bit on all I/O caused by it.

If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.

> We currently use the cache flush, and although I plan to experiment a
> bit more with the FUA bit for O_DIRECT | O_DSYNC writes I would be
> very surprised if they actually are any faster.

The thing I struggle with understanding is that if the guest is sending
us a write request, why are we sending the underlying disk a write +
flush request? That doesn't seem logical at all to me. Even if we
advertise WC disable, it should be up to the guest to decide when to
issue flushes.

>> I'm certainly happy to break up the caching option. However, I still
>> don't know how we get a reasonable equivalent to cache=writethrough
>> without assuming that ext4 is mounted with barriers disabled.
>>
> There are two problems here - one is a Linux-wide problem, and that's
> the barrier primitive, which is currently the only way to flush a
> volatile disk cache. We've sorted this out for 2.6.37. The other is
> that ext3 and ext4 have really bad fsync implementations. Just use a
> better filesystem or bug one of its developers if you want that fixed.
> But except for disabling the disk cache there is no way to get data
> integrity without cache flushes (the FUA bit is nothing but an
> implicit flush).

But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated
storage or physical devices)?

Regards,

Anthony Liguori
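P.S. To make that last point concrete: with WC enabled, I'd expect the
request handlers to look roughly like the sketch below (made-up names,
error handling and short-write loops omitted):

/* WC enable: the host never flushes on its own; it only pushes data
 * down to stable storage when the guest explicitly asks for it. */
#include <unistd.h>
#include <sys/types.h>

static ssize_t handle_guest_write(int fd, const void *buf, size_t len,
                                  off_t off)
{
    /* just issue the write; no host-initiated flush per request */
    return pwrite(fd, buf, len, off);
}

static int handle_guest_flush(int fd)
{
    /* drain outstanding data to the disk only on a guest flush */
    return fdatasync(fd);
}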