* [Qemu-devel] Caching modes
From: Anthony Liguori @ 2010-09-20 16:53 UTC
  To: qemu-devel, Kevin Wolf, Christoph Hellwig

Moving to a separate thread since this has come up a few times and I 
think we need to discuss the assumptions a bit more.

This is how I understand the caching modes: how each should behave and 
what guarantees a guest gets.

cache=none

All read and write requests SHOULD avoid any type of caching in the 
host.  Any write request MUST complete after the next level of storage 
reports that the write request has completed.  A flush from the guest 
MUST complete after all pending I/O requests for the guest have been 
completed.

As an implementation detail, with the raw format, these guarantees are 
only in place for preallocated images.  Sparse images do not provide as 
strong of a guarantee.

cache=writethrough

All read and write requests MAY be cached by the host.  Read requests 
MAY be satisfied by cached data in the host.  Any write request MUST 
complete after the next level of storage reports that the write request 
has completed.  A flush from the guest MUST complete after all pending 
I/O requests for the guest have been completed.

As an implementation detail, with the raw format, these guarantees also 
apply for sparse images.  In the future, we could relax this such that 
sparse images did not provide as strong of a guarantee.

cache=writeback

All read and write requests MAY be cached by the host.  Read and write 
requests MAY be completed entirely within the cache.  A write request 
MAY complete before the next level of storage reports that the write 
request has completed.  A flush from the guest MUST complete after all 
pending I/O requests for the guest have been completed and acknowledged 
by the next level of the storage hierarchy.
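
As an illustration only (this is not qemu's block layer; blockdev,
drain_pending_writes and guest_flush are made-up names), the flush rule
shared by the modes above could be sketched in C roughly like this:

#include <stdbool.h>
#include <unistd.h>

struct blockdev {
    int  fd;         /* host image file descriptor */
    bool writeback;  /* host may still hold completed writes in a cache */
};

/* Placeholder: a real block layer would wait here for its queue of
 * in-flight AIO requests to complete. */
static void drain_pending_writes(struct blockdev *bs)
{
    (void)bs;
}

/* A guest flush may only complete after all pending writes have
 * completed and, for writeback-style caching, been pushed to the next
 * level of the storage hierarchy. */
static int guest_flush(struct blockdev *bs)
{
    drain_pending_writes(bs);
    return bs->writeback ? fdatasync(bs->fd) : 0;
}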

Guest disk cache.

For all devices that support it, the exposed cache attribute should be 
independent of the host caching mode.  Here are correct usages of the 
host caching modes:

Writethrough disk cache: cache=none|writethrough if the disk cache is 
set to writethrough, or if the disk is considered "enterprise class" 
and has a battery backup.  cache=writeback IFF the host is backed by a 
UPS.

Writeback disk cache: cache=none|writethrough if the disk cache is set 
to writeback and the disk is not enterprise class.  cache=writeback if 
the host is not backed by a UPS.

Regards,

Anthony Liguori


* Re: [Qemu-devel] Caching modes
From: Blue Swirl @ 2010-09-20 18:37 UTC
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Christoph Hellwig

On Mon, Sep 20, 2010 at 4:53 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> Moving to a separate thread since this has come up a few times and I think
> we need to discuss the assumptions a bit more.
>
> This is how I understand the caching modes should behave and what guarantees
> a guest gets.
>
> cache=none
>
> All read and write requests SHOULD avoid any type of caching in the host.
>  Any write request MUST complete after the next level of storage reports
> that the write request has completed.  A flush from the guest MUST complete
> after all pending I/O requests for the guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees are only
> in place for preallocated images.  Sparse images do not provide as strong of
> a guarantee.
>
> cache=writethrough
>
> All read and write requests MAY be cached by the host.  Read requests MAY be
> satisfied by cached data in the host.  Any write request MUST complete after
> the next level of storage reports that the write request has completed.  A
> flush from the guest MUST complete after all pending I/O requests for the
> guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees also
> apply for sparse images.  In the future, we could relax this such that
> sparse images did not provide as strong of a guarantee.
>
> cache=writeback
>
> All read and writes requests MAY be cached by the host.  Read and write
> requests may be completed entirely within the cache.  A write request MAY
> complete before the next level of storage reports that the write request has
> completed.   A flush from the guest MUST complete after all pending I/O
> requests for the guest have been completed and acknowledged by the next
> level of the storage hierarchy.

It would be nice to have an additional mode, like cache=always, where
even flushes MAY be ignored.  This would max out performance.

>
> Guest disk cache.
>
> For all devices that support it, the exposed cache attribute should be
> independent of the host caching mode.  Here are correct usages of disk
> caching mode:
>
> Writethrough disk cache; cache=none|writethrough if the disk cache is set to
> writethrough or the disk is considered "enterprise class" and has a battery
> backup.  cache=writeback IFF the host is backed by an UPS.

The "enterprise class" disks, battery backups and UPS devices are not
consumer equipment. Wouldn't this mean that any private QEMU user
would need to use cache=none?

As an example, what is the correct usage for a laptop user, considering
that there is a battery, but it can also drain, and how fast it drains
depends on flush frequency?


* Re: [Qemu-devel] Caching modes
From: Anthony Liguori @ 2010-09-20 18:51 UTC
  To: Blue Swirl; +Cc: Kevin Wolf, qemu-devel, Christoph Hellwig

On 09/20/2010 01:37 PM, Blue Swirl wrote:
>
> It would be nice to have additional mode, like cache=always, where
> even flushes MAY be ignored. This would max out the performance.
>    

That's cache=unsafe and we have it.  I ignored it for the purposes of 
this discussion.

>> Guest disk cache.
>>
>> For all devices that support it, the exposed cache attribute should be
>> independent of the host caching mode.  Here are correct usages of disk
>> caching mode:
>>
>> Writethrough disk cache; cache=none|writethrough if the disk cache is set to
>> writethrough or the disk is considered "enterprise class" and has a battery
>> backup.  cache=writeback IFF the host is backed by an UPS.
>>      
> The "enterprise class" disks, battery backups and UPS devices are not
> consumer equipment. Wouldn't this mean that any private QEMU user
> would need to use cache=none?
>    

No, cache=writethrough and cache=none should be equivalent from a data 
integrity/data loss perspective.  Using cache=writeback without 
enterprise storage is risky, but practically speaking, most consumer 
storage is not battery backed and uses writeback caching anyway, so 
there is already risk.

> As an example, what is the correct usage for laptop user, considering
> that there is a battery, but it can also drain and the drainage is
> dependent on flush frequency?
>    

Setting aside cache=unsafe, you'll never get data corruption.  The only 
consideration is how much data loss can occur since the last time there 
was a flush.  Well-behaved applications always flush important data to 
avoid losing anything important, but practically speaking, the world 
isn't full of well-behaved applications.
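
For what it's worth, the "well-behaved application" pattern is simply
write-then-flush before reporting success.  A minimal sketch
(save_record is a made-up name, error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

/* Write important data and do not report success until it is durable. */
static int save_record(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fdatasync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);   /* only now may the record be considered saved */
}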

The only difference between cache=writeback and a normal disk's 
writeback cache is that cache=writeback can be a very, very large cache 
that isn't frequently flushed.  So the amount of data loss can be much 
higher than expected.

For most laptop users, cache=none or cache=writethrough is appropriate.  
For a developer, cache=writeback probably is reasonable.

Regards,

Anthony Liguori


* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-20 19:34 UTC
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Christoph Hellwig

On Mon, Sep 20, 2010 at 11:53:02AM -0500, Anthony Liguori wrote:
> cache=none
> 
> All read and write requests SHOULD avoid any type of caching in the 
> host.  Any write request MUST complete after the next level of storage 
> reports that the write request has completed.  A flush from the guest 
> MUST complete after all pending I/O requests for the guest have been 
> completed.
> 
> As an implementation detail, with the raw format, these guarantees are 
> only in place for preallocated images.  Sparse images do not provide as 
> strong of a guarantee.

That's not how cache=none ever worked, nor how it works currently.

But discussing the current cache modes is rather moot, as they try to
map multi-dimensional behaviour differences onto a single option.  I have
some patches that I need to finish up a bit more that will give you
your no-caching-at-all mode, but I don't think mapping cache=none to it
will do anyone a favour.

With the split between the guest-visible write-cache-enable (WCE) flag
and the host-specific "use O_DIRECT" and "ignore cache flushes" flags,
we'll get the following modes:


                      | WC enable | WC disable
-----------------------------------------------
direct                |           |
buffer                |           |
buffer + ignore flush |           |

currently we only have:

 cache=none		direct + WC enable
 cache=writeback	buffer + WC enable
 cache=writethrough	buffer + WC disable
 cache=unsafe		buffer + ignore flush + WC enable

Splitting these up is important because we want to migrate between
hosts that may or may not support direct I/O without requiring
guest-visible state changes, and also because we want to use direct I/O
with guests that were installed using cache=unsafe without stopping the
guest.

It also allows the guest to change the WC enable/disable flag, which
it can do for real IDE/SCSI hardware.  And it allows Anthony's beloved
no-caching-at-all mode, which actually is useful for guests that cannot
deal with volatile write caches.
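
To make the split concrete, here is a sketch of the three knobs and how
the existing cache= options would fall out of them.  This is
illustrative only, not the actual patches; disk_config and
image_open_flags are made-up names:

#define _GNU_SOURCE     /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdbool.h>

struct disk_config {
    bool wce;           /* guest-visible write cache enable      */
    bool direct;        /* host: open the image with O_DIRECT    */
    bool ignore_flush;  /* host: drop guest cache flush requests */
};

static int image_open_flags(const struct disk_config *c)
{
    int flags = O_RDWR;
    if (c->direct)
        flags |= O_DIRECT;
    if (!c->wce)
        flags |= O_DSYNC;   /* WC disabled: every write must be stable */
    return flags;
}

/* The current options expressed in terms of these knobs: */
static const struct disk_config cache_none         = { true,  true,  false };
static const struct disk_config cache_writeback    = { true,  false, false };
static const struct disk_config cache_writethrough = { false, false, false };
static const struct disk_config cache_unsafe       = { true,  false, true  };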


* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-20 20:11 UTC
  To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel

On 09/20/2010 02:34 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 11:53:02AM -0500, Anthony Liguori wrote:
>    
>> cache=none
>>
>> All read and write requests SHOULD avoid any type of caching in the
>> host.  Any write request MUST complete after the next level of storage
>> reports that the write request has completed.  A flush from the guest
>> MUST complete after all pending I/O requests for the guest have been
>> completed.
>>
>> As an implementation detail, with the raw format, these guarantees are
>> only in place for preallocated images.  Sparse images do not provide as
>> strong of a guarantee.
>>      
> That's not how cache=none ever worked nor works currently.
>    

How does it work today compared to what I wrote above?

> But discussion the current cache modes is rather mood as they try to
> map multi-dimension behaviour difference into a single options.  I have
> some patches that I need to finish up a bit more that will give you
> your no caching enabled mode, but I don't think mapping cache=none to it
> will do anyone a favour.
>
> With the split between the guest visible write-cache-enable (WCE) flag, and
> the host-specific "use O_DIRECT" and "ignore cache flushes" flags we'll
> get the following modes:
>
>
>                        | WC enable | WC disable
> -----------------------------------------------
> direct                |           |
> buffer                |           |
> buffer + ignore flush |           |
>
> currently we only have:
>
>   cache=none		direct + WC enable
>   cache=writeback	buffer + WC enable
>   cache=writethrough	buffer + WC disable
>   cache=unsafe		buffer + ignore flush + WC enable
>    

Where does O_DSYNC fit into this chart?

Do all modern filesystems implement O_DSYNC without generating 
additional barriers per request?

Having a barrier per write request is ultimately not the right semantic 
for any of the modes.  However, without the use of O_DSYNC (or 
sync_file_range(), which I know you dislike), I don't see how we can 
have reasonable semantics without always implementing write-back caching 
in the host.

> splitting these up is important because we want to migrate between
> hosts that can support direct I/O or not without requiring guest visible
> state changes, and also because we want to use direct I/O with guest
> that were installed using cache=unsafe without stopping the guest.
>
> It also allows the guest to change the WC enable/disable flag, which
> they can do for real IDE/SCSI hardware.  And it allows Anthony's belowed
> no caching at all mode, which actually is useful for guest that can not
> deal with volatile write caches.
>    

I'm certainly happy to break up the caching option.  However, I still 
don't know how we get a reasonable equivalent to cache=writethrough 
without assuming that ext4 is mounted without barriers enabled.

Regards,

Anthony Liguori


* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-20 23:17 UTC
  To: Anthony Liguori; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel

On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
> >>All read and write requests SHOULD avoid any type of caching in the
> >>host.  Any write request MUST complete after the next level of storage
> >>reports that the write request has completed.  A flush from the guest
> >>MUST complete after all pending I/O requests for the guest have been
> >>completed.
> >>
> >>As an implementation detail, with the raw format, these guarantees are
> >>only in place for preallocated images.  Sparse images do not provide as
> >>strong of a guarantee.
> >>     
> >That's not how cache=none ever worked nor works currently.
> >   
> 
> How does it work today compared to what I wrote above?

From the guest's point of view it works exactly as you describe
cache=writeback.  There are no ordering or cache flushing guarantees.  By
using O_DIRECT we do bypass the host file cache, but we don't even try
on the others (disk cache, committing metadata transactions that are
required to actually see the committed data for sparse, preallocated or
growing images).

What you describe above is the equivalent of O_DSYNC|O_DIRECT, which
doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
guarantees the semantics for sparse images.  Sparse images really aren't
special in any way - preallocation using posix_fallocate or COW
filesystems like btrfs, nilfs2 or zfs have exactly the same issues.

> >                       | WC enable | WC disable
> >-----------------------------------------------
> >direct                |           |
> >buffer                |           |
> >buffer + ignore flush |           |
> >
> >currently we only have:
> >
> >  cache=none		direct + WC enable
> >  cache=writeback	buffer + WC enable
> >  cache=writethrough	buffer + WC disable
> >  cache=unsafe		buffer + ignore flush + WC enable
> >   
> 
> Where does O_DSYNC fit into this chart?

O_DSYNC is used for all WC disable modes.

> Do all modern filesystems implement O_DSYNC without generating 
> additional barriers per request?
> 
> Having a barrier per-write request is ultimately not the right semantic 
> for any of the modes.  However, without the use of O_DSYNC (or 
> sync_file_range(), which I know you dislike), I don't see how we can 
> have reasonable semantics without always implementing write back caching 
> in the host.

Barriers are a Linux-specific implementation detail that is in the
process of going away, probably in Linux 2.6.37.  But if you want
O_DSYNC semantics with a volatile disk write cache there is no way
around using a cache flush or the FUA bit on all I/O caused by it.  We
currently use the cache flush, and although I plan to experiment a bit
more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be very
surprised if it actually is any faster.

> I'm certainly happy to break up the caching option.  However, I still 
> don't know how we get a reasonable equivalent to cache=writethrough 
> without assuming that ext4 is mounted without barriers enabled.

There are two problems here - one is a Linux-wide problem, and that's the
barrier primitive, which is currently the only way to flush a volatile
disk cache.  We've sorted this out for 2.6.37.  The other is that
ext3 and ext4 have really bad fsync implementations.  Just use a better
filesystem or bug one of its developers if you want that fixed.  But
except for disabling the disk cache there is no way to get data integrity
without cache flushes (the FUA bit is nothing but an implicit flush).


* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-21  0:18 UTC
  To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel

On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>    
>>>> All read and write requests SHOULD avoid any type of caching in the
>>>> host.  Any write request MUST complete after the next level of storage
>>>> reports that the write request has completed.  A flush from the guest
>>>> MUST complete after all pending I/O requests for the guest have been
>>>> completed.
>>>>
>>>> As an implementation detail, with the raw format, these guarantees are
>>>> only in place for preallocated images.  Sparse images do not provide as
>>>> strong of a guarantee.
>>>>
>>>>          
>>> That's not how cache=none ever worked nor works currently.
>>>
>>>        
>> How does it work today compared to what I wrote above?
>>      
> For the guest point of view it works exactly as you describe
> cache=writeback.  There is no ordering or cache flushing guarantees.  By
> using O_DIRECT we do bypass the host file cache, but we don't even try
> on the others (disk cache, commiting metadata transaction that are
> required to actually see the commited data for sparse, preallocated or
> growing images).
>    

O_DIRECT alone to a pre-allocated file on a normal file system should 
result in the data being visible without any additional metadata 
transactions.

The only time when that isn't true is when dealing with CoW or other 
special filesystem features.

> What you describe above is the equivalent of O_DSYNC|O_DIRECT which
> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
> guarantees the semantics for sparse images.  Sparse images really aren't
> special in any way - preallocaiton using posix_fallocate or COW
> filesystems like btrfs,nilfs2 or zfs have exactly the same issues.
>
>    
>>>                        | WC enable | WC disable
>>> -----------------------------------------------
>>> direct                |           |
>>> buffer                |           |
>>> buffer + ignore flush |           |
>>>
>>> currently we only have:
>>>
>>>   cache=none		direct + WC enable
>>>   cache=writeback	buffer + WC enable
>>>   cache=writethrough	buffer + WC disable
>>>   cache=unsafe		buffer + ignore flush + WC enable
>>>
>>>        
>> Where does O_DSYNC fit into this chart?
>>      
> O_DSYNC is used for all WC disable modes.
>
>    
>> Do all modern filesystems implement O_DSYNC without generating
>> additional barriers per request?
>>
>> Having a barrier per-write request is ultimately not the right semantic
>> for any of the modes.  However, without the use of O_DSYNC (or
>> sync_file_range(), which I know you dislike), I don't see how we can
>> have reasonable semantics without always implementing write back caching
>> in the host.
>>      
> Barriers are a Linux-specific implementation details that is in the
> process of going away, probably in Linux 2.6.37.  But if you want
> O_DSYNC semantics with a volatile disk write cache there is no way
> around using a cache flush or the FUA bit on all I/O caused by it.

If you have a volatile disk write cache, then we don't need O_DSYNC 
semantics.

>    We
> currently use the cache flush, and although I plan to experiment a bit
> more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> surprised if they actually are any faster.
>    

The thing I struggle to understand is this: if the guest is sending 
us a write request, why are we sending the underlying disk a write + 
flush request?  That doesn't seem logical at all to me.

Even if we advertise WC disable, it should be up to the guest to decide 
when to issue flushes.

>> I'm certainly happy to break up the caching option.  However, I still
>> don't know how we get a reasonable equivalent to cache=writethrough
>> without assuming that ext4 is mounted without barriers enabled.
>>      
> There's two problems here - one is a Linux-wide problem and that's the
> barrier primitive which is currenly the only way to flush a volatile
> disk cache.  We've sorted this out for the 2.6.37.  The other is that
> ext3 and ext4 have really bad fsync implementations.  Just use a better
> filesystem or bug one of it's developers if you want that fixed.  But
> except for disabling the disk cache there is no way to get data integrity
> without cache flushes (the FUA bit is nothing but an implicit flush).
>    

But why are we issuing more flushes than the guest is issuing if we 
don't have to worry about filesystem metadata (i.e. preallocated storage 
or physical devices)?

Regards,

Anthony Liguori


* [Qemu-devel] Re: Caching modes
From: Kevin Wolf @ 2010-09-21  8:15 UTC
  To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel

Am 21.09.2010 02:18, schrieb Anthony Liguori:
> On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
>> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>>    
>>>>> All read and write requests SHOULD avoid any type of caching in the
>>>>> host.  Any write request MUST complete after the next level of storage
>>>>> reports that the write request has completed.  A flush from the guest
>>>>> MUST complete after all pending I/O requests for the guest have been
>>>>> completed.
>>>>>
>>>>> As an implementation detail, with the raw format, these guarantees are
>>>>> only in place for preallocated images.  Sparse images do not provide as
>>>>> strong of a guarantee.
>>>>>
>>>>>          
>>>> That's not how cache=none ever worked nor works currently.
>>>>
>>>>        
>>> How does it work today compared to what I wrote above?
>>>      
>> For the guest point of view it works exactly as you describe
>> cache=writeback.  There is no ordering or cache flushing guarantees.  By
>> using O_DIRECT we do bypass the host file cache, but we don't even try
>> on the others (disk cache, commiting metadata transaction that are
>> required to actually see the commited data for sparse, preallocated or
>> growing images).
>>    
> 
> O_DIRECT alone to a pre-allocated file on a normal file system should 
> result in the data being visible without any additional metadata 
> transactions.
> 
> The only time when that isn't true is when dealing with CoW or other 
> special filesystem features.

I think preallocated files are the exception; usually people use sparse
files.  And even with preallocation, the disk cache still remains.

>> What you describe above is the equivalent of O_DSYNC|O_DIRECT which
>> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
>> guarantees the semantics for sparse images.  Sparse images really aren't
>> special in any way - preallocaiton using posix_fallocate or COW
>> filesystems like btrfs,nilfs2 or zfs have exactly the same issues.
>>
>>    
>>>>                        | WC enable | WC disable
>>>> -----------------------------------------------
>>>> direct                |           |
>>>> buffer                |           |
>>>> buffer + ignore flush |           |
>>>>
>>>> currently we only have:
>>>>
>>>>   cache=none		direct + WC enable
>>>>   cache=writeback	buffer + WC enable
>>>>   cache=writethrough	buffer + WC disable
>>>>   cache=unsafe		buffer + ignore flush + WC enable
>>>>
>>>>        
>>> Where does O_DSYNC fit into this chart?
>>>      
>> O_DSYNC is used for all WC disable modes.
>>
>>    
>>> Do all modern filesystems implement O_DSYNC without generating
>>> additional barriers per request?
>>>
>>> Having a barrier per-write request is ultimately not the right semantic
>>> for any of the modes.  However, without the use of O_DSYNC (or
>>> sync_file_range(), which I know you dislike), I don't see how we can
>>> have reasonable semantics without always implementing write back caching
>>> in the host.
>>>      
>> Barriers are a Linux-specific implementation details that is in the
>> process of going away, probably in Linux 2.6.37.  But if you want
>> O_DSYNC semantics with a volatile disk write cache there is no way
>> around using a cache flush or the FUA bit on all I/O caused by it.
> 
> If you have a volatile disk write cache, then we don't need O_DSYNC 
> semantics.

What do the semantics of a qemu option have to do with the host disk
write cache?  We always need to provide the same semantics.  If anything,
we can take advantage of a host providing write-through or no caches so
that we don't have to issue the flushes ourselves.

>>    We
>> currently use the cache flush, and although I plan to experiment a bit
>> more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
>> surprised if they actually are any faster.
>>    
> 
> The thing I struggle with understanding is that if the guest is sending 
> us a write request, why are we sending the underlying disk a write + 
> flush request?  That doesn't seem logical at all to me.
> 
> Even if we advertise WC disable, it should be up to the guest to decide 
> when to issue flushes.

Why should a guest ever flush a cache when it's told that this cache
doesn't exist?

Kevin


* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-21 14:26 UTC
  To: Anthony Liguori; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should 
> result in the data being visible without any additional metadata 
> transactions.

Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:


       O_DIRECT (Since Linux 2.4.10)
              Try  to minimize cache effects of the I/O to and from this file.
              In general this will degrade performance, but it  is  useful  in
              special  situations,  such  as  when  applications  do their own
              caching.  File I/O is done directly to/from user space  buffers.
              The O_DIRECT flag on its own makes an effort to transfer data
              synchronously, but does not give the guarantees  of  the  O_SYNC
              that  data and necessary metadata are transferred.  To guarantee
              synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
              See NOTES below for further discussion.

              A  semantically  similar  (but  deprecated)  interface for block
              devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity, it just tells the
filesystem that it *should* not use the pagecache.  Even so, various
filesystems have fallbacks to buffered I/O for corner cases.
It does *not* mean the actual disk cache gets flushed, and it does *not*
guarantee anything about metadata, which is very important.

In practice, metadata updates happen when filling a sparse file, when
extending the file size, when using a COW filesystem, and when converting
preallocated to fully allocated extents, and they could happen in many
more cases depending on the filesystem implementation.
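
A small sketch of what that means for an image write (stable_direct_write
is a made-up helper; O_DIRECT alignment handling is omitted): the data
write alone is not enough, the allocation metadata has to be committed
too, which is what the explicit fdatasync() - or O_DSYNC on the
descriptor - is for:

#include <unistd.h>

/* Write to an (O_DIRECT) image file and only return once the data and
 * the metadata needed to find it again - e.g. newly allocated blocks in
 * a sparse file - have been committed. */
static int stable_direct_write(int fd, const void *buf, size_t len,
                               off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fdatasync(fd);   /* commits data-critical metadata and
                               flushes the volatile disk cache where
                               the filesystem issues flushes */
}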

> >Barriers are a Linux-specific implementation details that is in the
> >process of going away, probably in Linux 2.6.37.  But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
> 
> If you have a volatile disk write cache, then we don't need O_DSYNC 
> semantics.

If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYC with fdatasync and it will still be correct.  O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.

> >   We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
> >   
> 
> The thing I struggle with understanding is that if the guest is sending 
> us a write request, why are we sending the underlying disk a write + 
> flush request?  That doesn't seem logical at all to me.

We only send a cache flush request *iff* we present the guest a device
without a volatile write cache (so that it can assume all writes are
stable) while we sit on a device that does have a volatile write cache.
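
In other words (a trivial sketch, names made up):

#include <stdbool.h>

/* Flush on behalf of the guest only if the guest was told its writes
 * are always stable (no write cache exposed) while the backing device
 * does have a volatile write cache. */
static bool host_must_flush(bool guest_wce, bool host_has_volatile_wc)
{
    return !guest_wce && host_has_volatile_wc;
}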

> Even if we advertise WC disable, it should be up to the guest to decide 
> when to issue flushes.

No.  If we don't claim to have a volatile cache, no guest will ever flush
the cache.  Which is only logical, given that we just told it that we
don't have a cache that needs flushing.

> >ext3 and ext4 have really bad fsync implementations.  Just use a better
> >filesystem or bug one of it's developers if you want that fixed.  But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
> >   
> 
> But why are we issuing more flushes than the guest is issuing if we 
> don't have to worry about filesystem metadata (i.e. preallocated storage 
> or physical devices)?

Who is "we" and what is workload/filesystem/kernel combination?
Specific details and numbers please.


* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-21 15:13 UTC
  To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel

On 09/21/2010 09:26 AM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
>    
>> O_DIRECT alone to a pre-allocated file on a normal file system should
>> result in the data being visible without any additional metadata
>> transactions.
>>      
> Anthony, for the third time: no.  O_DIRECT is a non-portable extension
> in Linux (taken from IRIX) and is defined as:
>
>
>         O_DIRECT (Since Linux 2.4.10)
>                Try  to minimize cache effects of the I/O to and from this file.
>                In general this will degrade performance, but it  is  useful  in
>                special  situations,  such  as  when  applications  do their own
>                caching.  File I/O is done directly to/from user space  buffers.
>                The O_DIRECT flag on its own makes at an effort to transfer data
>                synchronously, but does not give the guarantees  of  the  O_SYNC
>                that  data and necessary metadata are transferred.  To guarantee
>                synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
>                See NOTES below for further discussion.
>
>                A  semantically  similar  (but  deprecated)  interface for block
>                devices is described in raw(8).
>
> O_DIRECT does not have any meaning for data integrity, it just tells the
> filesystem it *should* not use the pagecache.  Even if it should not
> various filesystem have fallbacks to buffered I/O for corner cases.
> It does *not* mean the actual disk cache gets flushed, and it *does*
> not guarantee anything about metadata which is very important.
>    

Yes, I understand all of this but I was trying to avoid accepting it.  
But after the call today, I'm convinced that this is fundamentally a 
filesystem problem.

I think what we need to do is:

1) make the virtual WC guest controllable.  If a guest enables WC, &= 
~O_DSYNC.  If it disables WC, |= O_DSYNC.  Obviously, we can let a user 
specify the virtual WC mode, but it has to be changeable during live 
migration (a rough sketch follows after this list).

2) only let the user choose between using and not using the host page 
cache.  IOW, direct=on|off.  cache=XXX is deprecated.

3) make O_DIRECT | O_DSYNC not suck so badly on ext4.
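
A rough sketch of (1), assuming the flag change is implemented by
reopening the image (set_write_cache and struct drive are made-up names;
on Linux, O_DSYNC cannot simply be toggled with fcntl(F_SETFL)):

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

struct drive {
    const char *path;
    int fd;
    int base_flags;     /* O_RDWR, plus O_DIRECT for direct=on */
};

/* The guest toggled its write cache enable (WCE) bit. */
static int set_write_cache(struct drive *d, bool wce)
{
    int flags = d->base_flags | (wce ? 0 : O_DSYNC);
    int new_fd;

    /* A real implementation would also drain in-flight requests before
     * reopening; on failure, keep the old descriptor. */
    new_fd = open(d->path, flags);
    if (new_fd < 0)
        return -1;
    close(d->fd);
    d->fd = new_fd;
    return 0;
}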

>>> Barriers are a Linux-specific implementation details that is in the
>>> process of going away, probably in Linux 2.6.37.  But if you want
>>> O_DSYNC semantics with a volatile disk write cache there is no way
>>> around using a cache flush or the FUA bit on all I/O caused by it.
>>>        
>> If you have a volatile disk write cache, then we don't need O_DSYNC
>> semantics.
>>      
> If you present a volatile write cache to the guest you do indeed not
> need O_DSYNC and can rely on the guest sending fdatasync calls when it
> wants to flush the cache.  But for the statement above you can replace
> O_DSYC with fdatasync and it will still be correct.  O_DSYNC in current
> Linux kernels is nothing but an implicit range fdatasync after each
> write.
>    

Yes.  I was stuck on O_DSYNC being independent of the virtual WC but 
it's clear to me now that it cannot be.

>>> ext3 and ext4 have really bad fsync implementations.  Just use a better
>>> filesystem or bug one of it's developers if you want that fixed.  But
>>> except for disabling the disk cache there is no way to get data integrity
>>> without cache flushes (the FUA bit is nothing but an implicit flush).
>>>
>>>        
>> But why are we issuing more flushes than the guest is issuing if we
>> don't have to worry about filesystem metadata (i.e. preallocated storage
>> or physical devices)?
>>      
> Who is "we" and what is workload/filesystem/kernel combination?
> Specific details and numbers please.
>    

My concern is ext4.  With a preallocated file and cache=none as 
implemented today, performance is good even when barrier=1.  If we 
enable O_DSYNC, performance will plummet.  Ultimately, this is an ext4 
problem, not a QEMU problem.

Perhaps we can issue a warning if the WC is disabled and we do an fsstat 
and see that it's ext4 with barriers enabled.

I think it's more common for a user to want to disable a virtual WC 
because they have less faith in the hypervisor than they have in the 
underlying storage.

The scenarios I am concerned about:

1) User has enterprise storage, but has an image on ext4 with 
barrier=1.  User explicitly disables WC in guest because they have 
enterprise storage but not a UPS for the hypervisor.

2) User does not have enterprise storage, but has an image on ext4 with 
barrier=1.  User explicitly disables WC in guest because they don't know 
what they're doing.

In the case of (1), the answer may be "ext4 sucks, remount with 
barrier=0" but I think we need to at least warn the user of this.

For (2), again it's probably the user doing the wrong thing because if 
they don't have enterprise storage, then they shouldn't care about a 
virtual WC.  Practically though, I've seen a lot of this with users.

Regards,

Anthony Liguori


* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-21 20:57 UTC
  To: Anthony Liguori; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel

On Tue, Sep 21, 2010 at 10:13:01AM -0500, Anthony Liguori wrote:
> 1) make virtual WC guest controllable.  If a guest enables WC, &= 
> ~O_DSYNC.  If it disables WC, |= O_DSYNC.  Obviously, we can let a user 
> specify the virtual WC mode but it has to be changable during live 
> migration.

I have patches for that which are almost ready to submit.

> 
> 2) only let the user choose between using and not using the host page 
> cache.  IOW, direct=on|off.  cache=XXX is deprecated.

Also done by that patch series.  That's exactly what I described two
mail roundtrips ago.

> My concern is ext4.  With a preallocated file and cache=none as 
> implemented today, performance is good even when barrier=1.  If we 
> enable O_DSYNC, performance will plummet.  Ultimately, this is an ext4 
> problem, not a QEMU problem.

For Linux or Windows guests WCE=0 is not a particularly good default,
given that they can deal with write caches, which mirrors the situation
with consumer SATA disks.  For older Unix guests, though, you'll need to
be able to persistently disable the write cache.

To make things more confusing, the default ATA/SATA way to tune the
volatile write cache setting is not persistent - e.g. if you disable it
using hdparm it will come up enabled again.

> 2) User does not have enterprise storage, but has an image on ext4 with 
> barrier=1.  User explicitly disables WC in guest because they don't know 
> what they're doing.
> 
> For (2), again it's probably the user doing the wrong thing because if 
> they don't have enterprise storage, then they shouldn't care about a 
> virtual WC.  Practically though, I've seen a lot of this with users.

This setting is just fine, especially if using O_DIRECT.  The guest
sends cache flush requests often enough not to make it a problem.  If
you do not use O_DIRECT in that scenario, a lot more data will be cached
in theory - but any filesystem aware of cache flushes will flush
frequently enough not to make it a problem.  It is a real problem,
however, when using ext3 in its default setting in the guest, which
doesn't use barriers.  But that's a bug in ext3, and nothing but
petitioning its maintainer to fix it will help you there.


* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-21 21:27 UTC
  To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel

On 09/21/2010 03:57 PM, Christoph Hellwig wrote:
> On Tue, Sep 21, 2010 at 10:13:01AM -0500, Anthony Liguori wrote:
>    
>> 1) make virtual WC guest controllable.  If a guest enables WC,&=
>> ~O_DSYNC.  If it disables WC, |= O_DSYNC.  Obviously, we can let a user
>> specify the virtual WC mode but it has to be changable during live
>> migration.
>>      
> I have patches for that are almost ready to submit.
>
>    
>> 2) only let the user choose between using and not using the host page
>> cache.  IOW, direct=on|off.  cache=XXX is deprecated.
>>      
> Also done by that patch series.  That's exactly what I described to mail
> roundtrips ago..
>    

Yes.

>> My concern is ext4.  With a preallocated file and cache=none as
>> implemented today, performance is good even when barrier=1.  If we
>> enable O_DSYNC, performance will plummet.  Ultimately, this is an ext4
>> problem, not a QEMU problem.
>>      
> For Linux or Windows guests WCE=0 is not a particularly good default
> given that they can deal with the write caches, and mirrors the
> situation with consumer SATA disk.  For for older Unix guests you'll
> need to be able to persistently disable the write cache.
>
> To make things more confusing the default ATA/SATA way to tune the
> volatile write cache setting is not persistent - e.g. if you disable it
> using hdparm it will come up enabled again.
>    

Yes, potentially, we could save this in a config file (and really, I 
mean libvirt could save it).

>> 2) User does not have enterprise storage, but has an image on ext4 with
>> barrier=1.  User explicitly disables WC in guest because they don't know
>> what they're doing.
>>
>> For (2), again it's probably the user doing the wrong thing because if
>> they don't have enterprise storage, then they shouldn't care about a
>> virtual WC.  Practically though, I've seen a lot of this with users.
>>      
> This setting is just fine, especially if using O_DIRECT.  The guest
> sends cache flush requests often enough to not make it a problem.  If
> you do not use O_DIRECT in that scenario which will cache a lot more
> data in theory - but any filesystem aware of cache flushes will flush
> them frequent enough to not make it a problem.  It is a real problem
> however when using ext3 in it's default setting in the guest which
> doesn't use barrier.  But that's a bug in ext3 and nothing but
> petitioning it's maintainer to fix it will help you there.
>    

It's not just ext3; it's also ext4 with barrier=0, which is what certain 
applications are being told to do in the face of poor performance.

So direct=on,wc=on + ext4 barrier=0 in the guest is less safe than ext4 
barrier=0 on bare metal.

Very specifically, if we use cache=none as implemented today, and within 
the guest we have ext4 with barrier=0 and run DB2, then DB2's guarantees 
are weaker than they are on bare metal because metadata is not getting 
flushed.

To resolve this, we need to do direct=on,wc=off + ext4 barrier=0 on the 
host.  This is safe and should perform reasonably well but there's far 
too much complexity for a user to get to this point.

Regards,

Anthony Liguori

